ABSTRACT
KEY MESSAGE: Selection response in truncation selection across multiple sets of candidates hinges on their post-selection proportions, which can deviate grossly from their initial proportions. For BLUPs, using a uniform threshold for all candidates maximizes the selection response, irrespective of differences in population parameters. Plant breeding programs typically involve multiple families from either the same or different populations, varying in means, genetic variances and prediction accuracy of BLUPs or BLUEs for true genetic values (TGVs) of candidates. We extend the classical breeder's equation for truncation selection from single to multiple sets of genotypes, indicating that the expected overall selection response (ΔG_Tot) for TGVs depends on the selection response within individual sets and their post-selection proportions. For BLUEs, we show that maximizing ΔG_Tot requires thresholds optimally tailored for each set, contingent on their population parameters. For BLUPs, we prove that ΔG_Tot is maximized by applying a uniform threshold across all candidates from all sets. We provide explicit formulas for the origin of the selected candidates from different sets and show that their proportions before and after selection can differ substantially, especially for sets with inferior properties and low proportion. We discuss implications of these results for (a) optimum allocation of resources to training and prediction sets and (b) the need to counteract narrowing the genetic variation under genomic selection. For genomic selection of hybrids based on BLUPs of GCA of their parent lines, selecting distinct proportions in the two parent populations can be advantageous, if these differ substantially in the variance and/or prediction accuracy of GCA.
Our study sheds light on the complex interplay of selection thresholds and population parameters for the selection response in plant breeding programs, offering insights into the effective resource management and prudent application of genomic selection for improved crop development.
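The effect of a uniform BLUP threshold on post-selection proportions can be illustrated numerically. The sketch below is not the authors' derivation: it assumes normally distributed BLUPs within each set, with standard deviation equal to accuracy times genetic standard deviation, and that the expected TGV of a candidate equals its BLUP. The two sets, their parameters and the selected fraction are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def selected_fraction(t, mu, sd_blup):
    """Fraction of a set whose BLUP exceeds threshold t."""
    return 1.0 - STD.cdf((t - mu) / sd_blup)


def mean_selected(t, mu, sd_blup):
    """Mean BLUP (assumed equal to expected TGV) of candidates above t."""
    z = (t - mu) / sd_blup
    return mu + sd_blup * STD.pdf(z) / (1.0 - STD.cdf(z))


def uniform_threshold(sets, alpha):
    """Bisection for the common threshold giving overall selected fraction alpha."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        frac = sum(w * selected_fraction(mid, mu, sd) for w, mu, sd in sets)
        if frac > alpha:
            lo = mid  # too many selected: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)


def response(sets, thresholds):
    """Post-selection proportions and overall response for per-set thresholds."""
    fracs = [w * selected_fraction(t, mu, sd)
             for (w, mu, sd), t in zip(sets, thresholds)]
    total = sum(fracs)
    post = [f / total for f in fracs]
    mean_before = sum(w * mu for w, mu, _ in sets)
    mean_after = sum(p * mean_selected(t, mu, sd)
                     for p, (_, mu, sd), t in zip(post, sets, thresholds))
    return post, mean_after - mean_before


# Hypothetical sets: (initial proportion, mean, SD of BLUPs = accuracy * genetic SD)
SETS = [(0.8, 0.0, 0.8 * 1.0), (0.2, -0.5, 0.5 * 1.0)]
ALPHA = 0.10  # overall selected fraction (assumed)

# Uniform threshold across both sets
t_uniform = uniform_threshold(SETS, ALPHA)
post_uniform, dg_uniform = response(SETS, [t_uniform] * len(SETS))

# Comparison: select the same fraction ALPHA within every set
z_alpha = STD.inv_cdf(1.0 - ALPHA)
t_equal = [mu + sd * z_alpha for _, mu, sd in SETS]
post_equal, dg_equal = response(SETS, t_equal)

print("post-selection proportions (uniform threshold):", post_uniform)
print("response, uniform threshold:", dg_uniform)
print("response, equal fraction per set:", dg_equal)
```

With these assumed inputs, the inferior and smaller set contributes far less to the selected fraction than its initial 20% share, and the uniform threshold gives a larger overall response than selecting the same fraction within each set.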
Subject(s)
Plant Breeding, Genetic Selection, Humans, Plant Breeding/methods, Genotype, Plants/genetics, Genomics/methods, Genetic Models, Phenotype
ABSTRACT
KEY MESSAGE: Training sets produced by maximizing the number of parent lines, each involved in one cross, had the highest prediction accuracy for H0 hybrids, but the lowest for H1 and H2 hybrids. Genomic prediction holds great promise for hybrid breeding but optimum composition of the training set (TS) as determined by the number of parents (nTS) and crosses per parent (c) has received little attention. Our objective was to examine prediction accuracy ([Formula: see text]) of GCA for lines used as parents of the TS (I1 lines) or not (I0 lines), and H0, H1 and H2 hybrids, comprising crosses of type I0 × I0, I1 × I0 and I1 × I1, respectively, as a function of nTS and c. In the theory part, we developed estimates for [Formula: see text] of GBLUPs for hybrids: (i) [Formula: see text] based on the expected prediction accuracy, and (ii) [Formula: see text] based on [Formula: see text] of GBLUPs of GCA and SCA effects. In the simulation part, hybrid populations were generated using molecular data from two experimental maize data sets. Additive and dominance effects of QTL borrowed from literature were used to simulate six scenarios of traits differing in the proportion (τSCA = 1%, 6%, 22%) of SCA variance in σ²_G and heritability (h² = 0.4, 0.8). Values of [Formula: see text] and [Formula: see text] closely agreed with [Formula: see text] for hybrids. For given size NTS = nTS × c of TS, [Formula: see text] of H0 hybrids and GCA of I0 lines was highest for c = 1. Conversely, for GCA of I1 lines and H1 and H2 hybrids, c = 1 yielded the lowest [Formula: see text] with concordant results across all scenarios for both data sets. In view of these opposite trends, the optimum choice of c for maximizing selection response across all types of hybrids depends on the size and resources of the breeding program.
Subject(s)
Genomics, Plant Breeding, Phenotype, Plant Genome, Computer Simulation, Genetic Models
ABSTRACT
BACKGROUND: Disposable N95 respirator shortages during the COVID-19 and 2009 H1N1 influenza pandemics highlighted the need for reusable alternatives, such as elastomeric half-mask respirators (EHMRs). Two US medical organizations deployed reusable EHMRs during the COVID-19 response. In addition to wipe-based disinfection following patient care episodes expected per local policies at both organizations, postshift centralized cleaning and disinfection (C&D) was expected at one site (Site A), permitting shared-pool EHMR use, and optional at the other (Site B), where EHMRs were issued to individuals. Using a survey, we evaluated disinfection practices reported by EHMR users and predictors of disinfection behaviors and perceptions. METHODS: Surveys assessed EHMR disinfection practices, occupational characteristics, EHMR use frequency, training, and individual-issue versus shared-pool EHMR use. RESULTS: Of 1080 EHMR users completing the survey, 76% reported that they disinfect the EHMR after each patient encounter, which was the expected practice at both sites. Increasing EHMR use, recall of disinfection training, and work in intensive care or emergency settings significantly influenced higher reporting of this practice. Overall, 36% of respondents reported using centralized C&D, although reporting was higher (53%) at Site A, where this was expected. Confidence in cleanliness of the EHMR following centralized C&D was not influenced by individual versus shared-pool EHMR issue. CONCLUSIONS: Most EHMR users reported adherence to expected post-care individual-based disinfection of EHMRs but did not necessarily use standardized, centralized C&D. Future efforts to limit reliance on behavior related to respirator disinfection may improve EHMR implementation in healthcare to avert dependence on single-use, disposable N95 respirators.
Subject(s)
COVID-19, Influenza A Virus H1N1 Subtype, Respiratory Protective Devices, Humans, Disinfection, COVID-19/prevention & control, Mechanical Ventilators, Delivery of Health Care
ABSTRACT
BACKGROUND: Single-step genomic best linear unbiased prediction (GBLUP) involves a joint analysis of individuals with genotype information, and their ancestors, descendants, or contemporaries, without recorded genotypes. Livestock applications typically represent populations with fewer individuals with genotypes relative to the number not genotyped. Most breeding programmes are structured, consisting of a nucleus tier in which selection drives genetic gains that are propagated through descendants that represent parents in multiplier and commercial tiers. In some cases, the genotypes in the nucleus tier are proprietary to a breeding company, and not publicly available for a whole industry analysis. Bayesian inference involves combining a defined description of prior information with new information to generate a posterior distribution that contains all available information on parameters of interest. A natural extension of Bayesian analysis would be to use information from the posterior distribution to define the prior distribution in a subsequent analysis. METHODS: We derive the mixed model equations for inference on breeding values for non-genotyped individuals in that subset of the population that is of current interest, using only data on the performance of current individuals and their immediate pedigree, along with prior information defined from the posterior distribution of an external BLUP or single-step GBLUP analysis of the ancestors of the current population. DISCUSSION: The estimates of breeding values and their prediction error covariances obtained for current animals of interest in the multiplier or commercial tier are identical to those from a single analysis using pedigree, performance, and genomic information from all tiers, yet they require neither the genomic relationship matrix nor the genotypes of any of their ancestors in the nucleus tier.
The Bayesian analysis of the current population does not require explicit information on unselected genotyped animals in the external population.
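The principle of turning a posterior from an external analysis into the prior for a later one can be shown with a toy case. The sketch below is a scalar illustration only, assuming a normal mean with known residual variance; it is not the mixed model equations derived in the paper, and all data values and variances are hypothetical.

```python
def update(prior_mean, prior_var, data, noise_var):
    """Posterior mean and variance of a normal mean with known noise variance."""
    precision = 1.0 / prior_var + len(data) / noise_var
    mean = (prior_mean / prior_var + sum(data) / noise_var) / precision
    return mean, 1.0 / precision


# Hypothetical records: an "external" batch and a "current" batch
EXTERNAL = [1.4, 0.9, 1.8, 1.1]
CURRENT = [0.7, 1.6, 1.2]
NOISE_VAR = 0.5     # assumed known residual variance
PRIOR = (0.0, 4.0)  # assumed prior mean and variance

# Step 1: external analysis yields a posterior
ext_mean, ext_var = update(*PRIOR, EXTERNAL, NOISE_VAR)

# Step 2: that posterior serves as the prior for the current data only
seq_mean, seq_var = update(ext_mean, ext_var, CURRENT, NOISE_VAR)

# Reference: one joint analysis of all records
joint_mean, joint_var = update(*PRIOR, EXTERNAL + CURRENT, NOISE_VAR)

print("sequential:", seq_mean, seq_var)
print("joint:     ", joint_mean, joint_var)
```

In this toy setting the two-step analysis reproduces the joint analysis up to floating-point rounding, without the second step ever touching the external records.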
Subject(s)
Genome, Genomics, Animals, Bayes Theorem, Genotype, Genomics/methods, Pedigree, Genetic Models, Phenotype
ABSTRACT
BACKGROUND: Selection schemes distort inference when estimating differences between treatments or genetic associations between traits, and may degrade prediction of outcomes, e.g., the expected performance of the progeny of an individual with a certain genotype. If input and output measurements are not collected on random samples, inferences and predictions must be biased to some degree. Our paper revisits inference in quantitative genetics when using samples stemming from some selection process. The approach used integrates the classical notion of fitness with that of missing data. The treatment is fully Bayesian, with inference and prediction dealt with in a unified manner. While the focus is on animal and plant breeding, the concepts apply to natural selection as well. Examples based on real data and stylized models illustrate how selection can be accounted for in four different situations, and sometimes without success. RESULTS: Our flexible "soft selection" setting helps to diagnose the extent to which selection can be ignored. The clear connection between probability of missingness and the concept of fitness in stylized selection scenarios is highlighted. It is not realistic to assume that a fixed selection threshold t holds in conceptual replication, as the chance of selection depends on observed and unobserved data, and on unequal amounts of information over individuals, aspects that a "soft" selection representation addresses explicitly. There does not seem to be a general prescription to accommodate potential distortions due to selection. In structures that combine cross-sectional, longitudinal and multi-trait data such as in animal breeding, balance is the exception rather than the rule. The Bayesian approach provides an integrated answer to inference, prediction and model choice under selection that goes beyond the likelihood-based approach, where breeding values are inferred indirectly.
CONCLUSIONS: The approach used here for inference and prediction under selection may or may not yield the best possible answers. One may believe that selection has been accounted for diligently, but the central problem of whether statistical inferences are good or bad does not have an unambiguous solution. On the other hand, the quality of predictions can be gauged empirically via appropriate training-testing of competing methods.
Subject(s)
Genomics, Animals, Bayes Theorem, Cross-Sectional Studies, Likelihood Functions, Phenotype
ABSTRACT
BACKGROUND: Linkage disequilibrium (LD) is commonly measured based on the squared coefficient of correlation [Formula: see text] between the alleles at two loci that are carried by haplotypes. LD can also be estimated as the [Formula: see text] between unphased genotype dosage at two loci when the allele frequencies and inbreeding coefficients at both loci are identical for the parental lines. Here, we investigated whether [Formula: see text] for a crossbred population (F1) can be estimated using genotype data. The parental lines of the crossbred (F1) can be purebred or crossbred. METHODS: We approached this by first showing that inbreeding coefficients for an F1 crossbred population are negative, and typically differ in size between loci. Then, we proved that the [Formula: see text] computed from unphased genotype data is expected to be identical to the [Formula: see text] computed from haplotype data for an F1 crossbred population, regardless of the inbreeding coefficients at the two loci. Finally, we investigated the bias and precision of the [Formula: see text] estimated using unphased genotype versus haplotype data in stochastic simulation. RESULTS: Our findings show that estimates of [Formula: see text] based on haplotype and unphased genotype data are both unbiased for different combinations of allele frequencies, sample sizes (900, 1800, and 2700), and levels of LD. In general, for any allele frequency combination and [Formula: see text] value scenarios considered, and for both methods to estimate [Formula: see text], the precision of the estimates increased, and the bias of the estimates decreased as sample size increased, indicating that both estimators are consistent. For a given scenario, the [Formula: see text] estimates were more precise and less biased using haplotype data than using unphased genotype data.
As sample size increased, the difference in precision and bias between the [Formula: see text] estimates using haplotype data and unphased genotype data decreased. CONCLUSIONS: Our theoretical derivations showed that estimates of LD between loci based on unphased genotypes and haplotypes in F1 crossbreds have identical expectations. Based on our simulation results, we conclude that the LD for an F1 crossbred population can be accurately estimated from unphased genotype data. The results also apply for other crosses (F2, F3, Fn, BC1, BC2, and BCn), as long as (selected) individuals from the two parental lines mate randomly.
Subject(s)
Genetic Models, Single Nucleotide Polymorphism, Gene Frequency, Genotype, Haplotypes, Humans, Linkage Disequilibrium
ABSTRACT
Dense single nucleotide polymorphism (SNP) panels are widely used for genome-wide association studies (GWAS). In these panels, SNPs within a genomic segment tend to be highly correlated. Thus, association studies based on testing the significance of single SNPs are not very effective, and genomic-window based tests have been proposed to address this problem. However, when the SNP density on the genotype panel is not homogeneous, genomic-window based tests can lead to the detection of spurious associations by declaring effects of genomic windows that explain a large proportion of genetic variance as significant. We propose two methods to solve this problem.
Subject(s)
Genome-Wide Association Study/methods, Genotype, Genetic Models, Single Nucleotide Polymorphism, Animals, Livestock/genetics, Quantitative Trait Loci
ABSTRACT
BACKGROUND: The possibility of using antibody response (S/P ratio) to PRRSV vaccination measured in crossbred commercial gilts as a genetic indicator for reproductive performance in vaccinated crossbred sows has motivated further studies of the genomic basis of this trait. In this study, we investigated the association of haplotypes and runs of homozygosity (ROH) and heterozygosity (ROHet) with S/P ratio and their impact on reproductive performance. RESULTS: There was no association (P-value ≥ 0.18) of S/P ratio with the percentage of ROH or ROHet, or with the percentage of heterozygosity across the whole genome or in the major histocompatibility complex (MHC) region. However, specific ROH and ROHet regions were significantly associated (P-value ≤ 0.01) with S/P ratio on chromosomes 1, 4, 5, 7, 10, 11, 13, and 17 but not (P-value ≥ 0.10) with reproductive performance. With the haplotype-based genome-wide association study (GWAS), additional genomic regions associated with S/P ratio were identified on chromosomes 4, 7, and 9. These regions harbor immune-related genes, such as SLA-DOB, TAP2, TAPBP, TMIGD3, and ADORA. Four haplotypes at the identified region on chromosome 7 were also associated with multiple reproductive traits. A haplotype significantly associated with S/P ratio that is located in the MHC region may be in stronger linkage disequilibrium (LD) with the quantitative trait loci (QTL) than the previously identified single nucleotide polymorphism (SNP) (H3GA0020505) given the larger estimate of genetic variance explained by the haplotype than by the SNP. CONCLUSIONS: Specific ROH and ROHet regions were significantly associated with S/P ratio. The haplotype-based GWAS identified novel QTL for S/P ratio on chromosomes 4, 7, and 9 and confirmed the presence of at least one QTL in the MHC region. The chromosome 7 region was also associated with reproductive performance. 
These results narrow the search for causal genes in this region and suggest SLA-DOB and TAP2 as potential candidate genes associated with S/P ratio on chromosome 7. These results provide additional opportunities for marker-assisted selection and genomic selection for S/P ratio as genetic indicator for litter size in commercial pig populations.
Subject(s)
Porcine Respiratory and Reproductive Syndrome Virus, Animals, Antibody Formation, Female, Genome-Wide Association Study, Genomics, Haplotypes, Quantitative Trait Loci, Sus scrofa/genetics, Swine/genetics, Vaccination
ABSTRACT
Empirical estimates of the accuracy of estimates of breeding values (EBV) can be obtained by cross-validation. Leave-one-out cross-validation (LOOCV) is an extreme case of k-fold cross-validation. Efficient strategies for LOOCV of predictions of phenotypes have been developed for a simple model with an overall mean and random marker or animal genetic effects. The objective here was to develop and evaluate an efficient LOOCV method for prediction of breeding values and other random effects under a general mixed linear model with multiple random effects. Conventional LOOCV of EBV requires inverting an (n-1)×(n-1) covariance matrix for each of n (= number of observations) data sets. Our efficient LOOCV obtains the required inverses from the inverse of the covariance matrix for all n observations. The efficient method can be applied to complex models with multiple fixed and random effects, but requires fixed effects to be treated as random, with large variances. An alternative is to precorrect observations using estimates of fixed effects obtained from the complete data, but this can lead to biases. The efficient LOOCV method was compared to conventional LOOCV of predictions of breeding values in terms of computational demands and accuracy. For a data set with 3,205 observations and a model with multiple random and fixed effects, the efficient LOOCV method was 962 times faster than the conventional LOOCV with precorrection for fixed effects based on each training data set but resulted in identical EBV. A computationally efficient LOOCV for prediction of breeding values for single- and multiple-trait mixed models with multiple fixed and random effects was successfully developed. The method enables cross-validation of predictions of breeding values and of any linear combination of random and/or fixed effects, along with leave-one-out precorrection of validation phenotypes.
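The idea of obtaining each leave-one-out inverse from the inverse of the full covariance matrix can be sketched with the standard partitioned-inverse identity. The example below is a simplified illustration, not the published method: it assumes a zero-mean model with known variances, and the relationship matrix, phenotypes and variance components are all hypothetical.

```python
def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def drop(m, i):
    """Remove row i and column i from a square matrix."""
    return [[v for c, v in enumerate(row) if c != i]
            for r, row in enumerate(m) if r != i]


def loo_inverse(p, i):
    """Inverse of V without row/column i, computed from P = inverse(V)."""
    idx = [k for k in range(len(p)) if k != i]
    return [[p[r][c] - p[r][i] * p[i][c] / p[i][i] for c in idx] for r in idx]


def loo_ebv(g, y, var_g, v_inv_rest, i):
    """EBV of individual i predicted from all records except its own."""
    idx = [k for k in range(len(y)) if k != i]
    y_rest = [y[k] for k in idx]
    cov = [var_g * g[i][k] for k in idx]  # Cov(u_i, y without i)
    w = [sum(row[c] * y_rest[c] for c in range(len(idx))) for row in v_inv_rest]
    return sum(cv * wv for cv, wv in zip(cov, w))


# Hypothetical toy data: 5 individuals, relationships decaying with distance
N = 5
VAR_G, VAR_E = 1.0, 0.5
G = [[0.5 ** abs(i - j) for j in range(N)] for i in range(N)]
V = [[VAR_G * G[i][j] + (VAR_E if i == j else 0.0) for j in range(N)]
     for i in range(N)]
Y = [1.2, -0.4, 0.3, 2.1, -1.0]

P = inverse(V)  # inverted once for all n observations
efficient = [loo_ebv(G, Y, VAR_G, loo_inverse(P, i), i) for i in range(N)]
conventional = [loo_ebv(G, Y, VAR_G, inverse(drop(V, i)), i) for i in range(N)]

print("efficient:   ", efficient)
print("conventional:", conventional)
```

The conventional route inverts an (n-1)×(n-1) matrix n times, whereas the shortcut needs one n×n inverse plus a rank-one correction per left-out record; both give the same EBV up to rounding in this toy case.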
Subject(s)
Breeding, Genetic Models, Animals, Genotype, Linear Models, Phenotype
ABSTRACT
Modeling covariance structure based on genetic similarity between pairs of relatives plays an important role in evolutionary, quantitative and statistical genetics. Historically, genetic similarity between individuals has been quantified from pedigrees via the probability that randomly chosen homologous alleles between individuals are identical by descent (IBD). At present, however, many genetic analyses rely on molecular markers, with realized measures of genomic similarity replacing IBD-based expected similarities. Animal and plant breeders, for example, now employ marker-based genomic relationship matrices between individuals in prediction models and in estimation of genome-based heritability coefficients. Phenotypes convey information about genetic similarity as well. For instance, if phenotypic values are at least partially the result of the action of quantitative trait loci, one would expect the former to inform about the latter, as in genome-wide association studies. Statistically, a non-trivial conditional distribution of unknown genetic similarities, given phenotypes, is to be expected. A Bayesian formalism is presented here that applies to whole-genome regression methods where some genetic similarity matrix, e.g., a genomic relationship matrix, can be defined. Our Bayesian approach, based on phenotypes and markers, converts prior (markers only) expected similarity into trait-specific posterior similarity. A simulation illustrates situations under which effective Bayesian learning from phenotypes occurs. Pinus and wheat data sets were used to demonstrate applicability of the concept in practice. The methodology applies to a wide class of Bayesian linear regression models, it extends to the multiple-trait domain, and can also be used to develop phenotype-guided similarity kernels in prediction problems.
Subject(s)
Genome-Wide Association Study, Genetic Models, Quantitative Trait Loci, Bayes Theorem, Genotype, Phenotype, Pinus/genetics, Single Nucleotide Polymorphism, Triticum/genetics
ABSTRACT
BACKGROUND: Bayesian regression models are widely used in genomic prediction, where the effects of all markers are estimated simultaneously by combining the information from the phenotypic data with priors for the marker effects and other parameters such as variance components or membership probabilities. Inferences from most Bayesian regression models are based on Markov chain Monte Carlo methods, where statistics are computed from a Markov chain constructed to have a stationary distribution that is equal to the posterior distribution of the unknown parameters. In practice, chains of tens of thousands of steps are typically used in whole-genome Bayesian analyses, which is computationally intensive. METHODS: In this paper, we propose a fast parallelized algorithm for Bayesian regression models called independent intensive Bayesian regression models (BayesXII, "X" stands for Bayesian alphabet methods and "II" stands for "parallel") and show how the sampling of each marker effect can be made independent of samples for other marker effects within each step of the chain. This is done by augmenting the marker covariate matrix by adding p (the number of markers) new rows such that columns of the augmented marker covariate matrix are orthogonal. Ideally, the computations at each step of the MCMC chain can be accelerated k-fold, where k is the number of computer processors, up to a maximum of p-fold, where p is the number of markers. RESULTS: We demonstrate the BayesXII algorithm using the prior for BayesC[Formula: see text], a Bayesian variable selection regression method, which is applied to simulated data with 50,000 individuals and a medium-density marker panel ([Formula: see text] 50,000 markers). Reaching about the same accuracy as the conventional samplers for BayesC[Formula: see text] required less than 30 min using the BayesXII algorithm on 24 nodes (computers used as servers) with 24 cores on each node.
In this case, the BayesXII algorithm required one tenth of the computation time of conventional samplers for BayesC[Formula: see text]. Addressing the heavy computational burden associated with Bayesian methods by parallel computing will lead to greater use of these methods.
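The augmentation step, adding p rows so that the columns of the marker matrix become orthogonal, can be sketched as follows. This is one possible construction and not necessarily the one used in BayesXII: it assumes the new rows are taken from a Cholesky factor of dI − X'X, with d set slightly above a Gershgorin bound on the largest eigenvalue of X'X. The genotype matrix is hypothetical, and the sampling of the augmented phenotypes is not shown.

```python
def gram(x):
    """Return X'X for a matrix stored as a list of rows."""
    p = len(x[0])
    return [[sum(row[a] * row[b] for row in x) for b in range(p)]
            for a in range(p)]


def cholesky(a):
    """Lower-triangular L with L L' = a, for a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s ** 0.5 if i == j else s / low[j][j]
    return low


def augment(x):
    """Append p rows so that the columns of the stacked matrix are orthogonal."""
    xtx = gram(x)
    p = len(xtx)
    # Assumed choice of d: just above a Gershgorin bound on the largest eigenvalue
    d = 1.01 * max(sum(abs(v) for v in row) for row in xtx)
    rest = [[(d if a == b else 0.0) - xtx[a][b] for b in range(p)]
            for a in range(p)]
    low = cholesky(rest)
    x_tilde = [[low[c][r] for c in range(p)] for r in range(p)]  # transpose of L
    return x + x_tilde, d


# Hypothetical genotype codes (0/1/2) for 4 individuals and 3 markers
X = [[0.0, 1.0, 2.0],
     [1.0, 1.0, 0.0],
     [2.0, 0.0, 1.0],
     [1.0, 2.0, 2.0]]

X_AUG, D = augment(X)
GRAM_AUG = gram(X_AUG)  # equals D times the identity, up to rounding

for row in GRAM_AUG:
    print([round(v, 6) for v in row])
```

Because the augmented cross-product matrix is diagonal, the conditional update for one marker effect no longer involves the current values of the others, which is what allows the per-marker updates within a step to be distributed across processors.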
Subject(s)
Algorithms, Computer Simulation, Genome/genetics, Statistical Models, Animals
ABSTRACT
KEY MESSAGE: Covering a subset of individuals with a quantitative predictor, while imputing records for all others using pedigree or genomic data, could improve the precision of predictions while controlling for costs. Predicting genetic values with high accuracy is pivotal for effective candidate selection in animal and plant breeding. Novel 'omics'-based predictors have been shown to improve upon established genome-based predictions of important complex traits but require laborious and expensive assays. As a consequence, there are various datasets with full genetic marker coverage of all studied individuals but incomplete coverage with other 'omics' data. In animal breeding, single-step prediction was introduced to efficiently combine pedigree information, collected on a large number of animals, with genomic information, collected on a smaller subset of animals, for breeding value estimation without bias. Using two maize datasets of inbred lines and hybrids, we show that the single-step framework facilitates imputing transcriptomic data, boosting forecasts when their predictive ability exceeds that of pedigree or genomic data. Our results suggest that covering only a subset of inbred lines with 'omics' predictors and imputing all others using pedigree or genomic data could enable breeders to improve trait predictions while keeping costs under control. Employing 'omics' predictors could particularly improve candidate selection in hybrid breeding because the success of forecasts is a strongly convex function of predictive ability.
Subject(s)
Genomics/methods, Zea mays/genetics, Genotype, Genetic Hybridization, Inbreeding, Quantitative Trait Loci/genetics
ABSTRACT
Vitamin A is essential for human health, but current intake levels in many developing countries such as India are too low due to malnutrition. According to the World Health Organization, an estimated 250 million preschool children are vitamin A deficient globally. This number excludes pregnant women and nursing mothers, who are particularly vulnerable. Efforts to improve access to vitamin A are key because supplementation can reduce mortality rates in young children in developing countries by around 23%. Three key genes, BCMO1, BCO2, and SCARB1, have been shown to be associated with the amount of β-carotene (BC) in milk. Whole-genome sequencing reads from the coordinates of these 3 genes in 202 non-Indian cattle (141 Bos taurus, 61 Bos indicus) and 35 non-Indian buffalo (Bubalus bubalis) animals from several breeds were collected from data repositories. The number of SNP detected in the coding regions of these 3 genes ranged from 16 to 26 in the 3 species, with 5 overlapping SNP between B. taurus and B. indicus. All these SNP together with 2 SNP in the upstream part of the gene but already present in dbSNP (https://www.ncbi.nlm.nih.gov/projects/SNP/) were used to build a custom Sequenom array. Blood for DNA and milk samples for BC were obtained from 2,291 Indian cows of 5 different breeds (Gir, Holstein cross, Jersey cross, Tharparkar, and Sahiwal) and 2,242 Indian buffaloes (Jafarabadi, Murrah, Pandharpuri, and Surti breeds). The DNA was extracted and genotyped with the Sequenom array. For each individual breed and the combined breeds, SNP with an association that had a P-value <0.3 in the first round of linear analysis were included in a second step of regression analyses to determine allele substitution effects to increase the content of BC in milk. Additionally, an F-test for all SNP within gene was performed with the objective of determining if overall the gene had a significant effect on the content of BC in milk.
The analyses were repeated using a Bayesian approach to compare and validate the previous frequentist results. Multiple significant SNP were found using both methodologies with allele substitution effects ranging from 6.21 (3.13) to 9.10 (5.43) µg of BC per 100 mL of milk. Total gene effects exceeded the mean BC value for all breeds with both analysis approaches. The custom panel designed for genes related to BC production demonstrated applicability in genotyping of cattle and buffalo in India and may be used for cattle or buffalo from other developing countries. Moreover, the recommendation of selection for significant specific alleles of some gene markers provides a route to effectively increase the BC content in milk in the Indian cattle and buffalo populations.
Subject(s)
Buffaloes/genetics, Cattle/genetics, Genetic Markers, Milk/chemistry, beta Carotene/analysis, Alleles, Animals, Female, Genotype, India, Single Nucleotide Polymorphism, Pregnancy, Species Specificity, beta Carotene/genetics
ABSTRACT
A curious result from mixed linear models applied to genome-wide association studies was expanded. In particular, a model in which one or more markers are considered as fixed but are allowed to contribute to the covariance structure by treating such markers as random as well was examined. The best linear unbiased estimator of marker effects is invariant with respect to whether those markers are employed in constructing a genomic relationship matrix or are ignored, provided marker effects are uncorrelated with those not being tested. Also, the implications of regarding some marker effects as fixed when, in fact, these possess a non-trivial covariance structure with those declared as random were examined.
Subject(s)
Genome-Wide Association Study/statistics & numerical data, Linear Models, Genetic Models, Statistical Models, Animals, Breeding, Genome/genetics, Genomics, Single Nucleotide Polymorphism
ABSTRACT
BACKGROUND: Population stratification and cryptic relationships have been the main sources of excessive false-positives and false-negatives in population-based association studies. Many methods have been developed to model these confounding factors and minimize their impact on the results of genome-wide association studies. In most of these methods, a two-stage approach is applied where: (1) methods are used to determine if there is a population structure in the sample dataset and (2) the effects of population structure are corrected either by modeling it or by running a separate analysis within each sub-population. The objective of this study was to evaluate the impact of population structure on the accuracy and power of genome-wide association studies using a Bayesian multiple regression method. METHODS: We conducted a genome-wide association study in a stochastically simulated admixed population. The genome was composed of six chromosomes, each with 1000 markers. Fifteen segregating quantitative trait loci contributed to the genetic variation of a quantitative trait with heritability of 0.30. The impact of genetic relationships and breed composition (BC) on three analysis methods was evaluated: single marker simple regression (SMR), single marker mixed linear model (MLM) and Bayesian multiple-regression analysis (BMR). Each method was fitted with and without BC. Accuracy, power, false-positive rate and the positive predictive value of each method were calculated and used for comparison. RESULTS: SMR and BMR, both without BC, were ranked as the worst and the best performing approaches, respectively. Our results showed that, while explicit modeling of genetic relationships and BC is essential for models SMR and MLM, BMR can disregard them and yet result in a higher power without compromising its false-positive rate.
CONCLUSIONS: This study showed that the Bayesian multiple-regression analysis is robust to population structure and to relationships among study subjects and performs better than a single marker mixed linear model approach.
Subject(s)
Chromosome Mapping/veterinary, Genetic Variation, Genome-Wide Association Study/methods, Heritable Quantitative Trait, Animals, Bayes Theorem, Breeding, Population Genetics, Genome Size, Linear Models, Genetic Models, Population Density
ABSTRACT
BACKGROUND: Genomic predictions from BayesA and BayesB use training data that include animals with both phenotypes and genotypes. Single-step methodologies allow additional information from non-genotyped relatives to be included in the analysis. The single-step genomic best linear unbiased prediction (SSGBLUP) method uses a relationship matrix computed from marker and pedigree information, in which missing genotypes are imputed implicitly. Single-step Bayesian regression (SSBR) extends SSGBLUP to BayesB-like models using explicitly imputed genotypes for non-genotyped individuals. METHODS: Carcass records included 988 genotyped Hanwoo steers with 35,882 SNPs and 1438 non-genotyped steers that were measured for back-fat thickness (BFT), carcass weight (CWT), eye-muscle area, and marbling score (MAR). Single-trait pedigree-based BLUP, Bayesian methods using only genotyped individuals, SSGBLUP and SSBR methods were compared using cross-validation. RESULTS: Methods using genomic information always outperformed pedigree-based BLUP when the same phenotypic data were modeled from either genotyped individuals only or both genotyped and non-genotyped individuals. For BFT and MAR, accuracies were higher with single-step methods than with BayesB, BayesC and BayesCπ. Gains in accuracy with the single-step methods ranged from +0.06 to +0.09 for BFT and from +0.05 to +0.07 for MAR. For CWT, SSBR always outperformed the corresponding Bayesian methods that used only genotyped individuals. However, although SSGBLUP incorporated information from non-genotyped individuals, prediction accuracies were lower with SSGBLUP than with BayesC (π = 0.9999) and BayesB (π = 0.98) for CWT because, for this particular trait, there was a benefit from the mixture priors of the effects of the single nucleotide polymorphisms. CONCLUSIONS: Single-step methods are the preferred approaches for prediction combining genotyped and non-genotyped animals. 
Alternative priors allow SSBR to outperform SSGBLUP in some cases.
Subject(s)
Genome , Genomics , Genotype , Genetic Models , Heritable Quantitative Trait , Animals , Bayes Theorem , Cattle , Genetic Association Studies , Genome-Wide Association Study , Genomics/methods , Statistical Models , Phenotype , Reproducibility of Results
ABSTRACT
BACKGROUND: Traditional genomic prediction models using multiple regression on single nucleotide polymorphisms (SNPs) genotypes exploit associations between genotypes of quantitative trait loci (QTL) and SNPs, which can be created by historical linkage disequilibrium (LD), recent co-segregation (CS) and pedigree relationships. Results from field data analyses show that prediction accuracy is usually much higher for individuals that are close relatives of the training population than for distantly related individuals. A possible reason is that historical LD between QTL and SNPs is weak and, for close relatives, prediction accuracy of SNP models is mainly contributed by pedigree relationships and CS. Information from pedigree relationships decreases fast over generations and only contributes to within-family prediction. Information from CS is affected by family structures and effective population size, and can have a substantial contribution to prediction accuracy when modeled explicitly. RESULTS: In this study, a method to explicitly model CS was developed by following the transmission of putative QTL alleles using allele origins at SNPs. Bayesian hierarchical models that combine information from LD and CS (LD-CS model) were developed for genomic prediction in pedigree populations. Contributions of LD and CS information to prediction accuracy across families and generations without retraining were investigated in simulated half-sib datasets and deep pedigrees with different recent effective population sizes, respectively. Results from half-sib datasets showed that when historical LD between QTL and SNPs is low, accuracy of the LD model decreased when the training data size is increased by adding independent sire families, but accuracies from the CS and LD-CS models increased and plateaued rapidly. Results from deep pedigree datasets show that the LD model had high accuracy across generations only when historical LD between QTL and SNPs was high. 
Modeling CS explicitly resulted in higher accuracy than the LD model across generations when the mating design generated many close relatives. CONCLUSIONS: Our results suggest that modeling CS explicitly improves accuracy of genomic prediction when historical LD between QTL and SNPs is low. Modeling both LD and CS explicitly is expected to improve accuracy when recent effective population size is small, or when the training data include many independent families.
Subject(s)
Chromosome Segregation , Genomics/methods , Linkage Disequilibrium , Genetic Models , Alleles , Bayes Theorem , Computer Simulation , Forecasting , Pedigree , Single Nucleotide Polymorphism , Quantitative Trait Loci , Heritable Quantitative Trait
ABSTRACT
BACKGROUND: The mixed linear model employed for genomic best linear unbiased prediction (GBLUP) includes the breeding value for each animal as a random effect that has a mean of zero and a covariance matrix proportional to the genomic relationship matrix ([Formula: see text]), where the inverse of [Formula: see text] is required to set up the usual mixed model equations (MME). When only some animals have genomic information, genomic predictions can be obtained by an extension known as single-step GBLUP, where the covariance matrix of breeding values is constructed by combining the pedigree-based additive relationship matrix with [Formula: see text]. The inverse of the combined relationship matrix can be obtained efficiently, provided [Formula: see text] can be inverted. In some livestock species, however, the number [Formula: see text] of animals with genomic information exceeds the number of marker covariates used to compute [Formula: see text], and this results in a singular [Formula: see text]. For such a case, an efficient and exact method to obtain GBLUP and single-step GBLUP is presented here. RESULTS: Exact methods are already available to obtain GBLUP when [Formula: see text] is singular, but these require working with large dense matrices. Another approach is to modify [Formula: see text] to make it nonsingular by adding a small value to all its diagonals or regressing it towards the pedigree-based relationship matrix. This, however, results in the inverse of [Formula: see text] being dense and difficult to compute as [Formula: see text] grows. The approach presented here recognizes that the number r of linearly independent genomic breeding values cannot exceed the number of marker covariates, and the mixed linear model used here for genomic prediction only fits these r linearly independent breeding values as random effects. 
CONCLUSIONS: The exact method presented here was compared to Apy-GBLUP and to Apy single-step GBLUP, both of which are approximate methods that use a modified [Formula: see text] that has a sparse inverse which can be computed efficiently. In a small numerical example, predictions from the exact approach and Apy were almost identical, but the MME from Apy had a condition number about 1000 times larger than that from the exact approach, indicating ill-conditioning of the MME from Apy. The practical application of exact SSGBLUP is not more difficult than implementation of Apy.
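The rank argument behind the exact method can be written out as a short sketch. The symbols are assumptions introduced here, because the formulas are elided in this record: M stands for an n × m matrix of marker covariates for n genotyped animals and m markers, the genomic relationship matrix G is assumed to be proportional to MM' with a positive scaling constant c, and r is the number of linearly independent genomic breeding values mentioned in the abstract.

```latex
% Assumed notation (not taken from the elided formulas):
%   M : n x m marker covariate matrix, c > 0 : scaling constant
\mathbf{G} = \frac{1}{c}\,\mathbf{M}\mathbf{M}^{\top},
\qquad
r = \operatorname{rank}(\mathbf{G}) = \operatorname{rank}(\mathbf{M}) \le \min(n, m).
% If n > m, then r \le m < n, so the n x n matrix G is singular
% and at most m breeding values are linearly independent.
```

Under these assumptions, fitting only r random effects loses no information, which is consistent with the abstract's description of the method as exact rather than approximate.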
Subject(s)
Genomics/methods , Linear Models , Genetic Models , Animals , Computer Simulation , Genome , Livestock/genetics , Pedigree , Selective Breeding/genetics
ABSTRACT
BACKGROUND: Two types of models have been used for single-step genomic prediction and genome-wide association studies that include phenotypes from both genotyped animals and their non-genotyped relatives. The two types are breeding value models (BVM) that fit breeding values explicitly and marker effects models (MEM) that express the breeding values in terms of the effects of observed or imputed genotypes. MEM can accommodate a wider class of analyses, including variable selection or mixture model analyses. The order of the equations that need to be solved and the inverses required in their construction vary widely, and thus the computational effort required depends upon the size of the pedigree, the number of genotyped animals and the number of loci. THEORY: We present computational strategies to avoid storing large, dense blocks of the MME that involve imputed genotypes. Furthermore, we present a hybrid model that fits a MEM for animals with observed genotypes and a BVM for those without genotypes. The hybrid model is computationally attractive for pedigree files containing millions of animals with a large proportion of those being genotyped. APPLICATION: We demonstrate the practicality of both the original MEM and the hybrid model using real data with 6,179,960 animals in the pedigree with 4,934,101 phenotypes and 31,453 animals genotyped at 40,214 informative loci. Using the hybrid model, a single-trait analysis on a desktop computer with four graphics cards required about 3 h to obtain both preconditioned conjugate gradient solutions and 42,000 Markov chain Monte Carlo (MCMC) samples of breeding values, which allowed inferences to be made from posterior means, variances and covariances. The MCMC sampling required one quarter of the effort when the hybrid model was used compared to the published MEM. CONCLUSIONS: We present a hybrid model that fits a MEM for animals with genotypes and a BVM for those without genotypes. 
Its practicality and considerable reduction in computing effort were demonstrated. This model can readily be extended to accommodate multiple traits, multiple breeds, maternal effects, and additional random effects such as polygenic residual effects.
Subject(s)
Bayes Theorem , Computational Biology , Genetic Models , Regression Analysis , Algorithms , Animals , Computer Simulation
ABSTRACT
BACKGROUND: Genomic estimated breeding values (GEBV) based on single nucleotide polymorphism (SNP) genotypes are widely used in animal improvement programs. It is typically assumed that the larger the number of animals is in the training set, the higher is the prediction accuracy of GEBV. The aim of this study was to quantify genomic prediction accuracy depending on the number of ancestral generations included in the training set, and to determine the optimal number of training generations for different traits in an elite layer breeding line. METHODS: Phenotypic records for 16 traits on 17,793 birds were used. All parents and some selection candidates from nine non-overlapping generations were genotyped for 23,098 segregating SNPs. An animal model with pedigree relationships (PBLUP) and the BayesB genomic prediction model were applied to predict EBV or GEBV at each validation generation (progeny of the most recent training generation) based on varying numbers of immediately preceding ancestral generations. Prediction accuracy of EBV or GEBV was assessed as the correlation between EBV and phenotypes adjusted for fixed effects, divided by the square root of trait heritability. The optimal number of training generations that resulted in the greatest prediction accuracy of GEBV was determined for each trait. The relationship between optimal number of training generations and heritability was investigated. RESULTS: On average, accuracies were higher with the BayesB model than with PBLUP. Prediction accuracies of GEBV increased as the number of closely-related ancestral generations included in the training set increased, but reached an asymptote or slightly decreased when distant ancestral generations were used in the training set. The optimal number of training generations was 4 or more for high heritability traits but less than that for low heritability traits. 
For less heritable traits, limiting the training datasets to individuals closely related to the validation population resulted in the best predictions. CONCLUSIONS: The effect of adding distant ancestral generations in the training set on prediction accuracy differed between traits and the optimal number of necessary training generations is associated with the heritability of traits.
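The accuracy measure described in METHODS (correlation between predicted breeding values and phenotypes adjusted for fixed effects, divided by the square root of heritability) can be illustrated with a minimal sketch. The function names and the toy numbers below are hypothetical and are not taken from the study; the adjustment of phenotypes for fixed effects is assumed to have been done beforehand.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prediction_accuracy(predicted_bv, adjusted_phenotypes, heritability):
    """Correlation of (G)EBV with adjusted phenotypes, scaled by sqrt(h2).

    Hypothetical helper: phenotypes are assumed to be already
    adjusted for fixed effects, and heritability to lie in (0, 1].
    """
    if not 0 < heritability <= 1:
        raise ValueError("heritability must be in (0, 1]")
    return pearson(predicted_bv, adjusted_phenotypes) / math.sqrt(heritability)


# Toy values, invented purely for illustration.
gebv = [1.0, 2.0, 3.0, 4.0]
adjusted = [2.0, 1.0, 4.0, 3.0]
accuracy = prediction_accuracy(gebv, adjusted, heritability=0.64)
print(round(accuracy, 2))  # 0.75
```

The division by the square root of heritability rescales the correlation with phenotypes, which is limited by how well a phenotype reflects the true breeding value, so that it approximates the correlation with the true breeding value itself.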