ABSTRACT
BACKGROUND: High-density SNP arrays are now available for a wide range of crop species. Despite the development of many tools for generating genetic maps, the genome position of many SNPs from these arrays is unknown. Here we propose a linkage disequilibrium (LD)-based algorithm to allocate unassigned SNPs to chromosome regions from sparse genetic maps. This algorithm was tested on sugarcane, wheat, and barley data sets. We calculated the algorithm's efficiency by masking SNPs with known locations, then assigning their position to the map with the algorithm, and finally comparing the assigned and true positions. RESULTS: In the 20-fold cross-validation, the mean proportion of masked mapped SNPs that were placed by the algorithm to a chromosome was 89.53, 94.25, and 97.23% for sugarcane, wheat, and barley, respectively. Of the markers that were placed in the genome, 98.73, 96.45 and 98.53% of the SNPs were positioned on the correct chromosome. The mean correlations between known and new estimated SNP positions were 0.97, 0.98, and 0.97 for sugarcane, wheat, and barley. The LD-based algorithm was used to assign 5920 out of 21,251 unpositioned markers to the current Q208 sugarcane genetic map, representing the highest density genetic map for this species to date. CONCLUSIONS: Our LD-based approach can be used to accurately assign unpositioned SNPs to existing genetic maps, improving genome-wide association studies and genomic prediction in crop species with fragmented and incomplete genome assemblies. This approach will facilitate genomic-assisted breeding for many orphan crops that lack genetic and genomic resources.
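As an illustration of the general idea, the sketch below places one unassigned SNP at the map position of the mapped SNP with which it shows the highest LD, measured as r² between genotype dosages. The function names, the `min_r2` threshold, and the rule of copying the best marker's position are assumptions made for illustration; the abstract does not specify the algorithm's details.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic marker carries no LD information.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def place_snp(unmapped, mapped, min_r2=0.5):
    """Return (chromosome, position, r2) of the best-matching mapped SNP.

    `mapped` is a list of (chromosome, position, dosage_vector) tuples.
    Returns None when no mapped SNP reaches `min_r2` (hypothetical cut-off),
    in which case the SNP stays unplaced.
    """
    best = None
    for chrom, pos, geno in mapped:
        value = r_squared(unmapped, geno)
        if value >= min_r2 and (best is None or value > best[2]):
            best = (chrom, pos, value)
    return best
```

Masking SNPs with known positions and passing them through `place_snp` would mimic the cross-validation described above: the proportion of non-`None` results corresponds to the placement rate, and the returned chromosome can be compared with the true one.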
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Chromosome Mapping , Genetic Linkage , Genotype , Linkage Disequilibrium , Plant Breeding
ABSTRACT
BACKGROUND: Breeders and geneticists use statistical models to separate genetic and environmental effects on phenotype. A common way to separate these effects is to model a descriptor of an environment, a contemporary group or herd, and account for genetic relationships between animals across environments. However, separating the genetic and environmental effects in smallholder systems is challenging due to small herd sizes and weak genetic connectedness across herds. We hypothesised that accounting for spatial relationships between nearby herds can improve genetic evaluation in smallholder systems. Furthermore, geographically referenced environmental covariates are increasingly available and could model underlying sources of spatial relationships. The objective of this study was therefore to evaluate the potential of spatial modelling to improve genetic evaluation in dairy cattle smallholder systems. METHODS: We performed simulations and real dairy cattle data analysis to test our hypothesis. We modelled environmental variation by estimating herd and spatial effects. Herd effects were considered independent, whereas spatial effects had distance-based covariance between herds. We compared these models using pedigree or genomic data. RESULTS: The results show that in smallholder systems (i) standard models do not separate genetic and environmental effects accurately, (ii) spatial modelling increases the accuracy of genetic evaluation for phenotyped and non-phenotyped animals, (iii) environmental covariates do not substantially improve the accuracy of genetic evaluation beyond simple distance-based relationships between herds, (iv) the benefit of spatial modelling is largest when separating the genetic and environmental effects is challenging, and (v) spatial modelling is beneficial when using either pedigree or genomic data. CONCLUSIONS: We have demonstrated the potential of spatial modelling to improve genetic evaluation in smallholder systems. This improvement is driven by establishing environmental connectedness between herds, which enhances separation of genetic and environmental effects. We suggest routine spatial modelling in genetic evaluations, particularly for smallholder systems. Spatial modelling could also have a major impact in studies of human and wild populations.
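The contrast between independent herd effects and distance-based spatial effects can be sketched as two covariance matrices. The exponential decay function, the `range_km` parameter, and the function names below are assumptions chosen for illustration; the abstract states only that the spatial covariance between herds depends on distance.

```python
import math


def herd_covariance(n_herds, variance=1.0):
    """Independent herd effects: a diagonal covariance matrix."""
    return [[variance if i == j else 0.0 for j in range(n_herds)]
            for i in range(n_herds)]


def spatial_covariance(coords, variance=1.0, range_km=10.0):
    """Distance-based covariance: nearby herds share more environment.

    Uses an exponential decay, variance * exp(-distance / range_km),
    which is one common choice and an assumption here.
    `coords` is a list of (x_km, y_km) herd locations.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = math.dist(coords[i], coords[j])
            cov[i][j] = variance * math.exp(-distance / range_km)
    return cov
```

Under the diagonal matrix a herd with one or two cows borrows no information from its neighbours, whereas the off-diagonal terms of the spatial matrix link each small herd to those around it, which is the environmental connectedness referred to in the conclusions.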
Subject(s)
Breeding/methods , Cattle/genetics , Gene-Environment Interaction , Models, Genetic , Animals , Ecosystem
ABSTRACT
Sugarcane smut and Pachymetra root rot are two serious diseases of sugarcane, with susceptible infected crops losing over 30% of yield. A heritable component to both diseases has been demonstrated, suggesting selection could improve disease resistance. Genomic selection could accelerate gains even further, enabling early selection of resistant seedlings for breeding and clonal propagation. In this study we evaluated four types of algorithms for genomic predictions of clonal performance for disease resistance. These algorithms were: genomic best linear unbiased prediction (GBLUP), including extensions to model dominance and epistasis; Bayesian methods, including BayesC and BayesR; and machine learning methods, including random forest, multilayer perceptron (MLP), modified convolutional neural network (CNN) and attention networks designed to capture epistasis across the genome-wide markers. Simple hybrid methods, which first used BayesR/GWAS to identify a subset of 1000 markers with moderate to large marginal additive effects and then used attention networks to derive predictions from these effects and their interactions, were also developed and evaluated. The hypothesis for this approach was that using a subset of markers more likely to have an effect would enable better estimation of interaction effects than when there were an extremely large number of possible interactions, especially with our limited data set size. To evaluate the methods, we applied both random five-fold cross-validation and a structured PCA-based cross-validation that separated 4702 sugarcane clones (that had disease phenotypes and were genotyped for 26k genome-wide SNP markers) by genomic relationship. The Bayesian methods (BayesR and BayesC) gave the highest accuracy of prediction, followed closely by hybrid methods with attention networks. The hybrid methods with attention networks gave the lowest variation in accuracy of prediction across validation folds (and lowest MSE), which may be a criterion worth considering in practical breeding programs. This suggests that hybrid methods incorporating the attention mechanism could be useful for genomic prediction of clonal performance, particularly where non-additive effects may be important.
ABSTRACT
Genomic selection in sugarcane faces challenges due to limited genomic tools and high genomic complexity, particularly because of its high and variable ploidy. The classification of genotypes for single nucleotide polymorphisms (SNPs) becomes difficult due to the wide range of possible allele dosages. Previous genomic studies in sugarcane used pseudo-diploid genotyping, grouping all heterozygotes into a single class. In this study, we investigate the use of continuous genotypes as a proxy for allele-dosage in genomic prediction models. The hypothesis is that continuous genotypes could better reflect allele dosage at SNPs linked to mutations affecting target traits, resulting in phenotypic variation. The dataset included genotypes of 1318 clones at 58K SNP markers, with about 26K markers filtered using standard quality controls. Predictions for tonnes of cane per hectare (TCH), commercial cane sugar (CCS), and fiber content (Fiber) were made using parametric, non-parametric, and Bayesian methods. Continuous genotypes increased accuracy by 5%-7% for CCS and Fiber. The pseudo-diploid parametrization performed better for TCH. Reproducing kernel Hilbert spaces model with Gaussian kernel and AK4 (arc-cosine kernel with hidden layer 4) kernel outperformed other methods for TCH and CCS, suggesting that non-additive effects might influence these traits. The prevalence of low-dosage markers in the study may have limited the benefits of approximating allele-dosage information with continuous genotypes in genomic prediction models. Continuous genotypes simplify genomic prediction in polyploid crops, allowing additional markers to be used without adhering to pseudo-diploid inheritance. The approach can particularly benefit high ploidy species or emerging crops with unknown ploidy.
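The difference between the two genotype parametrizations can be illustrated with a minimal sketch. It assumes the continuous genotype is an allele-signal ratio between 0 and 1, and the cut-offs `low` and `high` are hypothetical values chosen only for the example; the abstract does not give the actual calling thresholds.

```python
def pseudo_diploid(ratio, low=0.1, high=0.9):
    """Collapse a continuous allele ratio (0..1) into three classes.

    0 = homozygous reference, 2 = homozygous alternate, and every
    intermediate ratio, whatever the dosage, becomes the single
    heterozygous class 1. The cut-offs are illustrative assumptions.
    """
    if ratio < low:
        return 0
    if ratio > high:
        return 2
    return 1


def continuous(ratio):
    """Keep the ratio itself as a proxy for allele dosage."""
    return ratio


# Two clones with clearly different dosages at the same SNP.
clone_ratios = {"clone_a": 0.25, "clone_b": 0.75}
pseudo_calls = {k: pseudo_diploid(v) for k, v in clone_ratios.items()}
continuous_calls = {k: continuous(v) for k, v in clone_ratios.items()}
```

In this toy case both clones receive the same pseudo-diploid call, while the continuous coding keeps them distinct, which is the information the prediction models may or may not be able to exploit.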
Subject(s)
Saccharum , Saccharum/genetics , Bayes Theorem , Genotype , Phenotype , Genomics
ABSTRACT
Many thousands and, in some cases, millions of individuals from the major crop and livestock species have been genotyped and phenotyped for the purpose of genomic selection. 'Ultimate genotypes', in which the marker allele haplotypes with the most favorable effects on a target trait or traits in the population are combined together in silico, can be constructed from these datasets. Ultimate genotypes display up to six times the performance of the current best individuals in the population, as demonstrated for net profit in dairy cattle (incorporating a range of economic traits), yield in wheat and 100-seed weight in chickpea. However, current breeding strategies that aim to assemble ultimate genotypes through conventional crossing take many generations. As a hypothetical thought piece, here, we contemplate three future pathways for rapidly achieving ultimate genotypes: accelerated recombination with gene editing, direct editing of whole-genome haplotype sequences and synthetic biology.
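The in silico construction of an ultimate genotype can be sketched as picking, for every genome segment, the haplotype with the most favorable estimated effect. The data layout, the function name, and the purely additive summation across segments are assumptions for illustration, not the authors' implementation.

```python
def ultimate_genotype(segment_effects):
    """Pick the best haplotype in each segment and sum their effects.

    `segment_effects` is a list with one dict per genome segment,
    mapping a haplotype identifier to its estimated effect on the trait.
    Returns the chosen haplotype per segment and the combined value,
    assuming effects add up across segments.
    """
    chosen = [max(effects, key=effects.get) for effects in segment_effects]
    value = sum(effects[hap] for effects, hap in zip(segment_effects, chosen))
    return chosen, value


def individual_value(haplotypes, segment_effects):
    """Value of an existing individual carrying one haplotype per segment."""
    return sum(effects[hap] for effects, hap in zip(segment_effects, haplotypes))
```

Comparing `ultimate_genotype` with the best `individual_value` found in the population gives the kind of performance ratio the abstract reports, since no existing individual is likely to carry the best haplotype in every segment.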
ABSTRACT
A major focus for genomic prediction has been on improving trait prediction accuracy using combinations of algorithms and the training data sets available from plant breeding multi-environment trials (METs). Any improvements in prediction accuracy are viewed as pathways to improve traits in the reference population of genotypes and product performance in the target population of environments (TPE). To realize these breeding outcomes, there must be a positive MET-TPE relationship that provides consistency between the trait variation expressed within the MET data sets that are used to train the genome-to-phenome (G2P) model for applications of genomic prediction and the realized trait and performance differences in the TPE for the genotypes that are the prediction targets. The strength of this MET-TPE relationship is usually assumed to be high; however, it is rarely quantified. To date, investigations of genomic prediction methods have focused on improving prediction accuracy within MET training data sets, with less attention to quantifying the structure of the TPE and the MET-TPE relationship and their potential impact on training the G2P model for applications of genomic prediction to accelerate breeding outcomes for the on-farm TPE. We extend the breeder's equation and use an example to demonstrate the importance of the MET-TPE relationship as a key component for the design of genomic prediction methods to realize improved rates of genetic gain for the target yield, quality, stress tolerance and yield stability traits in the on-farm TPE.
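For orientation, one standard way to write the dependence of realized gain on the MET-TPE relationship is the correlated-response form of the breeder's equation. The abstract does not give the authors' exact expression, so the notation below is an assumption: i is the selection intensity, r the prediction accuracy, σ_A the additive genetic standard deviation, L the cycle length, and r_g the genetic correlation between performance in the MET and in the TPE.

```latex
% Breeder's equation: expected genetic gain per unit time
\Delta G = \frac{i \, r \, \sigma_{A}}{L}

% Correlated-response form (assumed notation): selection is made on
% MET-trained predictions, while gain is realized in the TPE
\Delta G_{\mathrm{TPE}} =
  \frac{i \; r_{\mathrm{MET}} \; r_{g(\mathrm{MET},\,\mathrm{TPE})} \; \sigma_{A(\mathrm{TPE})}}{L}
```

In this form, an increase in accuracy within the MET translates into on-farm gain only in proportion to the genetic correlation between MET and TPE, which is why leaving that correlation unquantified can be misleading.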
ABSTRACT
Sugarcane has a complex, highly polyploid genome with multi-species ancestry. Additive models for genomic prediction of clonal performance might not capture interactions between genes and alleles from different ploidies and ancestral species. As such, genomic prediction in sugarcane presents an interesting case for machine learning (ML) methods, which are purportedly able to deal with high levels of complexity in prediction. Here, we investigated deep learning (DL) neural networks, including multilayer networks (MLP) and convolutional neural networks (CNN), and an ensemble machine learning approach, random forest (RF), for genomic prediction in sugarcane. The data set comprised 2912 sugarcane clones, scored for 26,086 genome-wide single nucleotide polymorphism markers, with final assessment trial data for total cane harvested (TCH), commercial cane sugar (CCS), and fiber content (Fiber). The clones in the latest trial (2017) were used as a validation set. We compared the prediction accuracy of these methods to that of genomic best linear unbiased prediction (GBLUP) extended to include dominance and epistatic effects. The prediction accuracies from GBLUP models were up to 0.37 for TCH, 0.43 for CCS, and 0.48 for Fiber, while the optimized ML models had prediction accuracies of 0.35 for TCH, 0.38 for CCS, and 0.48 for Fiber. Both RF and DL neural network models had predictive ability comparable to the additive GBLUP model but were less accurate than the extended GBLUP model.
Subject(s)
Saccharum , Saccharum/genetics , Plant Breeding , Genomics/methods , Machine Learning , Polyploidy
ABSTRACT
Mate-allocation strategies in breeding programs can improve progeny performance by harnessing non-additive genetic effects. These approaches prioritise predicted progeny merit over parental breeding value, making them particularly appealing for clonally propagated crops such as sugarcane. We compared mate-allocation strategies that utilise non-additive and heterozygosity effects to maximise clonal performance with schemes that consider only additive effects to optimise breeding value. Using phenotypic and genotypic data from a population of 2,909 clones evaluated in final assessment trials of Australian sugarcane breeding programs, we focused on three important traits: tonnes of cane per hectare (TCH), commercial cane sugar (CCS), and Fibre. By simulating families from all possible crosses (1,225) with 50 progenies each, we predicted the breeding and clonal values of progeny using two models: GBLUP (considering additive effects only) and extended-GBLUP (incorporating additive, non-additive, and heterozygosity effects). Integer linear programming was used to identify the optimal mate-allocation among selected parents. Compared to breeding value-based approaches, mate-allocation strategies based on clonal performance yielded substantial improvements, with predicted progeny values increasing by 57% for TCH, 12% for CCS, and 16% for Fibre. Our simulation study highlights the effectiveness of mate-allocation approaches that exploit non-additive and heterozygosity effects, resulting in superior clonal performance. However, there was a notable decline in additive gain, particularly for TCH, likely due to significant epistatic effects. When crosses were selected based on clonal performance for TCH, the inbreeding coefficient of progeny was significantly lower than under random mating, underscoring the advantages of leveraging non-additive and heterozygosity effects in mitigating inbreeding depression. Thus, mate-allocation strategies are recommended in clonally propagated crops to enhance clonal performance and reduce the negative impacts of inbreeding.
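The mate-allocation step can be viewed as a 0/1 optimisation: choose a set of crosses that maximises the summed predicted progeny value while limiting how often each parent is used. The sketch below solves a toy instance by exhaustive search, which stands in for the integer linear programming solver used in the study and is only feasible for very small problems. The function name, the `max_use` constraint, and the data layout are assumptions for illustration.

```python
from itertools import combinations


def allocate_mates(cross_values, n_crosses, max_use=1):
    """Choose `n_crosses` crosses maximising total predicted progeny value.

    `cross_values` maps a (parent_1, parent_2) tuple to the predicted
    mean value of that family. No parent may appear in more than
    `max_use` chosen crosses (a hypothetical constraint).
    Exhaustive search replaces the ILP solver for this toy example.
    """
    best_set, best_total = None, float("-inf")
    for subset in combinations(cross_values, n_crosses):
        usage = {}
        for parent_1, parent_2 in subset:
            usage[parent_1] = usage.get(parent_1, 0) + 1
            usage[parent_2] = usage.get(parent_2, 0) + 1
        if max(usage.values()) > max_use:
            continue  # violates the parent-usage constraint
        total = sum(cross_values[cross] for cross in subset)
        if total > best_total:
            best_set, best_total = subset, total
    return best_set, best_total
```

Running the same function once with breeding-value predictions and once with clonal-value predictions as `cross_values` would reproduce, in miniature, the comparison between the two selection criteria described above.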
ABSTRACT
Breeding has increased genetic gain for dairy cattle in advanced economies but has had limited success in improving dairy cattle in low- to middle-income countries (LMIC). Genetic evaluations are a central component of delivering genetic gain, because they separate the genetic and environmental effects of animals' phenotypes. Genetic evaluations have been successful in advanced economies because of large data sets and strong genetic connectedness, provided by the widespread use of artificial insemination (AI) and accurate recording of pedigree information. In smallholder dairy production systems of many LMICs, the limited use of AI and small herd sizes result in a data structure with insufficient genetic connectedness between herds to facilitate genetic evaluations based on pedigree. Genomic information keeps track of shared haplotypes rather than shared relatives captured by pedigree records. Therefore, genomic information could capture "hidden" genetic relationships that are not captured by pedigree information, strengthening genetic connectedness in LMIC smallholder dairy data sets. This study's objective was to use simulation to quantify the power of genomic information to enable genetic evaluation using LMIC smallholder dairy data sets. The results from this study show that (1) genetic evaluations using genomic information were more accurate than those using pedigree information in populations with a high effective population size and weak genetic connectedness; and (2) genetic evaluations modeling herd as a random effect had accuracy equal to or higher than those modeling herd as a fixed effect. This demonstrates the potential of genomic information to be an enabling technology in LMIC smallholder dairy production systems by facilitating genetic evaluations with in situ records collected from herds of ≤4 cows. The establishment of routine genomic evaluations could allow the development of LMIC breeding programs comprising an informal set of nucleus animals distributed across many small herds within the target environment. These nucleus animals could be used for genetic evaluation, and the best animals could be disseminated to participating smallholder dairy farms. Together, this could increase the productivity, profitability, and sustainability of LMIC smallholder dairy production systems.
ABSTRACT
Genomic prediction of complex traits across environments, breeding cycles, and populations remains a challenge for plant breeding. A potential explanation for this is that underlying non-additive genetic (GxG) and genotype-by-environment (GxE) interactions generate allele substitution effects that are non-stationary across different contexts. Such non-stationary effects of alleles are either ignored or assumed to be implicitly captured by most gene-to-phenotype (G2P) maps used in genomic prediction. The implicit capture of non-stationary effects of alleles requires the G2P map to be re-estimated across different contexts. We discuss the development and application of hierarchical G2P maps that explicitly capture non-stationary effects of alleles and have successfully increased short-term prediction accuracy in plant breeding. These hierarchical G2P maps achieve increases in prediction accuracy by allowing intermediate processes such as other traits and environmental factors and their interactions to contribute to complex trait variation. However, long-term prediction remains a challenge. The plant breeding community should undertake complementary simulation and empirical experiments to interrogate various hierarchical G2P maps that connect GxG and GxE interactions simultaneously. The existing genetic correlation framework can be used to assess the magnitude of non-stationary effects of alleles and the predictive ability of these hierarchical G2P maps in long-term, multi-context genomic predictions of complex traits in plant breeding.
ABSTRACT
Hybrid vigour has the potential to substantially increase the yield of self-pollinating crops such as wheat and rice, but future hybrid performance may depend on the initial strategy to form heterotic pools. We used in silico stochastic simulation of future hybrid performance in a self-pollinating crop to evaluate three strategies of forming heterotic pools in the founder population. The model included either 500, 2000 or 8000 quantitative trait nucleotides (QTN) across 10 chromosomes that contributed to a quantitative trait with population mean 100 and variance 10. The average degree of dominance at each QTN was either 0.2, 0.4 or 0.8 with variance 0.2. Three strategies for splitting the founder population into two heterotic pools were compared: (i) random split; (ii) split based on genetic distance according to principal component analysis of SNP genotypes; and (iii) optimized split based on F1 hybrid performance in a diallel cross among the founders. Future hybrid performance was stochastically simulated over 30 cycles of reciprocal recurrent selection based on true genetic values for additive and dominance effects. The three strategies of forming heterotic pools produced similar future hybrid performance, and superior future hybrids to a control population selected on inbred line performance when the number of quantitative trait nucleotides was ≥2000 and/or the average degree of dominance was ≥0.4.
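How the degree of dominance feeds into hybrid performance can be shown with a minimal genetic-value calculation. The sketch assumes the common parametrization in which a QTN with additive effect a contributes -a, d, or +a for genotype dosages 0, 1, and 2, with dominance deviation d equal to the degree of dominance times |a|. The function names and this parametrization are assumptions for illustration, not taken from the study's simulation software.

```python
def genetic_value(genotype, add_effects, dom_degrees):
    """Sum additive and dominance contributions over all QTN.

    `genotype` holds dosages (0, 1 or 2) at each QTN,
    `add_effects` the additive effect a of each QTN, and
    `dom_degrees` the degree of dominance at each QTN.
    """
    total = 0.0
    for dosage, a, degree in zip(genotype, add_effects, dom_degrees):
        total += a * (dosage - 1)      # -a, 0 or +a
        if dosage == 1:
            total += degree * abs(a)   # dominance deviation in heterozygotes
    return total


def f1_hybrid(inbred_a, inbred_b):
    """F1 genotype from two fully inbred lines (dosages 0 or 2 only)."""
    return [(x + y) // 2 for x, y in zip(inbred_a, inbred_b)]
```

With positive degrees of dominance, every QTN at which the two parental pools are fixed for different alleles lifts the F1 above the mid-parent value, so heterosis grows with both the dominance degree and the divergence between pools.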