ABSTRACT
Missing heritability in genome-wide association studies defines a major problem in genetic analyses of complex biological traits [1,2]. The solution to this problem is to identify all causal genetic variants and to measure their individual contributions [3,4]. Here we report a graph pangenome of tomato constructed by precisely cataloguing more than 19 million variants from 838 genomes, including 32 new reference-level genome assemblies. This graph pangenome was used for genome-wide association study analyses and heritability estimation of 20,323 gene-expression and metabolite traits. The average estimated trait heritability is 0.41 compared with 0.33 when using the single linear reference genome. This 24% increase in estimated heritability is largely due to resolving incomplete linkage disequilibrium through the inclusion of additional causal structural variants identified using the graph pangenome. Moreover, by resolving allelic and locus heterogeneity, structural variants improve the power to identify genetic factors underlying agronomically important traits leading to, for example, the identification of two new genes potentially contributing to soluble solid content. The newly identified structural variants will facilitate genetic improvement of tomato through both marker-assisted selection and genomic selection. Our study advances the understanding of the heritability of complex traits and demonstrates the power of the graph pangenome in crop breeding.
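The 24% figure can be checked directly from the two average heritabilities quoted in this abstract. The sketch below only redoes that arithmetic; the function and variable names are illustrative and do not come from the study.

```python
def relative_increase(new: float, old: float) -> float:
    """Relative change of `new` over `old`, expressed as a fraction."""
    return (new - old) / old


# Average estimated heritabilities quoted in the abstract.
h2_graph_pangenome = 0.41
h2_linear_reference = 0.33

increase = relative_increase(h2_graph_pangenome, h2_linear_reference)
print(f"relative increase in estimated heritability: {increase:.1%}")
```

This is a ratio of the two reported averages, which need not equal the average of per-trait ratios.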
Subject(s)
Genetic Variation , Plant Genome , Genome-Wide Association Study , Plant Breeding , Solanum lycopersicum , Alleles , Agricultural Crops/genetics , Plant Genome/genetics , Linkage Disequilibrium , Solanum lycopersicum/genetics , Solanum lycopersicum/metabolism
ABSTRACT
Genomic selection (GS) has emerged as an effective technology to accelerate crop hybrid breeding by enabling early selection prior to phenotype collection. Genomic best linear unbiased prediction (GBLUP) is a robust method that has been routinely used in GS breeding programs. However, GBLUP assumes that markers contribute equally to the total genetic variance, which may not be the case. In this study, we developed a novel GS method called GA-GBLUP that leverages the genetic algorithm (GA) to select markers related to the target trait. We defined four fitness functions for optimization, including AIC, BIC, R2, and HAT, to improve the predictability and bin adjacent markers based on the principle of linkage disequilibrium to reduce model dimension. The results demonstrate that the GA-GBLUP model, equipped with R2 and HAT fitness function, produces much higher predictability than GBLUP for most traits in rice and maize datasets, particularly for traits with low heritability. Moreover, we have developed a user-friendly R package, GAGBLUP, for GS, and the package is freely available on CRAN (https://CRAN.R-project.org/package=GAGBLUP).
Subject(s)
Algorithms , Genomics , Genetic Selection , Zea mays , Genomics/methods , Zea mays/genetics , Oryza/genetics , Genetic Models , Plant Breeding/methods , Linkage Disequilibrium , Phenotype , Quantitative Trait Loci , Plant Genome , Single Nucleotide Polymorphism , Software
ABSTRACT
BACKGROUND: Traditional recombinant inbred lines (RILs) are generated from repeated self-fertilization or brother-sister mating from the F1 hybrid of two inbred parents. Compared with the F2 population, RILs accumulate more crossovers between loci and thus increase the number of recombinants, resulting in an increased resolution of genetic mapping. Since they are inbred to the isogenic stage, another consequence of the heterozygosity reduction is the increased genetic variance and thus the increased power of QTL detection. Self-fertilization is the primary form of developing RILs in plants. Brother-sister mating is another way to develop RILs, used in small laboratory animals. To ensure that the RILs have at least 98% homozygosity, we need about seven generations of self-fertilization or 20 generations of brother-sister mating. Prior to homozygosity, these lines are called pre-recombinant inbred lines (PRERILs). Phenotypic values of traits in PRERILs are often collected but not used in QTL mapping. To perform QTL mapping in PRERILs, we need the recombination fraction between two markers at generation t for t < 7 (selfing) or t < 20 (brother-sister mating) so that the genotypes of QTL flanked by the markers can be inferred. RESULTS: In this study, we developed formulas to calculate the recombination fractions of PRERILs at generation t in self-fertilization, brother-sister mating, and random mating. In contrast to existing work on this topic, we used computer code to construct the transition matrix to form the Markov chain of genotype array between consecutive generations, the so-called recurrent equations. CONCLUSIONS: We provide R functions to calculate the recombination fraction using the newly developed recurrent equations of ordered genotype array. With the recurrent equations and the R code, users can perform QTL mapping in PRERILs. Substantial time and effort can be saved compared with QTL mapping in RILs.
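The homozygosity thresholds mentioned in this abstract can be explored with the classical single-locus recurrences for expected heterozygosity. The sketch below is not the authors' transition-matrix method for recombination fractions; it is a textbook illustration, and it assumes that generation 0 is the fully heterozygous F1 and that the parents are fully inbred. All function names are hypothetical.

```python
def heterozygosity_selfing(t: int) -> float:
    """Expected heterozygosity after t generations of selfing an F1 (t = 0)."""
    return 0.5 ** t


def heterozygosity_sib(t: int) -> float:
    """Expected heterozygosity after t generations of brother-sister mating.

    Classical recurrence H_t = H_{t-1}/2 + H_{t-2}/4, started from fully
    inbred parents (H = 0) followed by a fully heterozygous F1 (H = 1).
    """
    prev2, prev1 = 0.0, 1.0  # inbred parents, then the F1 (t = 0)
    for _ in range(t):
        prev2, prev1 = prev1, prev1 / 2 + prev2 / 4
    return prev1


def generations_needed(het_fn, target_homozygosity: float = 0.98) -> int:
    """Smallest t at which expected homozygosity reaches the target."""
    t = 0
    while 1.0 - het_fn(t) < target_homozygosity:
        t += 1
    return t


for label, fn in (("selfing", heterozygosity_selfing),
                  ("brother-sister", heterozygosity_sib)):
    t = generations_needed(fn)
    print(f"{label}: t = {t}, homozygosity = {1.0 - fn(t):.4f}")
```

The generation counts depend on the indexing convention (for example, whether the F1 or the F2 is counted as the first generation), so small offsets from the numbers quoted in the abstract are expected.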
Subject(s)
Inbreeding , Quantitative Trait Loci , Genetic Recombination , Chromosome Mapping , Homozygote , Genetic Models , Genotype , Phenotype
ABSTRACT
Maize is one of the major crops that has demonstrated success in the utilization of heterosis. Developing high-yield hybrids is a crucial part of plant breeding to secure global food demand. In this study, we conducted a genome-wide association study (GWAS) for 10 agronomic traits using a typical breeder population comprising 442 single-cross hybrids by evaluating additive, dominance, and epistatic effects. A total of 49 significant single nucleotide polymorphisms (SNPs) and 69 significant pairs of epistasis were identified, explaining 26.2% to 64.3% of the phenotypic variation across the 10 traits. The enrichment of favorable genotypes is significantly correlated with the corresponding phenotype. In the confidence regions of the associated sites, 532 protein-coding genes were discovered. Among these genes, the Zm00001d044211 candidate gene was found to negatively regulate starch synthesis and potentially impact yield. This typical breeding population provided a valuable resource for dissecting the genetic architecture of yield-related traits. We proposed a novel mating strategy to increase the GWAS efficiency without utilizing more resources. Finally, we analyzed the enrichment of favorable alleles in the Shaan A and Shaan B groups, as well as in each inbred line. Our breeding practice led to consistent results. Not only does this study demonstrate the feasibility of GWAS in F1 hybrid populations, it also provides a valuable basis for further molecular biology and breeding research.
Subject(s)
Genome-Wide Association Study , Zea mays , Zea mays/genetics , Plant Breeding , Agriculture , Agricultural Crops
ABSTRACT
Detecting quantitative trait loci (QTL) and estimating QTL variances (represented by the squared QTL effects) are two main goals of QTL mapping and genome-wide association studies (GWAS). However, there are issues associated with estimated QTL variances and such issues have not attracted much attention from the QTL mapping community. Estimated QTL variances are usually biased upwards due to estimation being associated with significance tests. The phenomenon is called the Beavis effect. However, estimated variances of QTL without significance tests can also be biased upwards, which cannot be explained by the Beavis effect; rather, this bias is due to the fact that QTL variances are often estimated as the squares of the estimated QTL effects. The parameters are the QTL effects and the estimated QTL variances are obtained by squaring the estimated QTL effects. This square transformation failed to incorporate the errors of estimated QTL effects into the transformation. The consequence is biases in estimated QTL variances. To correct the biases, we can either reformulate the QTL model by treating the QTL effect as random and directly estimate the QTL variance (as a variance component) or adjust the bias by taking into account the error of the estimated QTL effect. A moment method of estimation has been proposed to correct the bias. The method has been validated via Monte Carlo simulation studies. The method has been applied to QTL mapping for the 10-week-body-weight trait from an F2 mouse population.
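The source of the upward bias described in this abstract is the identity E[b̂²] = b² + Var(b̂): squaring an unbiased but noisy effect estimate adds its sampling variance. The Monte Carlo sketch below illustrates that identity and a simple moment-style correction (subtracting the squared standard error). It is an illustration under assumed values (true effect 0.5, standard error 0.2, normally distributed estimates), not the authors' implementation, and the function names are hypothetical.

```python
import random


def naive_qtl_variance(effect_hat: float) -> float:
    """Squared estimated effect, biased upward by Var(effect_hat)."""
    return effect_hat ** 2


def corrected_qtl_variance(effect_hat: float, std_error: float) -> float:
    """Moment-style correction: subtract the squared standard error."""
    return effect_hat ** 2 - std_error ** 2


def simulate(true_effect: float = 0.5, std_error: float = 0.2,
             n_rep: int = 20000, seed: int = 1):
    """Average both estimators over repeated noisy effect estimates."""
    rng = random.Random(seed)
    naive_sum = corrected_sum = 0.0
    for _ in range(n_rep):
        b_hat = rng.gauss(true_effect, std_error)  # unbiased, noisy estimate
        naive_sum += naive_qtl_variance(b_hat)
        corrected_sum += corrected_qtl_variance(b_hat, std_error)
    return naive_sum / n_rep, corrected_sum / n_rep


mean_naive, mean_corrected = simulate()
print(f"true b^2 = {0.5 ** 2:.3f}")
print(f"mean naive estimate     = {mean_naive:.3f}")
print(f"mean corrected estimate = {mean_corrected:.3f}")
```

A single corrected estimate can be negative when the estimated effect is smaller than its standard error; how such cases are handled is a design choice this sketch leaves open.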
Subject(s)
Genome-Wide Association Study , Quantitative Trait Loci , Animals , Chromosome Mapping/methods , Mice , Genetic Models , Monte Carlo Method , Quantitative Trait Loci/genetics
ABSTRACT
Hybrid breeding has been shown to effectively increase rice productivity. However, identifying desirable hybrids out of numerous potential combinations is a daunting challenge. Genomic selection holds great promise for accelerating hybrid breeding by enabling early selection before phenotypes are measured. With the recent advances in multi-omic technologies, hybrid prediction based on transcriptomic and metabolomic data has received increasing attention. However, the current omic-based hybrid prediction has ignored parental phenotypic information, which is of fundamental importance in plant breeding. In this study, we integrated parental phenotypic information into various multi-omic prediction models applied in hybrid breeding of rice and compared the predictabilities of 15 combinations from four sets of predictors from the parents, that is, genome, transcriptome, metabolome and phenome. The predictability for each combination was evaluated using the best linear unbiased prediction and a modified fast HAT method. We found significant interactions between predictors and traits in predictability, but joint prediction with various combinations of the predictors significantly improved predictability relative to prediction from any single source of omic data for each trait investigated. Incorporation of parental phenotypic data into various omic predictors increased the predictability, on average by 13.6%, 54.5%, 19.9% and 8.3%, for grain yield, number of tillers per plant, number of grains per panicle and 1000 grain weight, respectively. Among nine models of incorporating parental traits, the AD-All model was the most effective one. This novel strategy of incorporating parental phenotypic data into multi-omic prediction is expected to improve hybrid breeding progress, especially with the development of high-throughput phenotyping technologies.
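The count of 15 combinations matches the number of non-empty subsets of four predictor sets (2^4 - 1). The short enumeration below makes that explicit; the labels are taken from the abstract, while the variable names are illustrative.

```python
from itertools import combinations

# The four parental predictor sets named in the abstract.
predictors = ["genome", "transcriptome", "metabolome", "phenome"]

# Every non-empty subset: singles, pairs, triples, and all four together.
predictor_sets = [
    combo
    for size in range(1, len(predictors) + 1)
    for combo in combinations(predictors, size)
]

for combo in predictor_sets:
    print(" + ".join(combo))
print(f"total combinations: {len(predictor_sets)}")
```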
Subject(s)
Oryza , Genetic Hybridization , Genetic Models , Oryza/genetics , Phenotype , Plant Breeding
ABSTRACT
MOTIVATION: Genome-wide association studies (GWAS) are still the primary steps toward gene discovery. The urgency is more obvious in the big data era when GWAS are conducted simultaneously for thousands of traits, e.g. transcriptomic and metabolomic traits. Efficient mixed model association (EMMA) and genome-wide efficient mixed model association (GEMMA) are widely used methods for GWAS. An algorithm with high computational efficiency is badly needed. It is interesting to note that the test statistics of the ordinary ridge regression (ORR) have the same patterns across the genome as those obtained from the EMMA method. However, ORR has never been used for GWAS due to its severe shrinkage on the estimated effects and the test statistics. RESULTS: We introduce a degree of freedom for each marker effect obtained from ORR and use it to deshrink both the estimated effect and the standard error so that the Wald test of ORR is brought back to the same level as that of EMMA. The new method is called deshrinking ridge regression (DRR). By evaluating the methods under three different model sizes (small, medium and large), we demonstrate that DRR is more generalized for all model sizes than EMMA, which only works for medium and large models. Furthermore, DRR detects all markers in a simultaneous manner instead of scanning one marker at a time. As a result, the computational time complexity of DRR is much simpler than that of EMMA and about m (number of genetic variants) times simpler than that of GEMMA when the sample size is much smaller than the number of markers. CONTACT: shizhong.xu@ucr.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome-Wide Association Study , Single Nucleotide Polymorphism , Algorithms , Phenotype , Sample Size
ABSTRACT
SUMMARY: We have developed a rapid mixed model algorithm for exhaustive genome-wide epistatic association analysis by controlling multiple polygenic effects. Our model can simultaneously handle additive by additive epistasis, dominance by dominance epistasis and additive by dominance epistasis, and account for intrasubject fluctuations due to individuals with repeated records. Furthermore, we suggest a simple but efficient approximate algorithm, which allows the examination of all pairwise interactions in a remarkably fast manner, with computing time linear in population size. Simulation studies are performed to investigate the properties of REMMAX. Application to publicly available yeast and human data has shown that our mixed model-based method has computational efficiency similar to that of a simple linear model. It took less than 40 h for the pairwise analysis of 5000 individuals genotyped with roughly 350 000 SNPs with five threads on Intel Xeon E5 2.6 GHz CPU. AVAILABILITY AND IMPLEMENTATION: Source codes are freely available at https://github.com/chaoning/GMAT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genetic Epistasis , Multifactorial Inheritance , Algorithms , Genome-Wide Association Study , Humans , Multifactorial Inheritance/genetics , Software
ABSTRACT
Accurately predicting the phenotypes of complex traits is crucial to enhanced breeding in plants and livestock, and to enhanced medicine in humans. Here we report the first study accurately predicting complex traits using their contributing genes, especially their number of favorable alleles (NFAs), genotypes and transcript expressions, with the grain yield of maize, Zea mays L. When the NFAs or genotypes of only 27 SNP/InDel-containing grain yield genes were used, a prediction accuracy of r = 0.52 or 0.49 was obtained. When the expressions of grain yield gene transcripts were used, a plateaued prediction accuracy of r = 0.84 was achieved. When the phenotypes predicted with two or three of the genic datasets were used for progeny selection, the selected lines were completely consistent with those selected by phenotypic selection. Therefore, the genes controlling complex traits enable accurate prediction of their phenotypes and are thus desirable for gene-based breeding in crop plants.
Subject(s)
Edible Grain/genetics , Plant Genes , Plant Breeding/methods , Zea mays/genetics , Alleles , Gene Expression , Genotype , Multifactorial Inheritance , Phenotype
ABSTRACT
Hybrid breeding is the main strategy for improving productivity in many crops, especially in rice and maize. Genomic hybrid breeding is a technology that uses whole-genome markers to predict future hybrids. Predicted superior hybrids are then field evaluated and released as new hybrid cultivars after their superior performances are confirmed. This will increase the opportunity of selecting true superior hybrids with minimum costs. Here, we used genomic best linear unbiased prediction to perform hybrid performance prediction using an existing rice population of 1495 hybrids. Replicated 10-fold cross-validations showed that the prediction abilities on ten agronomic traits ranged from 0.35 to 0.92. Using the 1495 rice hybrids as a training sample, we predicted six agronomic traits of 100 hybrids derived from half diallel crosses involving 21 parents that are different from the parents of the hybrids in the training sample. The prediction abilities were relatively high, varying from 0.54 (yield) to 0.92 (grain length). We concluded that the current population of 1495 hybrids can be used to predict hybrids from seemingly unrelated parents. Eventually, we used this training population to predict all potential hybrids of cytoplasm male sterile lines from 3000 rice varieties from the 3K Rice Genome Project. Using a breeding index combining 10 traits, we identified the top and bottom 200 predicted hybrids. SNP genotypes of the training population and parameters estimated from this training population are available for general uses and further validation in genomic hybrid prediction of all potential hybrids generated from all varieties of rice.
Subject(s)
Genetic Hybridization , Oryza/genetics , Plant Breeding , Agricultural Crops/genetics , Plant Genome , Genomics , Genetic Models , Single Nucleotide Polymorphism
ABSTRACT
MOTIVATION: Genomic scanning approaches that detect one locus at a time are subject to many problems in genome-wide association studies and quantitative trait locus mapping. The problems include large matrix inversion, over-conservativeness for tests after Bonferroni correction and difficulty in evaluation of the total genetic contribution to a trait's variance. Targeting these problems, we take a further step and investigate a multiple locus model that detects all markers simultaneously in a single model. RESULTS: We developed a sparse Bayesian learning (SBL) method for quantitative trait locus mapping and genome-wide association studies. This new method adopts a coordinate descent algorithm to estimate parameters (marker effects) by updating one parameter at a time conditional on current values of all other parameters. It uses an L2 type of penalty that allows the method to handle extremely large sample sizes (>100 000). Simulation studies show that SBL often has higher statistical powers and the simulated true loci are often detected with extremely small P-values, indicating that SBL is insensitive to stringent thresholds in significance testing. AVAILABILITY AND IMPLEMENTATION: An R package (sbl) is available on the comprehensive R archive network (CRAN) and https://github.com/MeiyueComputBio/sbl/tree/master/R%20packge. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
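The coordinate descent idea mentioned in this abstract (updating one marker effect at a time, conditional on the current values of all others) can be illustrated with plain ridge regression. The sketch below is not the SBL method, which handles its penalty differently; it assumes a single fixed L2 penalty `lam` for all markers, no intercept, and small list-of-lists inputs. All names are hypothetical.

```python
def ridge_coordinate_descent(X, y, lam=1.0, n_sweeps=100, tol=1e-10):
    """Minimise 0.5*||y - Xb||^2 + 0.5*lam*||b||^2 one coefficient at a time."""
    n, m = len(X), len(X[0])
    b = [0.0] * m
    resid = list(y)  # residual y - Xb, with b starting at zero
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(m)]

    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(m):
            # x_j' (partial residual), i.e. residual with marker j added back.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * b[j]
            new_bj = rho / (col_ss[j] + lam)
            delta = new_bj - b[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
            max_change = max(max_change, abs(delta))
        if max_change < tol:  # stop once a full sweep changes nothing
            break
    return b


# Toy example: two orthogonal "markers" and two observations.
effects = ridge_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0], lam=1.0)
print(effects)
```

Because each update touches only one column and the running residual, no matrix inversion is needed, which is the property that makes this family of algorithms attractive for large marker sets.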
Subject(s)
Genome-Wide Association Study , Genomics , Bayes Theorem , Chromosome Mapping , Genetic Models , Phenotype
ABSTRACT
MOTIVATION: Current dynamic phenotyping systems introduce time as an extra dimension to genome-wide association studies (GWAS), which helps to explore the mechanism of dynamic genetic control for complex longitudinal traits. However, existing methods for longitudinal GWAS either ignore the covariance among observations at different time points or encounter computational efficiency issues. RESULTS: We herein developed efficient genome-wide multivariate association algorithms for longitudinal data. In contrast to existing univariate linear mixed model analyses, the proposed method has improved statistical power for association detection and improved computational speed. In addition, the new method can analyze unbalanced longitudinal data with thousands of individuals and more than ten thousand records within a few hours. The corresponding time for balanced longitudinal data is just a few minutes. AVAILABILITY AND IMPLEMENTATION: A software package to implement the efficient algorithm named GMA (https://github.com/chaoning/GMA) is available freely for interested users in relevant fields. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Genome-Wide Association Study , Genome , Humans , Multivariate Analysis , Software
ABSTRACT
SUMMARY: We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. AVAILABILITY AND IMPLEMENTATION: GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome-Wide Association Study , Software , Computers
ABSTRACT
Linear mixed models (LMM) that test trait association one marker at a time have been the most popular methods for genome-wide association studies. However, this approach has potential pitfalls: over-conservativeness after Bonferroni correction, ignorance of linkage disequilibrium (LD) between neighboring markers, and power reduction due to overfitting SNP effects. Multiple locus models that can simultaneously estimate and test all markers in the genome are therefore more appropriate. Based on the multiple locus models, we proposed a bin model that combines markers into bins based on their LD relationships. A bin is treated as a new synthetic marker and we detect the associations between bins and traits. Since the number of bins can be substantially smaller than the number of markers, a penalized multiple regression method can be adopted by fitting all bins to a single model. We developed an innovative method to bin the neighboring markers and used the least absolute shrinkage and selection operator (LASSO) method. We compared BIN-Lasso with SNP-Lasso and Q + K-LMM in a simulation experiment, and showed that the new method is more powerful with less Type I error than the other two methods. We also applied the bin model to a Chinese Simmental beef cattle population for bone weight association study. The new method identified more significant associations than the classical LMM. The bin model is a new dimension reduction technique that takes advantage of biological information (i.e., LD). The new method will be a significant breakthrough in associative genomics in the big data era.
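To make the idea of LD-based binning concrete, the sketch below groups neighboring markers whenever the squared correlation between adjacent genotype vectors reaches a threshold, then summarizes each bin as the mean genotype code. The abstract does not specify the authors' binning rule, so the adjacency criterion, the threshold of 0.8, the averaging step and all names here are assumptions made purely for illustration.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def bin_adjacent_markers(genotypes, r2_threshold=0.8):
    """Group markers (in map order) into bins of consecutive high-LD markers.

    `genotypes` is a list of markers, each a list of 0/1/2 codes per individual.
    Returns a list of bins, each a list of marker indices.
    """
    if not genotypes:
        return []
    bins = [[0]]
    for j in range(1, len(genotypes)):
        if pearson_r2(genotypes[j - 1], genotypes[j]) >= r2_threshold:
            bins[-1].append(j)  # extend the current bin
        else:
            bins.append([j])    # start a new bin
    return bins


def bin_genotypes(genotypes, bins):
    """One synthetic marker per bin: the mean genotype code within the bin.

    Simplification: markers in strong but negative correlation would
    partly cancel under plain averaging.
    """
    n = len(genotypes[0])
    return [[sum(genotypes[j][i] for j in b) / len(b) for i in range(n)]
            for b in bins]


markers = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
    [2, 0, 1, 1, 0, 2],
]
bins = bin_adjacent_markers(markers)
synthetic = bin_genotypes(markers, bins)
print(bins)
```

The synthetic markers could then be passed to any penalized regression routine in place of the original SNPs, which is where the dimension reduction comes from.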
Subject(s)
Cattle/genetics , Genetic Association Studies/veterinary , Genomics/methods , Genetic Models , Animals , Computer Simulation , Genotype , Linear Models , Linkage Disequilibrium , Single Nucleotide Polymorphism
ABSTRACT
Multi-parent advanced generation inter-cross (MAGIC) populations are an emerging type of resource for dissecting the genetic structure of traits and improving breeding populations. We developed a MAGIC population for cowpea (Vigna unguiculata L. Walp.) from eight founder parents. These founders were genetically diverse and carried many abiotic and biotic stress resistance, seed quality and agronomic traits relevant to cowpea improvement in the United States and sub-Saharan Africa, where cowpea is vitally important in the human diet and local economies. The eight parents were inter-crossed using structured matings to ensure that the population would have balanced representation from each parent, followed by single-seed descent, resulting in 305 F8 recombinant inbred lines each carrying a mosaic of genome blocks contributed by all founders. This was confirmed by single nucleotide polymorphism genotyping with the Illumina Cowpea Consortium Array. These lines were on average 99.74% homozygous but also diverse in agronomic traits across environments. Quantitative trait loci (QTLs) were identified for several parental traits. Loci with major effects on photoperiod sensitivity and seed size were also verified by biparental genetic mapping. The recombination events were concentrated in telomeric regions. Due to its broad genetic base, this cowpea MAGIC population promises breakthroughs in genetic gain, QTL and gene discovery, enhancement of breeding populations and, for some lines, direct releases as new varieties.
Subject(s)
Plant Genes/genetics , Plant Breeding/methods , Quantitative Trait Loci/genetics , Vigna/genetics , Chromosome Mapping , Plant Chromosomes/genetics , Genetic Crosses , Population Genetics , Plant Genome/genetics , Genotype , Phylogeny , Single Nucleotide Polymorphism , Seeds/genetics , Species Specificity , Vigna/classification
ABSTRACT
Genomic prediction (GP) aims to construct a statistical model for predicting phenotypes using genome-wide markers and is a promising strategy for accelerating molecular plant breeding. However, current progress of phenotype prediction using genomic data alone has reached a bottleneck, and previous studies on transcriptomic and metabolomic predictions ignored genomic information. Here, we designed a novel strategy of GP called multilayered least absolute shrinkage and selection operator (MLLASSO) by integrating multiple omic data into a single model that iteratively learns three layers of genetic features (GFs) supervised by observed transcriptome and metabolome. Significantly, MLLASSO learns higher order information of gene interactions, which enables us to achieve a significant improvement of predictability of yield in rice from 0.1588 (GP alone) to 0.2451 (MLLASSO). In the prediction of the first two layers, some genes were found to be genetically predictable genes (GPGs) as their expressions were accurately predicted with genetic markers. Interestingly, we made three dramatic discoveries for the GPGs: (i) GPGs are good predictors for highly complex traits like yield; (ii) GPGs are mostly eQTL genes (cis or trans); and (iii) trait-related transcriptional factor families are enriched in GPGs. These findings support the notion that learned GFs not only are good predictors for traits but also have specific biological implications regarding regulation of gene expressions. To differentiate the new method from conventional GP models, we called MLLASSO a directed learning strategy supervised by intermediate omic data. This new prediction model appears to be more reliable and more robust than conventional GP models.
Subject(s)
Genomics/methods , Oryza/genetics , Supervised Machine Learning , Genetic Markers , Metabolome , Genetic Models , Statistical Models , Phenotype , Single Nucleotide Polymorphism , Quantitative Trait Loci , Transcriptome
ABSTRACT
Motivation: Joint reconstruction of multiple gene regulatory networks (GRNs) using gene expression data from multiple tissues/conditions is very important for understanding common and tissue/condition-specific regulation. However, there are currently no computational models and methods available for directly constructing such multiple GRNs that not only share some common hub genes but also possess tissue/condition-specific regulatory edges. Results: In this paper, we proposed a new graphic Gaussian model for joint reconstruction of multiple gene regulatory networks (JRmGRN), which highlighted hub genes, using gene expression data from several tissues/conditions. Under the framework of the Gaussian graphical model, the JRmGRN method constructs the GRNs by maximizing a penalized log likelihood function. We formulated it as a convex optimization problem, and then solved it with an alternating direction method of multipliers (ADMM) algorithm. The performance of JRmGRN was first evaluated with synthetic data and the results showed that JRmGRN outperformed several other methods for reconstruction of GRNs. We also applied our method to real Arabidopsis thaliana RNA-seq data from two light regime conditions in comparison with other methods, and both common hub genes and some condition-specific hub genes were identified with higher accuracy and precision. Availability and implementation: JRmGRN is available as an R program from: https://github.com/wenpingd. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Gene Regulatory Networks , Algorithms , Likelihood Functions , Normal Distribution , Software
ABSTRACT
Motivation: Epistasis provides a feasible way for probing potential genetic mechanism of complex traits. However, time-consuming computation challenges successful detection of interaction in practice, especially when linear mixed model (LMM) is used to control type I error in the presence of population structure and cryptic relatedness. Results: A rapid epistatic mixed-model association analysis (REMMA) method was developed to overcome computational limitation. This method first estimates individuals' epistatic effects by an extended genomic best linear unbiased prediction (EG-BLUP) model with additive and epistatic kinship matrix, then pairwise interaction effects are obtained by linear retransformations of individuals' epistatic effects. Simulation studies showed that REMMA could control type I error and increase statistical power in detecting epistatic QTNs in comparison with existing LMM-based FaST-LMM. We applied REMMA to two real datasets, a mouse dataset and the Wellcome Trust Case Control Consortium (WTCCC) data. Application to the mouse data further confirmed the performance of REMMA in controlling type I error. For the WTCCC data, we found most epistatic QTNs for type 1 diabetes (T1D) located in a major histocompatibility complex (MHC) region, from which a large interacting network with 12 hub genes (interacting with ten or more genes) was established. Availability and implementation: Our REMMA method can be freely accessed at https://github.com/chaoning/REMMA. Contact: liujf@cau.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , Genetic Epistasis , Genome-Wide Association Study/methods , Genetic Models , Single Nucleotide Polymorphism , Animals , Genomics/methods , Humans , Mice
ABSTRACT
KEY MESSAGE: This paper combined GWAS, meta-analysis and sequence homology comparison with common bean to identify regions associated with seed size variation in domesticated cowpea. Seed size is an important trait for yield and commercial value in dry-grain cowpea. Seed size varies widely among different cowpea accessions, and the genetic basis of such variation is not yet well understood. To better decipher the genetic basis of seed size, a genome-wide association study (GWAS) and meta-analysis were conducted on a panel of 368 diverse cowpea accessions from 51 countries. Four traits, including seed weight, length, width and density, were evaluated across three locations. Using 51,128 single nucleotide polymorphisms covering the cowpea genome, 17 loci were identified for these traits. One locus was common to weight, width and length, suggesting pleiotropy. By integrating synteny-based analysis with common bean, six candidate genes (Vigun05g036000, Vigun05g039600, Vigun05g204200, Vigun08g217000, Vigun11g187000, and Vigun11g191300), which are implicated in multiple functional categories related to seed size such as endosperm development, embryo development, and cell elongation, were identified. These results suggest that a combination of GWAS meta-analysis with synteny comparison in a related plant is an efficient approach to identify candidate gene(s) for complex traits in cowpea. The identified loci and candidate genes provide useful information for improving cowpea varieties and for molecular investigation of seed size.
Subject(s)
Seeds/physiology , Vigna/genetics , Chromosome Mapping , Plant Genes , Genetic Association Studies , Genotype , Phenotype , Single Nucleotide Polymorphism , Quantitative Trait Loci , Vigna/physiology
ABSTRACT
Power calculation prior to a genetic experiment can help investigators choose the optimal sample size to detect a quantitative trait locus (QTL). Without the guidance of power analysis, an experiment may be underpowered or overpowered. Either way will result in wasted resources. QTL mapping and genome-wide association studies (GWAS) are often conducted using a linear mixed model (LMM) with controls of population structure and polygenic background using markers of the whole genome. Power analysis for such a mixed model is often conducted via Monte Carlo simulations. In this study, we derived a non-centrality parameter for the Wald test statistic for association, which allows analytical power analysis. We show that large samples are not necessary to detect a biologically meaningful QTL, say one explaining 5% of the phenotypic variance. Several R functions are provided so that users can perform power analysis to determine the minimum sample size required to detect a given QTL with a certain statistical power or calculate the statistical power with given sample size and known values of other population parameters.
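Analytical power analysis of this kind can be sketched once a non-centrality parameter is available. The example below does not reproduce the mixed-model non-centrality parameter derived in the study; as a stated simplification it assumes a plain fixed-effect regression with no polygenic background, for which the non-centrality parameter of the 1-df Wald test is n·h²/(1 − h²), where h² is the proportion of phenotypic variance explained by the QTL. The function names and the default significance level are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ncp_simple(n: int, h2_qtl: float) -> float:
    """Non-centrality parameter under a simple fixed-effect model (assumption)."""
    return n * h2_qtl / (1.0 - h2_qtl)


def wald_power(ncp: float, alpha: float = 0.05) -> float:
    """Power of a 1-df Wald (chi-square) test with the given non-centrality.

    The statistic is Z^2 with Z ~ N(sqrt(ncp), 1), so power is the
    probability that |Z| exceeds the two-sided normal critical value.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - shift)
    lower = _STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


def min_sample_size(h2_qtl: float, power: float = 0.8,
                    alpha: float = 0.05, n_max: int = 1_000_000) -> int:
    """Smallest n whose power reaches the target, by direct search."""
    for n in range(1, n_max + 1):
        if wald_power(ncp_simple(n, h2_qtl), alpha) >= power:
            return n
    raise ValueError("target power not reached within n_max")


n_required = min_sample_size(h2_qtl=0.05)
print(f"n for 80% power at alpha = 0.05: {n_required}")
```

In a genome-wide scan, `alpha` would normally be replaced by a multiple-testing threshold, and a mixed-model analysis would require the study's own non-centrality parameter, so the sample sizes from this sketch are only indicative.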