Results 1 - 20 of 310
1.
Cell ; 179(3): 589-603, 2019 10 17.
Article in English | MEDLINE | ID: mdl-31607513

ABSTRACT

Genome-wide association studies (GWASs) have focused primarily on populations of European descent, but it is essential that diverse populations become better represented. Increasing diversity among study participants will advance our understanding of genetic architecture in all populations and ensure that genetic research is broadly applicable. To facilitate and promote research in multi-ancestry and admixed cohorts, we outline key methodological considerations and highlight opportunities, challenges, solutions, and areas in need of development. Despite the perception that analyzing genetic data from diverse populations is difficult, it is scientifically and ethically imperative, and there is an expanding analytical toolbox to do it well.


Subjects
Genome-Wide Association Study/methods; Genotyping Techniques/methods; Human Genetics/methods; Data Accuracy; Genetic Variation; Genetics, Population/methods; Genetics, Population/standards; Genome-Wide Association Study/standards; Genotyping Techniques/standards; Human Genetics/standards; Humans; Pedigree
2.
Trends Genet ; 37(10): 868-871, 2021 10.
Article in English | MEDLINE | ID: mdl-34183185

ABSTRACT

For identification of marker-trait associations (MTAs) for complex traits in animals and plants, thousands of genome-wide association studies (GWAS) were conducted during the past two decades. This involved regular improvement in methodology. Initially, a reference genome and SNPs were used; more recently, pan-genomes and structural variation (SV) and k-mer markers are also being used.


Subjects
Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Animals; Genome/genetics; Humans; Phenotype; Plants/genetics; Polymorphism, Single Nucleotide/genetics
3.
Genome Res ; 31(4): 529-537, 2021 04.
Article in English | MEDLINE | ID: mdl-33536225

ABSTRACT

Low-pass sequencing (sequencing a genome to an average depth less than 1× coverage) combined with genotype imputation has been proposed as an alternative to genotyping arrays for trait mapping and calculation of polygenic scores. To empirically assess the relative performance of these technologies for different applications, we performed low-pass sequencing (targeting coverage levels of 0.5× and 1×) and array genotyping (using the Illumina Global Screening Array [GSA]) on 120 DNA samples derived from African- and European-ancestry individuals that are part of the 1000 Genomes Project. We then imputed both the sequencing data and the genotyping array data to the 1000 Genomes Phase 3 haplotype reference panel using a leave-one-out design. We evaluated overall imputation accuracy from these different assays as well as overall power for GWAS from imputed data and computed polygenic risk scores for coronary artery disease and breast cancer using previously derived weights. We conclude that low-pass sequencing plus imputation, in addition to providing a substantial increase in statistical power for genome-wide association studies, provides increased accuracy for polygenic risk prediction at effective coverages of ∼0.5× and higher compared to the Illumina GSA.


Subjects
Genome-Wide Association Study; Genotype; High-Throughput Nucleotide Sequencing; Genome, Human; Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Haplotypes; Humans; Risk Factors
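
The abstract above computes polygenic risk scores from previously derived weights. As a minimal sketch (not the authors' pipeline), such a score is commonly taken to be the weighted sum of effect-allele dosages over the variants shared by the weight file and the imputed data. The variant IDs, weights, and dosages below are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over variants present in both maps."""
    shared = dosages.keys() & weights.keys()
    return sum(weights[v] * dosages[v] for v in shared)


# Hypothetical per-variant weights (e.g. log odds ratios) and imputed dosages in [0, 2].
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}
dosages = {"rs_a": 2.0, "rs_b": 1.0, "rs_d": 1.0}

# Only rs_a and rs_b appear in both maps, so only they contribute.
score = polygenic_score(dosages, weights)
```

Variants missing from either input are silently skipped here; a real analysis would likely report how many weights were dropped, since imputation coverage affects the score.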
4.
Genet Sel Evol ; 56(1): 59, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39174924

ABSTRACT

BACKGROUND: Single-nucleotide polymorphism (SNP) effects can be backsolved from ssGBLUP genomic estimated breeding values (GEBV) and used for genome-wide association studies (ssGWAS). However, obtaining p-values for those SNP effects relies on the inversion of dense matrices, which poses computational limitations in large genotyped populations. In this study, we present a method to approximate SNP p-values for ssGWAS with many genotyped animals. This method relies on the combination of a sparse approximation of the inverse of the genomic relationship matrix ($\mathbf{G}_{\mathrm{APY}}^{-1}$) built with the algorithm for proven and young (APY) and an approximation of the prediction error variance of SNP effects which does not require the inversion of the left-hand side (LHS) of the mixed model equations. To test the proposed p-value computing method, we used a reduced genotyped population of 50K genotyped animals and compared the approximated SNP p-values with benchmark p-values obtained with the direct inverse of LHS built with an exact genomic relationship matrix ($\mathbf{G}^{-1}$). Then, we applied the proposed approximation method to obtain SNP p-values for a larger genotyped population composed of 450K genotyped animals. RESULTS: The same genomic regions on chromosomes 7 and 20 were identified across all p-value computing methods when using 50K genotyped animals. In terms of computational requirements, obtaining p-values with the proposed approximation reduced the wall-clock time by 38 times and the memory requirement by ten times compared to using the exact inversion of the LHS. When the approximation was applied to a population of 450K genotyped animals, two new significant regions on chromosomes 6 and 14 were uncovered, indicating an increase in GWAS detection power when including more genotypes in the analyses. The process of obtaining p-values with the approximation and 450K genotyped individuals took 24.5 wall-clock hours and 87.66 GB of memory, which is expected to increase linearly with the addition of noncore genotyped individuals. CONCLUSIONS: With the proposed method, obtaining p-values for SNP effects in ssGWAS is computationally feasible in large genotyped populations. The computational cost of obtaining p-values in ssGWAS may no longer be a limitation in extensive populations with many genotyped animals.


Subjects
Algorithms; Genome-Wide Association Study; Polymorphism, Single Nucleotide; Animals; Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Genotype; Models, Genetic; Breeding/methods; Genetic Markers
5.
PLoS Genet ; 17(12): e1009944, 2021 12.
Article in English | MEDLINE | ID: mdl-34941872

ABSTRACT

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing genotyping technologies when resources are limited. In this work, we are proposing a new imputation pipeline ("HBimpute") that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputing error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par or even slightly better than results obtained with high-density array data (600k). In particular for genomic prediction, we observe slightly higher data quality for the sequence data compared to the 600k array in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.


Subjects
Genome-Wide Association Study/standards; Genotyping Techniques; Haplotypes/genetics; Software; DNA Copy Number Variations/genetics; Genome/genetics; Genomics/methods; Genotype; Polymorphism, Single Nucleotide/genetics; Whole Genome Sequencing; Zea mays/genetics
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34459489

ABSTRACT

In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene's test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted to more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive $P$-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably our workflow is based on summary statistics.


Subjects
Algorithms; Computational Biology/methods; Genetic Predisposition to Disease; Genetic Testing/standards; Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Alleles; Chromosome Mapping; Databases, Genetic; Genetic Testing/methods; Humans; Linkage Disequilibrium; Phenotype; Polymorphism, Single Nucleotide; Quantitative Trait Loci
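
The abstract above describes testing a gene on its own when its statistic is independent of its neighbours, and pooling nearby genes into a locus when their statistics are strongly correlated. The following is a deliberately simplified sketch of that idea, not the published algorithm: it assumes genes are already ordered along a chromosome, that a correlation between each adjacent pair of gene-level statistics is available, and that a single hypothetical threshold (0.8 here) decides whether neighbours are merged.

```python
def agglomerate(genes, adjacent_corr, threshold=0.8):
    """Group consecutive genes whose adjacent test-statistic correlation is strong.

    genes: gene names ordered by chromosomal position.
    adjacent_corr: adjacent_corr[i] is the correlation between genes[i] and genes[i + 1].
    threshold: hypothetical cut-off on |correlation| for merging neighbours.
    """
    if not genes:
        return []
    loci = [[genes[0]]]
    for gene, r in zip(genes[1:], adjacent_corr):
        if abs(r) >= threshold:
            loci[-1].append(gene)  # strongly correlated: extend the current locus
        else:
            loci.append([gene])    # effectively independent: start a new unit
    return loci


# Hypothetical example: A-B and C-D are tightly correlated, B-C is not.
loci = agglomerate(["A", "B", "C", "D"], [0.9, 0.2, 0.85])
```

Each resulting group would then be tested as one unit using the union of its SNPs; singleton groups correspond to genes assessed separately.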
7.
PLoS Comput Biol ; 18(1): e1009628, 2022 01.
Article in English | MEDLINE | ID: mdl-35025869

ABSTRACT

Genome-wide association studies rely on the statistical inference of untyped variants, called imputation, to increase the coverage of genotyping arrays. However, the results are often suboptimal in populations underrepresented in existing reference panels and array designs, since the selected single nucleotide polymorphisms (SNPs) may fail to capture population-specific haplotype structures, hence the full extent of common genetic variation. Here, we propose to sequence the full genomes of a small subset of an underrepresented study cohort to inform the selection of population-specific add-on tag SNPs and to generate an internal population-specific imputation reference panel, such that the remaining array-genotyped cohort could be more accurately imputed. Using a Tanzania-based cohort as a proof-of-concept, we demonstrate the validity of our approach by showing improvements in imputation accuracy after the addition of our designed add-on tags to the base H3Africa array.


Subjects
Genetics, Population; Genome-Wide Association Study; Genotype; Polymorphism, Single Nucleotide/genetics; Computational Biology/methods; Genetics, Population/methods; Genetics, Population/standards; Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Humans; Male; Tanzania
8.
PLoS Genet ; 16(4): e1008720, 2020 04.
Article in English | MEDLINE | ID: mdl-32310995

ABSTRACT

Horizontal integration of summary statistics from different GWAS traits can be used to evaluate evidence for their shared genetic causality. One popular method to do this is a Bayesian method, coloc, which is attractive in requiring only GWAS summary statistics and no linkage disequilibrium estimates and is now being used routinely to perform thousands of comparisons between traits. Here we show that while most users do not adjust default software values, misspecification of prior parameters can substantially alter posterior inference. We suggest data driven methods to derive sensible prior values, and demonstrate how sensitivity analysis can be used to assess robustness of posterior inference. The flexibility of coloc comes at the expense of an unrealistic assumption of a single causal variant per trait. This assumption can be relaxed by stepwise conditioning, but this requires external software and an LD matrix aligned to study alleles. We have now implemented conditioning within coloc, and propose a new alternative method, masking, that does not require LD and approximates conditioning when causal variants are independent. Importantly, masking can be used in combination with conditioning where allelically aligned LD estimates are available for only a single trait. We have implemented these developments in a new version of coloc which we hope will enable more informed choice of priors and overcome the restriction of the single causal variant assumptions in coloc analysis.


Subjects
Genome-Wide Association Study/methods; Linkage Disequilibrium; Genome-Wide Association Study/standards; Humans; Polymorphism, Single Nucleotide; Quantitative Trait Loci
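
The abstract above notes that prior misspecification can substantially alter coloc's posterior inference. The sketch below illustrates why, assuming the standard five-hypothesis, single-causal-variant formulation in which per-SNP log approximate Bayes factors for each trait are combined with priors p1, p2, and p12. It is an independent illustration rather than the coloc package's code; the default prior values are the ones commonly cited for coloc and should be treated as an assumption, and the log Bayes factors are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiff(a, b):
    """log(exp(a) - exp(b)), valid for a > b."""
    return a + math.log1p(-math.exp(b - a))


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log Bayes factors of two traits."""
    if len(labf1) != len(labf2) or len(labf1) < 2:
        raise ValueError("need the same SNPs in both traits, and at least two of them")
    both = [a + b for a, b in zip(labf1, labf2)]
    s1, s2, s12 = logsumexp(labf1), logsumexp(labf2), logsumexp(both)
    log_h = [
        0.0,                                                  # H0: no association
        math.log(p1) + s1,                                    # H1: trait 1 only
        math.log(p2) + s2,                                    # H2: trait 2 only
        math.log(p1) + math.log(p2) + logdiff(s1 + s2, s12),  # H3: distinct causal variants
        math.log(p12) + s12,                                  # H4: one shared causal variant
    ]
    z = logsumexp(log_h)
    return [math.exp(h - z) for h in log_h]


# Hypothetical log Bayes factors for three SNPs; the second SNP dominates in both traits.
labf1 = [0.0, 8.0, 1.0]
labf2 = [0.5, 7.0, 0.0]

# Simple sensitivity analysis: PP(H4) for three choices of p12.
pp_h4_by_prior = {p12: coloc_posteriors(labf1, labf2, p12=p12)[4] for p12 in (1e-6, 1e-5, 1e-4)}
```

Because only the H4 term scales with p12, the posterior for colocalisation rises monotonically with that prior, which is the kind of dependence a sensitivity analysis is meant to expose.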
9.
PLoS Genet ; 16(12): e1009060, 2020 12.
Article in English | MEDLINE | ID: mdl-33320851

ABSTRACT

Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.


Subjects
Genome-Wide Association Study/methods; Molecular Sequence Annotation/methods; Algorithms; Genome-Wide Association Study/standards; Humans; Molecular Sequence Annotation/standards; Polymorphism, Genetic; Quantitative Trait Loci; Reproducibility of Results
10.
PLoS Genet ; 16(8): e1008927, 2020 08.
Article in English | MEDLINE | ID: mdl-32797036

ABSTRACT

The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.


Subjects
Black or African American/genetics; Genome-Wide Association Study/methods; Models, Genetic; Transcriptome; Gene Expression Profiling/methods; Gene Expression Profiling/standards; Genome-Wide Association Study/standards; Humans; Quantitative Trait Loci; RNA-Seq/methods; RNA-Seq/standards; Reference Standards
11.
Mol Genet Genomics ; 297(1): 33-46, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34755217

ABSTRACT

Based on molecular markers, genomic prediction enables us to speed up breeding schemes and increase the response to selection. There are several high-throughput genotyping platforms able to deliver thousands of molecular markers for genomic study purposes. However, even though it is widely applied in plant breeding, species without a reference genome cannot fully benefit from genomic tools and modern breeding schemes. We used a method to assemble a population-tailored mock genome to call single-nucleotide polymorphism (SNP) markers without an available reference genome, and for the first time, we compared the results with standard genotyping platforms (array and genotyping-by-sequencing (GBS) using a reference genome) for performance in genomic prediction models. Our results indicate that using a population-tailored mock genome to call SNPs delivers reliable estimates for the genomic relationship between genotypes. Furthermore, genomic prediction estimates were comparable to standard approaches, especially when considering only additive effects. However, mock genomes were slightly worse than arrays at predicting traits influenced by dominance effects, but still performed as well as standard GBS methods that use a reference genome. Nevertheless, the array-based SNP marker methods achieved the best predictive ability and reliability to estimate variance components. Overall, mock genomes can be a worthy alternative for genomic selection studies, especially for species where a reference genome is not available.


Subjects
Computational Biology; Genotyping Techniques; Models, Genetic; Animals; Chimera/genetics; Computational Biology/methods; Computational Biology/standards; Datasets as Topic; Genome; Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Genomics/methods; Genomics/standards; Genotype; Genotyping Techniques/methods; Genotyping Techniques/standards; Phenotype; Reference Standards; Reproducibility of Results; Selection, Genetic; Species Specificity; Zea mays/classification; Zea mays/genetics
12.
Circ Res ; 126(12): 1816-1840, 2020 06 05.
Article in English | MEDLINE | ID: mdl-32496918

ABSTRACT

Genome-wide association studies have revolutionized our understanding of the genetic underpinnings of cardiometabolic disease. Yet, the inadequate representation of individuals of diverse ancestral backgrounds in these studies may undercut their ultimate potential for both public health and precision medicine. The goal of this review is to describe why it is imperative to study the populations who are most affected by cardiometabolic disease, with the aim of better understanding the genetic underpinnings of the disease. We support this premise by describing the current variation in the global burden of cardiometabolic disease and emphasize the importance of building a globally and ancestrally representative genetics evidence base for the identification of population-specific variants, fine-mapping, and polygenic risk score estimation. We discuss the important ethical, legal, and social implications of increasing ancestral diversity in genetic studies of cardiometabolic disease and the challenges that arise from (1) the lack of diversity in current reference populations and available analytic samples and (2) the unequal generation of health-associated genomic data and their prediction accuracies. Despite these challenges, we conclude that additional, unprecedented opportunities lie ahead for public health genomics and the realization of precision medicine, provided that the gap in diversity can be systematically addressed. Achieving this goal will require concerted efforts by social, academic, professional and regulatory stakeholders and communities, and these efforts must be based on principles of equity and social justice.


Subjects
Genome-Wide Association Study/methods; Metabolic Syndrome/genetics; Gene Frequency; Genome-Wide Association Study/standards; Humans; Metabolic Syndrome/epidemiology; Polymorphism, Genetic
13.
PLoS Comput Biol ; 17(2): e1007784, 2021 02.
Article in English | MEDLINE | ID: mdl-33606672

ABSTRACT

Rare variants are thought to play an important role in the etiology of complex diseases and may explain a significant fraction of the missing heritability in genetic disease studies. Next-generation sequencing facilitates the association of rare variants in coding or regulatory regions with complex diseases in large cohorts at genome-wide scale. However, rare variant association studies (RVAS) still lack power when cohorts are small to medium-sized and if genetic variation explains a small fraction of phenotypic variance. Here we present a novel Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI). Unlike existing RVAS tests, BATI allows integration of individual or variant-specific features as covariates, while efficiently performing inference based on full model estimation. We demonstrate that BATI outperforms established RVAS methods on realistic, semi-synthetic whole-exome sequencing cohorts, especially when using meaningful biological context, such as functional annotation. We show that BATI achieves power above 70% in scenarios in which competing tests fail to identify risk genes, e.g. when risk variants in sum explain less than 0.5% of phenotypic variance. We have integrated BATI, together with five existing RVAS tests in the 'Rare Variant Genome Wide Association Study' (rvGWAS) framework for data analyzed by whole-exome or whole genome sequencing. rvGWAS supports rare variant association for genes or any other biological unit such as promoters, while allowing the analysis of essential functionalities like quality control or filtering. Applying rvGWAS to a Chronic Lymphocytic Leukemia study we identified eight candidate predisposition genes, including EHMT2 and COPS7A.


Subjects
Genetic Variation; Genome-Wide Association Study/methods; Bayes Theorem; Benchmarking; Breast Neoplasms/genetics; COP9 Signalosome Complex/genetics; Case-Control Studies; Cohort Studies; Computational Biology; Computer Simulation; Data Interpretation, Statistical; Databases, Genetic; Female; Genetic Predisposition to Disease; Genome-Wide Association Study/standards; Genome-Wide Association Study/statistics & numerical data; Histocompatibility Antigens/genetics; Histone-Lysine N-Methyltransferase/genetics; Humans; Leukemia, Lymphocytic, Chronic, B-Cell/genetics; Quality Control; Risk Factors; Transcription Factors/genetics; Exome Sequencing/methods; Exome Sequencing/standards; Exome Sequencing/statistics & numerical data; Whole Genome Sequencing/methods; Whole Genome Sequencing/statistics & numerical data
14.
Genet Epidemiol ; 44(5): 425-441, 2020 07.
Article in English | MEDLINE | ID: mdl-32190932

ABSTRACT

In transcriptome-wide association studies (TWAS), gene expression values are predicted using genotype data and tested for association with a phenotype. The power of this approach to detect associations relies, at least in part, on the accuracy of the prediction. Here we compare the prediction accuracy of six different methods (LASSO, Ridge regression, Elastic net, Best Linear Unbiased Predictor, Bayesian Sparse Linear Mixed Model, and Random Forests) by performing cross-validation using data from the Geuvadis Project. We also examine prediction accuracy (a) at different sample sizes, (b) when ancestry of the prediction model training and testing populations is different, and (c) when the tissue used to train the model is different from the tissue to be predicted. We find that, for most genes, the expression cannot be accurately predicted, but in general sparse statistical models tend to outperform polygenic models at prediction. Average prediction accuracy is reduced when the model training set size is reduced or when predicting across ancestries and is marginally reduced when predicting across tissues. We conclude that using sparse statistical models and the development of large reference panels across multiple ethnicities and tissues will lead to better prediction of gene expression, and thus may improve TWAS power.


Subjects
Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Transcriptome; Bayes Theorem; Genotype; Humans; Models, Genetic; Models, Statistical; Pedigree; Phenotype; Reproducibility of Results; Sample Size
15.
Genetica ; 149(3): 143-153, 2021 Jun.
Article in English | MEDLINE | ID: mdl-33963492

ABSTRACT

Genome-wide studies are prone to false positives due to inherently low priors and statistical power. One approach to ameliorate this problem is to seek validation of reported candidate genes across independent studies: genes with repeatedly discovered effects are less likely to be false positives. Inversely, genes reported only as many times as expected by chance alone, while possibly representing novel discoveries, are also more likely to be false positives. We show that, across over 30 genome-wide studies that reported Drosophila and Daphnia genes with possible roles in thermal adaptation, the combined lists of candidate genes and orthologous groups are rapidly approaching the total number of genes and orthologous groups in the respective genomes. This is consistent with the expectation of high frequency of false positives. The majority of these spurious candidates have been identified by one or a few studies, as expected by chance alone. In contrast, a noticeable minority of genes have been identified by numerous studies with the probabilities of such discoveries occurring by chance alone being exceedingly small. For this subset of genes, different studies are in agreement with each other despite differences in the ecological settings, genomic tools and methodology, and reporting thresholds. We provide a reference set of presumed true positives among Drosophila candidate genes and orthologous groups involved in response to changes in temperature, suitable for cross-validation purposes. Despite this approach being prone to false negatives, this list of presumed true positives includes several hundred genes, consistent with the "omnigenic" concept of genetic architecture of complex traits.


Subjects
Genome-Wide Association Study/methods; Quantitative Trait Loci; Thermotolerance/genetics; Animals; Arthropods/genetics; Arthropods/physiology; False Positive Reactions; Genome-Wide Association Study/standards; Models, Genetic; Polymorphism, Genetic; Reference Standards
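
The abstract above contrasts genes reported about as often as chance predicts with genes reported by many studies. A minimal sketch of that chance expectation follows, under simplifying assumptions that are not taken from the paper: every study reports a candidate list of the same size, drawn uniformly and independently from the genome, so the number of studies reporting a given gene is binomial. The study count, list size, and genome size are hypothetical.

```python
from math import comb


def prob_at_least(k, n_studies, p_report):
    """P(X >= k) for X ~ Binomial(n_studies, p_report)."""
    return sum(
        comb(n_studies, j) * p_report**j * (1 - p_report) ** (n_studies - j)
        for j in range(k, n_studies + 1)
    )


# Hypothetical setting: 30 studies, each listing 700 of 14,000 genes at random.
n_studies = 30
genome_size = 14000
list_size = 700
p_report = list_size / genome_size

# Expected number of genes reported by at least k studies purely by chance.
expected_by_chance = {
    k: genome_size * prob_at_least(k, n_studies, p_report) for k in (1, 3, 5, 8)
}
```

Comparing the observed number of genes at each k with these expectations indicates where repeated discovery stops being explicable by chance; unequal list sizes would call for a Poisson-binomial model instead.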
16.
Genetica ; 149(5-6): 313-325, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34480683

ABSTRACT

Reducing false discoveries caused by population stratification (PS) has always been a challenge in genome-wide association studies (GWAS). The current literature has established several single-marker approaches, including genomic control (GC), EIGENSTRAT and the generalized linear mixed model association test (GMMAT), and multi-marker methods such as the LASSO mixed model (LASSOMM). However, the single-marker methods require prespecifying an arbitrary p value threshold in the selection process, likely resulting in suboptimal precision or recall. On the other hand, it appears that LASSOMM is extremely computationally intensive and may not be suitable for large-scale GWAS. In this paper, we propose a simple multi-marker approach (PCA-LASSO) combining principal component analysis (PCA) and the least absolute shrinkage and selection operator (LASSO). We utilize PCA to correct for the confounding effects of PS and LASSO with built-in cross-validation for a data-driven selection. Compared to the current single-marker approaches, the proposed PCA-LASSO provides an optimal balance between precision and recall, and consequently superior F1 scores. Similarly, compared to LASSOMM, PCA-LASSO markedly increases the precision while minimizing the loss of recall, and therefore improves the overall F1 score in the presence of PS. More importantly, PCA-LASSO reduces the computational time by more than 1000-fold when compared to LASSOMM. We applied PCA-LASSO to a real dataset of Alzheimer's disease and successfully identified SNP rs429358 (Gene APOE4) which has been widely reported to be associated with the onset and elevated risk of Alzheimer's disease. In conclusion, PCA-LASSO is a simple, fast, but accurate approach for GWAS in the presence of latent PS.


Subjects
Genetic Predisposition to Disease; Genome-Wide Association Study/methods; Genome-Wide Association Study/standards; Alzheimer Disease/genetics; Datasets as Topic; Genomics; Humans; Principal Component Analysis; Time Factors
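
The abstract above compares methods by precision, recall, and F1 score. As a small worked illustration of those metrics (not the authors' evaluation code), the sketch below scores a set of selected markers against a known set of causal markers, as one might in a simulation; the SNP names and counts are hypothetical.

```python
def precision_recall_f1(selected, causal):
    """Score a set of selected markers against the set of truly associated ones."""
    true_pos = len(selected & causal)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(causal) if causal else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical simulation: 4 causal SNPs; a method selects 5 markers, 3 of them causal.
causal = {"snp1", "snp2", "snp3", "snp4"}
selected = {"snp1", "snp2", "snp3", "snp8", "snp9"}
precision, recall, f1 = precision_recall_f1(selected, causal)
```

A stricter p value threshold typically trades recall for precision, which is why a single summary such as F1 is useful when comparing selection procedures.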
17.
Genet Sel Evol ; 53(1): 64, 2021 Jul 29.
Article in English | MEDLINE | ID: mdl-34325663

ABSTRACT

BACKGROUND: With the completion of a single nucleotide polymorphism (SNP) chip for honey bees, the technical basis of genomic selection is laid. However, for its application in practice, methods to estimate genomic breeding values need to be adapted to the specificities of the genetics and breeding infrastructure of this species. Drone-producing queens (DPQ) are used for mating control, and usually, they head non-phenotyped colonies that will be placed on mating stations. Breeding queens (BQ) head colonies that are intended to be phenotyped and used to produce new queens. Our aim was to evaluate different breeding program designs for the initiation of genomic selection in honey bees. METHODS: Stochastic simulations were conducted to evaluate the quality of the estimated breeding values. We developed a variation of the genomic relationship matrix to include genotypes of DPQ and tested different sizes of the reference population. The results were used to estimate genetic gain in the initial selection cycle of a genomic breeding program. This program was run over six years, and different numbers of genotyped queens per year were considered. Resources could be allocated to increase the reference population, or to perform genomic preselection of BQ and/or DPQ. RESULTS: Including the genotypes of 5000 phenotyped BQ increased the accuracy of predictions of breeding values by up to 173%, depending on the size of the reference population and the trait considered. To initiate a breeding program, genotyping a minimum number of 1000 queens per year is required. In this case, genetic gain was highest when genomic preselection of DPQ was coupled with the genotyping of 10-20% of the phenotyped BQ. For maximum genetic gain per used genotype, more than 2500 genotyped queens per year and preselection of all BQ and DPQ are required. 
CONCLUSIONS: This study shows that the first priority in a breeding program is to genotype phenotyped BQ to obtain a sufficiently large reference population, which allows successful genomic preselection of queens. To maximize genetic gain, DPQ should be preselected, and their genotypes included in the genomic relationship matrix. We suggest that the developed methods for genomic prediction are suitable for implementation in genomic honey bee breeding programs.
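
For readers unfamiliar with the term, a genomic relationship matrix can be sketched as follows. This is only the standard diploid construction usually attributed to VanRaden (allele counts centered by twice the allele frequency, scaled by 2 Σ p(1 − p)); it is not the authors' variation for drone-producing queens, which the abstract does not specify. The function name and the toy genotypes are hypothetical.

```python
def genomic_relationship_matrix(genotypes):
    """Standard diploid genomic relationship matrix (illustrative sketch).

    genotypes: list of rows, one per individual, each a list of allele
    counts coded 0, 1 or 2 per marker. Allele frequencies are estimated
    from the sample itself (an assumption made for this example).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency of the counted allele at each marker.
    p = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    # Scaling factor: twice the summed heterozygosity p(1 - p).
    scale = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    if scale == 0.0:
        raise ValueError("all markers are monomorphic")
    # Center each genotype by its expected value 2p.
    z = [[row[j] - 2.0 * p[j] for j in range(m)] for row in genotypes]
    # G = Z Z' / scale
    return [
        [sum(z[a][j] * z[b][j] for j in range(m)) / scale for b in range(n)]
        for a in range(n)
    ]


# Hypothetical toy data: 4 individuals, 3 markers.
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
]
G = genomic_relationship_matrix(genotypes)
```

Because the frequencies are taken from the same sample, each row of `G` sums to zero; in practice, base-population frequencies are often preferred.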


Subjects
Bees/genetics , Genetic Models , Artificial Selection , Animals , Insect Genome , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Genotyping Techniques/methods
18.
Genet Sel Evol ; 53(1): 55, 2021 Jun 29.
Article in English | MEDLINE | ID: mdl-34187354

ABSTRACT

BACKGROUND: Mathematical models are needed for the design of breeding programs using genomic prediction. While deterministic models for selection on pedigree-based estimates of breeding values (PEBV) are available, these have not been fully developed for genomic selection, with a key missing component being the accuracy of genomic EBV (GEBV) of selection candidates. Here, a deterministic method was developed to predict this accuracy within a closed breeding population based on the accuracy of GEBV and PEBV in the reference population and the distance of selection candidates from their closest ancestors in the reference population. METHODS: The accuracy of GEBV was modeled as a combination of the accuracy of PEBV and of EBV based on genomic relationships deviated from pedigree (DEBV). Loss of the accuracy of DEBV from the reference to the target population was modeled based on the effective number of independent chromosome segments in the reference population (Me). Measures of Me derived from the inverse of the variance of relationships and from the accuracies of GEBV and PEBV in the reference population, derived using either a Fisher information or a selection index approach, were compared by simulation. RESULTS: Using simulation, both the Fisher and the selection index approach correctly predicted accuracy in the target population over time, both with and without selection. The index approach, however, resulted in estimates of Me that were less affected by heritability, reference size, and selection, and which are, therefore, more appropriate as a population parameter. The variance of relationships underpredicted Me and was greatly affected by selection. A leave-one-out cross-validation approach was proposed to estimate required accuracies of EBV in the reference population. Aspects of the methods were validated using real data. 
CONCLUSIONS: A deterministic method was developed to predict the accuracy of GEBV in selection candidates in a closed breeding population. The population parameter Me that is required for these predictions can be derived from an available reference data set, and applied to other reference data sets and traits for that population. This method can be used to evaluate the benefit of genomic prediction and to optimize genomic selection breeding programs.


Subjects
Genetic Models , Artificial Selection , Animals , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Livestock/genetics , Pedigree , Single Nucleotide Polymorphism , Quantitative Trait Loci
19.
Genet Sel Evol ; 53(1): 46, 2021 May 31.
Article in English | MEDLINE | ID: mdl-34058971

ABSTRACT

BACKGROUND: In dairy cattle populations in which crossbreeding has been used, animals show some level of diversity in their origins. In rotational crossbreeding, for instance, crossbred dams are mated with purebred sires from different pure breeds, and the genetic composition of crossbred animals is an admixture of the breeds included in the rotation. How to use the data of such individuals in genomic evaluations is still an open question. In this study, we aimed at providing methodologies for the use of data from crossbred individuals with an admixed genetic background together with data from multiple pure breeds, for the purpose of genomic evaluations for both purebred and crossbred animals. A three-breed rotational crossbreeding system was mimicked using simulations based on animals genotyped with the 50 K single nucleotide polymorphism (SNP) chip. RESULTS: For purebred populations, within-breed genomic predictions generally led to higher accuracies than those from multi-breed predictions using combined data of pure breeds. Adding admixed population's (MIX) data to the combined pure breed data considering MIX as a different breed led to higher accuracies. When prediction models were able to account for breed origin of alleles, accuracies were generally higher than those from combining all available data, depending on the correlation of quantitative trait loci (QTL) effects between the breeds. Accuracies varied when using SNP effects from any of the pure breeds to predict the breeding values of MIX. Using those breed-specific SNP effects that were estimated separately in each pure breed, while accounting for breed origin of alleles for the selection candidates of MIX, generally improved the accuracies. Models that are able to accommodate MIX data with the breed origin of alleles approach generally led to higher accuracies than models without breed origin of alleles, depending on the correlation of QTL effects between the breeds. 
CONCLUSIONS: Combining all available data (from both the pure breeds and the admixed population) in a multi-breed reference population is beneficial for the estimation of breeding values for pure breeds with a small reference population. For MIX, such an approach can lead to higher accuracies than considering breed origin of alleles for the selection candidates, and using breed-specific SNP effects estimated separately in each pure breed. Including MIX data in the reference population of multiple breeds, while considering the breed origin of alleles, can further improve accuracies. Our findings are relevant for breeding programs in which crossbreeding is systematically applied, and also for populations that comprise different subpopulations between which exchange of genetic material is routine practice.


Subjects
Cattle/genetics , Genetic Hybridization , Single Nucleotide Polymorphism , Animals , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Inbreeding , Genetic Models , Quantitative Trait Loci , Reference Standards , Artificial Selection
20.
Proc Natl Acad Sci U S A ; 115(22): E4970-E4979, 2018 05 29.
Article in English | MEDLINE | ID: mdl-29686100

ABSTRACT

Identifying causal effects in nonexperimental data is an enduring challenge. One proposed solution that recently gained popularity is the idea to use genes as instrumental variables [i.e., Mendelian randomization (MR)]. However, this approach is problematic because many variables of interest are genetically correlated, which implies the possibility that many genes could affect both the exposure and the outcome directly or via unobserved confounding factors. Thus, pleiotropic effects of genes are themselves a source of bias in nonexperimental data that would also undermine the ability of MR to correct for endogeneity bias from nongenetic sources. Here, we propose an alternative approach, genetic instrumental variable (GIV) regression, that provides estimates for the effect of an exposure on an outcome in the presence of pleiotropy. As a valuable byproduct, GIV regression also provides accurate estimates of the chip heritability of the outcome variable. GIV regression uses polygenic scores (PGSs) for the outcome of interest which can be constructed from genome-wide association study (GWAS) results. By splitting the GWAS sample for the outcome into nonoverlapping subsamples, we obtain multiple indicators of the outcome PGSs that can be used as instruments for each other and, in combination with other methods such as sibling fixed effects, can address endogeneity bias from both pleiotropy and the environment. In two empirical applications, we demonstrate that our approach produces reasonable estimates of the chip heritability of educational attainment (EA) and show that standard regression and MR provide upwardly biased estimates of the effect of body height on EA.
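
The split-sample idea in this abstract, using one noisy polygenic score as an instrument for another, can be illustrated with a small simulation. This is a deliberately simplified sketch, not the authors' GIV regression: it assumes a single latent genetic factor, two scores with independent estimation noise, and no exposure variable or pleiotropy. All names and parameter values are hypothetical.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=0.5, noise_sd=1.0, seed=0):
    """Simulate an outcome and two noisy scores of the same latent factor."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]            # latent genetic factor
    pgs1 = [gi + rng.gauss(0.0, noise_sd) for gi in g]     # score from GWAS subsample 1
    pgs2 = [gi + rng.gauss(0.0, noise_sd) for gi in g]     # score from GWAS subsample 2
    y = [beta * gi + rng.gauss(0.0, 1.0) for gi in g]      # outcome
    return pgs1, pgs2, y


pgs1, pgs2, y = simulate()

# Ordinary regression of y on one noisy score is attenuated toward zero.
ols_slope = cov(pgs1, y) / cov(pgs1, pgs1)

# Instrumenting pgs1 with pgs2 removes the attenuation, because the
# estimation noise in the two scores is independent by construction.
iv_slope = cov(pgs2, y) / cov(pgs2, pgs1)
```

Under these assumptions the ordinary slope is pulled toward roughly half of `beta`, while the instrumented slope lands close to `beta` itself.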


Subjects
Genetic Pleiotropy , Genetic Variation , Genome-Wide Association Study , Socioeconomic Factors , Body Height/physiology , Educational Status , Genome-Wide Association Study/standards , Genome-Wide Association Study/statistics & numerical data , Humans , Health Care Outcome Assessment