1.
Am J Hum Genet ; 111(5): 990-995, 2024 05 02.
Article in English | MEDLINE | ID: mdl-38636510

ABSTRACT

Since genotype imputation was introduced, researchers have been relying on the estimated imputation quality from imputation software to perform post-imputation quality control (QC). However, this quality estimate (denoted as Rsq) performs less well for lower-frequency variants. We recently published MagicalRsq, a machine-learning-based imputation quality calibration, which leverages additional typed markers from the same cohort and outperforms Rsq as a QC metric. In this work, we extended the original MagicalRsq to allow cross-cohort model training and named the new model MagicalRsq-X. We removed the cohort-specific estimated minor allele frequency and included linkage disequilibrium scores and recombination rates as additional features. Leveraging whole-genome sequencing data from TOPMed, specifically participants in the BioMe, JHS, WHI, and MESA studies, we performed comprehensive cross-cohort evaluations for predominantly European and African ancestral individuals based on their inferred global ancestry with the 1000 Genomes and Human Genome Diversity Project data as reference. Our results suggest MagicalRsq-X outperforms Rsq in almost every setting, with 7.3%-14.4% improvement in squared Pearson correlation with true R2, corresponding to 85-218 K variant gains. We further developed a metric to quantify the genetic distances of a target cohort relative to a reference cohort and showed that such metric largely explained the performance of MagicalRsq-X models. Finally, we found MagicalRsq-X saved up to 53 known genome-wide significant variants in one of the largest blood cell trait GWASs that would be missed using the original Rsq for QC. In conclusion, MagicalRsq-X shows superiority for post-imputation QC and benefits genetic studies by distinguishing well and poorly imputed lower-frequency variants.


Subjects
Gene Frequency , Genotype , Single Nucleotide Polymorphism , Software , Humans , Cohort Studies , Linkage Disequilibrium , Genome-Wide Association Study/methods , Human Genome , Quality Control , Machine Learning , Whole Genome Sequencing/standards , Whole Genome Sequencing/methods
2.
Am J Hum Genet ; 109(6): 1007-1015, 2022 06 02.
Article in English | MEDLINE | ID: mdl-35508176

ABSTRACT

Genotype imputation is an integral tool in genome-wide association studies, in which it facilitates meta-analysis, increases power, and enables fine-mapping. With the increasing availability of whole-genome-sequence datasets, investigators have access to a multitude of reference-panel choices for genotype imputation. In principle, combining all sequenced whole genomes into a single large panel would provide the best imputation performance, but this is often cumbersome or impossible due to privacy restrictions. Here, we describe meta-imputation, a method that allows imputation results generated using different reference panels to be combined into a consensus imputed dataset. Our meta-imputation method requires small changes to the output of existing imputation tools to produce necessary inputs, which are then combined using dynamically estimated weights that are tailored to each individual and genome segment. In the scenarios we examined, the method consistently outperforms imputation using a single reference panel and achieves accuracy comparable to imputation using a combined reference panel.


Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Genome , Genome-Wide Association Study/methods , Genotype , Humans , Single Nucleotide Polymorphism/genetics , Research Design
3.
Am J Hum Genet ; 109(9): 1653-1666, 2022 09 01.
Article in English | MEDLINE | ID: mdl-35981533

ABSTRACT

Understanding the genetic basis of human diseases and traits is dependent on the identification and accurate genotyping of genetic variants. Deep whole-genome sequencing (WGS), the gold standard technology for SNP and indel identification and genotyping, remains very expensive for most large studies. Here, we quantify the extent to which array genotyping followed by genotype imputation can approximate WGS in studies of individuals of African, Hispanic/Latino, and European ancestry in the US and of Finnish ancestry in Finland (a population isolate). For each study, we performed genotype imputation by using the genetic variants present on the Illumina Core, OmniExpress, MEGA, and Omni 2.5M arrays with the 1000G, HRC, and TOPMed imputation reference panels. Using the Omni 2.5M array and the TOPMed panel, ≥90% of bi-allelic single-nucleotide variants (SNVs) are well imputed (r2 > 0.8) down to minor-allele frequencies (MAFs) of 0.14% in African, 0.11% in Hispanic/Latino, 0.35% in European, and 0.85% in Finnish ancestries. There was little difference in TOPMed-based imputation quality among the arrays with >700k variants. Individual-level imputation quality varied widely between and within the three US studies. Imputation quality also varied across genomic regions, producing regions where even common (MAF > 5%) variants were consistently not well imputed across ancestries. The extent to which array genotyping and imputation can approximate WGS therefore depends on reference panel, genotype array, sample ancestry, and genomic location. Imputation quality by variant or genomic region can be queried with our new tool, RsqBrowser, now deployed on the Michigan Imputation Server.
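The "≥90% of variants well imputed (r2 > 0.8)" summaries in this abstract are per-MAF-bin proportions. A minimal Python sketch of that bookkeeping follows; the function name, the `(maf, r2)` input layout, and the half-open bin convention are illustrative assumptions, not the authors' pipeline.

```python
def fraction_well_imputed(variants, maf_bins, r2_threshold=0.8):
    """Return, for each MAF bin, the share of variants with r2 above the threshold.

    variants: iterable of (maf, r2) pairs (hypothetical input layout).
    maf_bins: list of (low, high) pairs, interpreted here as low < maf <= high.
    """
    variants = list(variants)
    result = {}
    for low, high in maf_bins:
        in_bin = [r2 for maf, r2 in variants if low < maf <= high]
        if in_bin:
            result[(low, high)] = sum(r2 > r2_threshold for r2 in in_bin) / len(in_bin)
        else:
            result[(low, high)] = None  # no variants observed in this bin
    return result


# Toy data for illustration only (not values from the study).
toy = [(0.001, 0.9), (0.002, 0.5), (0.2, 0.95), (0.3, 0.85)]
summary = fraction_well_imputed(toy, [(0.0, 0.01), (0.01, 0.5)])
```

Scanning such a table from common to rare bins gives the lowest MAF at which a chosen proportion (for example 90%) of variants is still well imputed.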


Subjects
High-Throughput Nucleotide Sequencing , Single Nucleotide Polymorphism , Gene Frequency/genetics , Genome-Wide Association Study , Genotype , Humans , Single Nucleotide Polymorphism/genetics , Whole Genome Sequencing
4.
Am J Hum Genet ; 109(11): 1986-1997, 2022 11 03.
Article in English | MEDLINE | ID: mdl-36198314

ABSTRACT

Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics, to provide a better calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP), and whole-exome sequence data from UK BioBank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R2 than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when applying models trained from 1,992 CFGP sequenced samples to an independent 3,103 samples with no sequencing but TOPMed imputation from array genotypes, MagicalRsq, compared to standard Rsq, achieved net gains of 1.4 million rare, 117k low-frequency, and 18k common variants, where net gains were gained numbers of correctly distinguished variants by MagicalRsq over standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from those poorly imputed. MagicalRsq is freely available on GitHub.


Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Humans , Genome-Wide Association Study/methods , Single Nucleotide Polymorphism/genetics , Calibration , Genotype , Machine Learning
5.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38221906

ABSTRACT

Large-scale imputation reference panels are currently available and have contributed to efficient genome-wide association studies through genotype imputation. However, whether large-size multi-ancestry or small-size population-specific reference panels are the optimal choices for under-represented populations continues to be debated. We imputed genotypes of East Asian (180k Japanese) subjects using the Trans-Omics for Precision Medicine reference panel and found that the standard imputation quality metric (Rsq) overestimated dosage r2 (squared correlation between imputed dosage and true genotype) particularly in marginal-quality bins. Variance component analysis of Rsq revealed that the increased imputed-genotype certainty (dosages closer to 0, 1 or 2) caused upward bias, indicating some systemic bias in the imputation. Through systematic simulations using different template switching rates (θ value) in the hidden Markov model, we revealed that the lower θ value increased the imputed-genotype certainty and Rsq; however, dosage r2 was insensitive to the θ value, thereby causing a deviation. In simulated reference panels with different sizes and ancestral diversities, the θ value estimates from Minimac decreased with the size of a single ancestry and increased with the ancestral diversity. Thus, Rsq could be deviated from dosage r2 for a subpopulation in the multi-ancestry panel, and the deviation represents different imputed-dosage distributions. Finally, despite the impact of the θ value, distant ancestries in the reference panel contributed only a few additional variants passing a predefined Rsq threshold. We conclude that the θ value substantially impacts the imputed dosage and the imputation quality metric value.
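To make the two quantities contrasted in this abstract concrete, here is a minimal Python sketch. `dosage_r2` follows the definition given above (squared correlation between imputed dosage and true genotype). `estimated_rsq` uses a variance-ratio form, the observed dosage variance divided by the expected variance 2p(1-p), which is an assumption about how such a metric is commonly formulated and not necessarily the exact Minimac computation.

```python
def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov * cov / (var_d * var_g)


def estimated_rsq(dosages):
    """Variance-ratio quality estimate (assumed form): Var(dosage) / (2p(1-p))."""
    n = len(dosages)
    p = sum(dosages) / (2 * n)  # estimated allele frequency
    if p <= 0 or p >= 1:
        return float("nan")
    var_d = sum((d - 2 * p) ** 2 for d in dosages) / n
    return var_d / (2 * p * (1 - p))


# Toy example: dosages shrunk towards the mean keep a perfect correlation with
# the truth, yet the variance ratio drops, so the two metrics can disagree.
truth = [0, 1, 1, 2]
shrunk = [0.5, 1.0, 1.0, 1.5]
r2_value = dosage_r2(shrunk, truth)
rsq_value = estimated_rsq(shrunk)
```

Because the estimate needs no true genotypes, it responds to how certain the dosages look (how close they are to 0, 1 or 2), which is the mechanism the abstract identifies as the source of the deviation from dosage r2.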


Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Humans , Gene Frequency , Genotype
6.
Genet Epidemiol ; 47(2): 121-134, 2023 03.
Article in English | MEDLINE | ID: mdl-36490288

ABSTRACT

The large-scale open access whole-exome sequencing (WES) data of the UK Biobank ~200,000 participants is accelerating a new wave of genetic association studies aiming to identify rare and functional loss-of-function (LoF) variants associated with complex traits and diseases. We proposed to merge the WES genotypes and the genome-wide genotyping (GWAS) genotypes of 167,000 UKB homogeneous European participants into a combined reference panel, and then to impute 241,911 UKB homogeneous European participants who had the GWAS genotypes only. We then used the imputed data to replicate associations identified in the discovery WES sample. The average imputation accuracy measure r2 is modest to high for LoF variants at all minor allele frequency intervals: 0.942 at MAF interval (0.01, 0.5), 0.807 at (1.0 × 10⁻³, 0.01), 0.805 at (1.0 × 10⁻⁴, 1.0 × 10⁻³), 0.664 at (1.0 × 10⁻⁵, 1.0 × 10⁻⁴) and 0.410 at (0, 1.0 × 10⁻⁵). As applications, we studied associations of LoF variants with estimated heel BMD and four lipid traits. In addition to replicating dozens of previously reported genes, we also identified three novel associations, two genes PLIN1 and ANGPTL3 for high-density-lipoprotein cholesterol and one gene PDE3B for triglycerides. Our results highlighted the strength of WES based genotype imputation as well as provided useful imputed data within the UKB cohort.


Subjects
Biological Specimen Banks , Exome , Humans , Exome Sequencing , Genotype , Gene Frequency , United Kingdom , Genome-Wide Association Study/methods , Single Nucleotide Polymorphism , Angiopoietin-Like Protein 3
7.
Mamm Genome ; 35(3): 461-473, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39028337

ABSTRACT

Ancient DNA provides a unique frame for directly studying human population genetics in time and space. Still, since most of the ancient genomic data is low coverage, analysis is confronted with a low number of SNPs, genotype uncertainties, and reference bias. Here, for the first time, we benchmark the two distinct versions of Glimpse tools on 120 ancient human genomes from Eurasia including those largely from previously under-evaluated regions and compare the performance of genotype imputation with de facto analysis approaches for low coverage genomic data analysis. We further investigate the impact of two distinct reference panels on imputation accuracy for low coverage genomic data. We compute accuracy statistics and perform PCA and f4-statistics to explore the behaviour of genotype imputation on low coverage data regarding (i) two versions of Glimpse, (ii) two reference panels, (iii) four post-imputation filters and coverages, as well as (iv) data type and geographical origin of the samples on the analyses. Our results reveal that even for 0.1X coverage ancient human genomes, genotype imputation using Glimpse v2 is suitable. Additionally, using the 1000 Genomes merged with Human Genome Diversity Panel improves the accuracy of imputation for the rare variants with low MAF, which might be important not only for ancient genomics but also for modern human genomic studies based on low coverage data and for haplotype-based analysis. Most importantly, we reveal that genotype imputation of low coverage ancient human genomes reduces the genetic affinity of the samples towards the human reference genome. Through solving one of the most challenging biases in data analysis, so-called reference bias, genotype imputation using Glimpse v2 is promising for low coverage ancient human genomic data analysis and for rare-variant-based and haplotype-based analysis.


Subjects
Ancient DNA , Human Genome , Genotype , Single Nucleotide Polymorphism , Humans , Ancient DNA/analysis , Population Genetics/methods , Software , Genomics/methods
8.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-36088550

ABSTRACT

Somatic variants act as critical players during cancer occurrence and development. Thus, an accurate and robust method to identify them is the foundation of cutting-edge cancer genome research. However, due to low accessibility and high individual-/sample-specificity of the somatic variants in tumor samples, the detection is, to date, still crammed with challenges, particularly when lacking paired normal samples as control. To solve this burning issue, we developed a tumor-only somatic and germline variant identification method (TSomVar) using the random forest algorithm established on sample-specific variant datasets derived from genotype imputation, reads-mapping level annotation and functional annotation. We trained TSomVar by using genomic variant datasets of three major cancer types: colorectal cancer, hepatocellular carcinoma and skin cutaneous melanoma. Compared with existing tumor-only somatic variant identification tools, TSomVar shows excellent performances in somatic variant detection with higher accuracy and better capability of recalling for test datasets from colorectal cancer and skin cutaneous melanoma. In addition, TSomVar is equipped with the competence of accurately identifying germline variants in tumor samples. Taken together, TSomVar will undoubtedly facilitate and revolutionize somatic variant explorations in cancer research.


Subjects
Colorectal Neoplasms , Melanoma , Neoplasms , Skin Neoplasms , High-Throughput Nucleotide Sequencing/methods , Humans , Melanoma/genetics , Neoplasms/genetics , Skin Neoplasms/genetics , Cutaneous Malignant Melanoma
9.
Clin Genet ; 106(3): 284-292, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38719617

ABSTRACT

Genetic maps are fundamental resources for linkage and association studies. A fine-scale genetic map can be constructed by inferring historical recombination events from the genome-wide structure of linkage disequilibrium (a non-random association of alleles among loci) by using population-scale sequencing data. We constructed a fine-scale genetic map and identified recombination hotspots from 10,092,551 bi-allelic high-quality autosomal markers segregating among 150 unrelated Japanese individuals whose genotypes were determined by high-coverage (30×) whole-genome sequencing, and the genotype quality was carefully controlled by using their parents' and offspring's genotypes. The pedigree information was also utilized for haplotype phasing. The resulting genome-wide recombination rate profiles were concordant with those of the worldwide population on a broad scale, and the resolution was much improved. We identified 9487 recombination hotspots and confirmed the enrichment of previously known motifs in the hotspots. Moreover, we demonstrated that the Japanese genetic map improved the haplotype phasing and genotype imputation accuracy for the Japanese population. The construction of a population-specific genetic map will help make genetics research more accurate.


Subjects
Chromosome Mapping , East Asian People , Linkage Disequilibrium , Genetic Recombination , Humans , Alleles , East Asian People/genetics , Genetic Linkage , Population Genetics , Human Genome , Genome-Wide Association Study , Genotype , Haplotypes , Japan , Pedigree , Single Nucleotide Polymorphism , Whole Genome Sequencing
10.
BMC Biol ; 21(1): 286, 2023 12 08.
Article in English | MEDLINE | ID: mdl-38066581

ABSTRACT

BACKGROUND: Genomic prediction describes the use of SNP genotypes to predict complex traits and has been widely applied in humans and agricultural species. Genotyping-by-sequencing, a method which uses low-coverage sequence data paired with genotype imputation, is becoming an increasingly popular SNP genotyping method for genomic prediction. The development of Oxford Nanopore Technologies' (ONT) MinION sequencer has now made genotyping-by-sequencing portable and rapid. Here we evaluate the speed and accuracy of genomic predictions using low-coverage ONT sequence data in a population of cattle using four imputation approaches. We also investigate the effect of SNP reference panel size on imputation performance. RESULTS: SNP array genotypes and ONT sequence data for 62 beef heifers were used to calculate genomic estimated breeding values (GEBVs) from 641 k SNP for four traits. GEBV accuracy was much higher when genome-wide flanking SNP from sequence data were used to help impute the 641 k panel used for genomic predictions. Using the imputation package QUILT, correlations between ONT and low-density SNP array genomic breeding values were greater than 0.91 and up to 0.97 for sequencing coverages as low as 0.1× using a reference panel of 48 million SNP. Imputation time was significantly reduced by decreasing the number of flanking sequence SNP used in imputation for all methods. When compared to high-density SNP arrays, genotyping accuracy and genomic breeding value correlations at 0.5× coverage were also found to be higher than those imputed from low-density arrays. CONCLUSIONS: Here we demonstrated accurate genomic prediction is possible with ONT sequence data from sequencing coverages as low as 0.1×, and imputation time can be as short as 10 min per sample. We also demonstrate that in this population, genotyping-by-sequencing at 0.1× coverage can be more accurate than imputation from low-density SNP arrays.


Subjects
Nanopore Sequencing , Humans , Animals , Cattle/genetics , Female , Single Nucleotide Polymorphism , Genome , Genomics/methods , Genotype
11.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33236761

ABSTRACT

Haplotype phasing is a critical step for many genetic applications but incorrect estimates of phase can negatively impact downstream analyses. One proposed strategy to improve phasing accuracy is to combine multiple independent phasing estimates to overcome the limitations of any individual estimate. However, such a strategy is yet to be thoroughly explored. This study provides a comprehensive evaluation of consensus strategies for haplotype phasing. We explore the performance of different consensus paradigms, and the effect of specific constituent tools, across several datasets with different characteristics and their impact on the downstream task of genotype imputation. Based on the outputs of existing phasing tools, we explore two different strategies to construct haplotype consensus estimators: voting across outputs from multiple phasing tools and multiple outputs of a single non-deterministic tool. We find that the consensus approach from multiple tools reduces SE by an average of 10% compared to any constituent tool when applied to European populations and has the highest accuracy regardless of population ethnicity, sample size, variant density or variant frequency. Furthermore, the consensus estimator improves the accuracy of the downstream task of genotype imputation carried out by the widely used Minimac3, pbwt and BEAGLE5 tools. Our results provide guidance on how to produce the most accurate phasing estimates and the trade-offs that a consensus approach may have. Our implementation of consensus haplotype phasing, consHap, is available freely at https://github.com/ziadbkh/consHap. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
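The voting idea described in this abstract can be sketched in a few lines of Python. This is an illustrative simplification, not the consHap implementation: it assumes each tool's output for one individual is reduced to a 0/1 list over the same heterozygous sites (the allele placed on an arbitrarily labelled haplotype A), and it votes on the relative phase of adjacent sites so that the arbitrary haplotype labelling of each tool does not matter.

```python
def consensus_phase(phasings):
    """Majority-vote consensus over phase estimates from several tools.

    phasings: list of equal-length 0/1 lists, one per tool (hypothetical encoding:
    the allele carried by haplotype A at each heterozygous site).
    Returns a 0/1 list with the first site fixed to 0 to pin the orientation.
    """
    if not phasings or not phasings[0]:
        return []
    n_sites = len(phasings[0])
    consensus = [0]
    for i in range(1, n_sites):
        # Count the tools that place sites i-1 and i in opposite phase.
        flips = sum(p[i] != p[i - 1] for p in phasings)
        flip = 2 * flips > len(phasings)  # strict majority; ties keep the phase
        consensus.append(consensus[-1] ^ int(flip))
    return consensus


# Toy example: tool 2 equals tool 1 with haplotype labels swapped,
# and tool 3 disagrees with both at the last two transitions.
tools = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 0]]
voted = consensus_phase(tools)
```

Using an odd number of inputs avoids ties; the same function applies whether the inputs come from different tools or from repeated runs of one non-deterministic tool.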


Subjects
Algorithms , Nucleic Acid Databases , Single Nucleotide Polymorphism , DNA Sequence Analysis , Haplotypes , Humans
12.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34402866

ABSTRACT

Genotype imputation is a statistical method for estimating missing genotypes from a denser haplotype reference panel. Existing methods usually performed well on common variants, but they may not be ideal for low-frequency and rare variants. Previous studies showed that the population similarity between study and reference panels is one of the key factors influencing the imputation accuracy. Here, we developed an imputation reference panel reconstruction method (RefRGim) using convolutional neural networks (CNNs), which can generate a study-specified reference panel for each input data based on the genetic similarity of individuals from current study and references. The CNNs were pretrained with single nucleotide polymorphism data from the 1000 Genomes Project. Our evaluations showed that genotype imputation with RefRGim can achieve higher accuracies than original reference panel, especially for low-frequency and rare variants. RefRGim will serve as an efficient reference panel reconstruction method for genotype imputation. RefRGim is freely available via GitHub: https://github.com/shishuo16/RefRGim.


Subjects
Computational Biology/methods , Genotype , Genotyping Techniques/methods , Computer Neural Networks , Software , Algorithms , Genetic Databases , Deep Learning , Population Genetics/methods , Genome-Wide Association Study/methods , Humans , Reproducibility of Results , Web Browser
13.
J Biomed Inform ; 143: 104423, 2023 07.
Article in English | MEDLINE | ID: mdl-37308034

ABSTRACT

OBJECTIVE: Genotype imputation is a commonly used technique that infers un-typed variants into a study's genotype data, allowing better identification of causal variants in disease studies. However, due to overrepresentation of Caucasian studies, there's a lack of understanding of genetic basis of health-outcomes in other ethnic populations. Therefore, facilitating imputation of missing key-predictor-variants that can potentially improve a risk health-outcome prediction model, specifically for Asian ancestry, is of utmost relevance. METHODS: We aimed to construct an imputation and analysis web-platform, that primarily facilitates, but is not limited to genotype imputation on East-Asians. The goal is to provide a collaborative imputation platform for researchers in the public domain towards rapidly and efficiently conducting accurate genotype imputation. RESULTS: We present an online genotype imputation platform, Multi-ethnic Imputation System (MI-System) (https://misystem.cgm.ntu.edu.tw/), that offers users 3 established pipelines, SHAPEIT2-IMPUTE2, SHAPEIT4-IMPUTE5, and Beagle5.1 for conducting imputation analyses. In addition to 1000 Genomes and Hapmap3, a new customized Taiwan Biobank (TWB) reference panel, specifically created for Taiwanese-Chinese ancestry is provided. MI-System further offers functions to create customized reference panels to be used for imputation, conduct quality control, split whole genome data into chromosomes, and convert genome builds. CONCLUSION: Users can upload their genotype data and perform imputation with minimum effort and resources. The utility functions further can be utilized to preprocess user uploaded data with easy clicks. MI-System potentially contributes to Asian-population genetics research, while eliminating the requirement for high performing computational resources and bioinformatics expertise. It will enable an increased pace of research and provide a knowledge-base for genetic carriers of complex diseases, therefore greatly enhancing patient-driven research. STATEMENT OF SIGNIFICANCE: Multi-ethnic Imputation System (MI-System), primarily facilitates, but is not limited to, imputation on East-Asians, through 3 established prephasing-imputation pipelines, SHAPEIT2-IMPUTE2, SHAPEIT4-IMPUTE5, and Beagle5.1, where users can upload their genotype data and perform imputation and other utility functions with minimum effort and resources. A new customized Taiwan Biobank (TWB) reference panel, specifically created for Taiwanese-Chinese ancestry is provided. Utility functions include (a) create customized reference panels, (b) conduct quality control, (c) split whole genome data into chromosomes, and (d) convert genome builds. Users can also combine 2 reference panels using the system and use combined panels as reference to conduct imputation using MI-System.


Subjects
Population Genetics , Genome , Humans , Gene Frequency , Genotype , Computers , Genome-Wide Association Study , Single Nucleotide Polymorphism
14.
Twin Res Hum Genet ; 26(1): 10-20, 2023 02.
Article in English | MEDLINE | ID: mdl-36896826

ABSTRACT

Reading difficulties are prevalent worldwide, including in economically developed countries, and are associated with low academic achievement and unemployment. Longitudinal studies have identified several early childhood predictors of reading ability, but studies frequently lack genotype data that would enable testing of predictors with heritable influences. The National Child Development Study (NCDS) is a UK birth cohort study containing direct reading skill variables at every data collection wave from age 7 years through to adulthood with a subsample (final n = 6431) for whom modern genotype data are available. It is one of the longest running UK cohort studies for which genotyped data are currently available and is a rich dataset with excellent potential for future phenotypic and gene-by-environment interaction studies in reading. Here, we carry out imputation of the genotype data to the Haplotype Reference Panel, an updated reference panel that offers greater imputation quality. Guiding phenotype choice, we report a principal components analysis of nine reading variables, yielding a composite measure of reading ability in the genotyped sample. We include recommendations for use of composite scores and the most reliable variables for use during childhood when conducting longitudinal, genetically sensitive analyses of reading ability.


Subjects
Child Development , Cognition , Humans , Preschool Child , Cohort Studies , Genotype , Phenotype , Single Nucleotide Polymorphism
15.
BMC Bioinformatics ; 23(1): 50, 2022 Jan 24.
Article in English | MEDLINE | ID: mdl-35073846

ABSTRACT

BACKGROUND: Imputation of untyped markers is a standard tool in genome-wide association studies to close the gap between directly genotyped and other known DNA variants. However, it is fundamental that genotypes are imputed with high accuracy. Several accuracy measures have been proposed and some are implemented in imputation software, although, unfortunately, these differ across platforms. In the present paper, we introduce Iam hiQ, an independent pair of accuracy measures that can be applied to dosage files, the output of all imputation software. Iam (imputation accuracy measure) quantifies the average amount of individual-specific versus population-specific genotype information in a linear manner. hiQ (heterogeneity in quantities of dosages) addresses the inter-individual heterogeneity between dosages of a marker across the sample at hand. RESULTS: Applying both measures to a large case-control sample of the International Lung Cancer Consortium (ILCCO), comprising 27,065 individuals, we found meaningful thresholds for Iam and hiQ suitable to classify markers of poor accuracy. We demonstrate how Manhattan-like plots and moving averages of Iam and hiQ can be useful to identify regions enriched with less accurately imputed markers, whereas these regions would be missed when applying the accuracy measure info (implemented in IMPUTE2). CONCLUSION: We recommend using Iam hiQ in addition to other accuracy scores for variant filtering before stepping into the analysis of imputed GWAS data.


Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Case-Control Studies , Genotype , Humans , Software
16.
BMC Bioinformatics ; 23(1): 356, 2022 Aug 29.
Article in English | MEDLINE | ID: mdl-36038834

ABSTRACT

BACKGROUND: The decreasing cost of DNA sequencing has led to a great increase in our knowledge about genetic variation. While population-scale projects bring important insight into genotype-phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping-by-arrays is a cost-effective and accurate alternative for genotyping of common and uncommon variants. Imputation methods compare the genotypes of the typed variants with the large population-specific reference panels and estimate the genotypes of untyped variants by making use of the linkage disequilibrium patterns. Most accurate imputation methods are based on the Li-Stephens hidden Markov model, HMM, that treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel. RESULTS: Here we assess the accuracy of vicinity-based HMMs, where each untyped variant is imputed using the typed variants in a small window around itself (as small as 1 centimorgan). Locality-based imputation is used recently by machine learning-based genotype imputation approaches. We assess how the parameters of the vicinity-based HMMs impact the imputation accuracy in a comprehensive set of benchmarks and show that vicinity-based HMMs can accurately impute common and uncommon variants. CONCLUSIONS: Our results indicate that locality-based imputation models can be effectively used for genotype imputation. The parameter settings that we identified can be used in future methods and vicinity-based HMMs can be used for re-structuring and parallelizing new imputation methods. The source code for the vicinity-based HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer .
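As a rough illustration of the Li-Stephens idea mentioned in this abstract (a target haplotype modelled as a mosaic of reference haplotypes), the following Python sketch runs forward-backward over a tiny reference panel and returns the posterior probability of allele 1 at every site. It is a deliberately simplified assumption-laden version: haploid target, a constant switching probability `theta`, a constant mismatch probability `eps`, and no windowing. It is not the LoHaMMer code.

```python
def li_stephens_impute(ref_haps, target, theta=0.01, eps=0.01):
    """Posterior probability of allele 1 at each site for one haploid target.

    ref_haps: list of K reference haplotypes, each a list of 0/1 alleles.
    target:   list of 0/1 alleles, with None at untyped sites.
    theta:    per-interval template switching probability (assumed constant).
    eps:      allele mismatch probability (assumed constant, must be > 0).
    """
    K, M = len(ref_haps), len(target)

    def emit(k, m):
        if target[m] is None:
            return 1.0  # untyped sites carry no information
        return 1.0 - eps if ref_haps[k][m] == target[m] else eps

    def normalise(v):
        s = sum(v)
        return [x / s for x in v]

    # Forward pass; vectors are normalised, so a switch adds theta / K per state.
    fwd = [normalise([emit(k, 0) / K for k in range(K)])]
    for m in range(1, M):
        prev = fwd[-1]
        fwd.append(normalise(
            [((1 - theta) * prev[k] + theta / K) * emit(k, m) for k in range(K)]))

    # Backward pass with the same transition structure.
    bwd = [[1.0] * K for _ in range(M)]
    for m in range(M - 2, -1, -1):
        nxt = [bwd[m + 1][k] * emit(k, m + 1) for k in range(K)]
        jump = theta * sum(nxt) / K
        bwd[m] = normalise([(1 - theta) * nxt[k] + jump for k in range(K)])

    # Combine into state posteriors and sum those of templates carrying allele 1.
    probs = []
    for m in range(M):
        post = normalise([fwd[m][k] * bwd[m][k] for k in range(K)])
        probs.append(sum(post[k] * ref_haps[k][m] for k in range(K)))
    return probs


# Toy panel with two reference haplotypes; the middle site is untyped.
panel = [[0, 0, 0], [1, 1, 1]]
p_ref = li_stephens_impute(panel, [0, None, 0])
p_alt = li_stephens_impute(panel, [1, None, 1])
```

A vicinity-based variant in the spirit of the abstract would call the same routine on only the typed sites within a small window around each untyped site, which is also what makes the computation easy to parallelize.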


Subjects
Single Nucleotide Polymorphism , Software , Genome-Wide Association Study/methods , Genotype , Haplotypes , Linkage Disequilibrium , DNA Sequence Analysis/methods
17.
Anim Biotechnol ; 33(6): 1205-1216, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34010090

ABSTRACT

Genetic analysis of porcine growth and fatness traits is beneficial to the swine industry and provides a reference to understand human obesity. Here, we obtained 29 growth and fatness traits for 473 individuals from a White Duroc × Erhualian F3 intercross population. Basic statistical analyses showed that: (1) Positive correlations between different-stage body weights were detected; the shorter the time interval, the stronger the correlation. (2) Strong correlations existed in the paired fatness traits. (3) The correlation between fatness and body weight increased with age. All pigs were genotyped by Illumina 50 K SNP chips and their whole-genome genotypes were imputed with reference to 109 re-sequencing data. We performed common and imputation-based GWASs for these traits. Two genome-wide significant loci on swine chromosome (SSC) 4 and 7 were repeatedly detected. The strongest association (P = 3.24 × 10⁻¹⁹) was detected at 31.96 Mb on SSC7 for leaf fat weight. On this locus, seven major haplotypes were identified, of which two were novel and had an increasing-fatness effect. In the imputation-based GWAS, three new loci were identified. Our findings provide further insights into and enhance our understanding of the genetic mechanism of porcine growth and fat deposition.


Subjects
Genome-Wide Association Study , Obesity , Quantitative Trait Loci , Animals , Humans , Genotype , Haplotypes/genetics , Phenotype , Quantitative Trait Loci/genetics , Swine/genetics , Obesity/genetics
18.
J Dairy Sci ; 105(4): 3355-3366, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35151474

ABSTRACT

Low-coverage sequencing (LCS) followed by imputation has been proposed as a cost-effective genotyping approach for obtaining genotypes of whole-genome variants. Imputation performance is essential for the effectiveness of this approach. Several imputation methods have been proposed and successfully applied in genomic studies in humans and other species. However, there are few reports on the performance of these methods in livestock. Here, we evaluated a variety of imputation methods, including Beagle v4.1, GeneImp v1.3, GLIMPSE v1.1.0, QUILT v1.0.0, Reveel, and STITCH v1.6.5, with varying sequencing depth, sample size, and reference panel size using LCS data of Holstein cattle. We found that all of these methods, except Reveel, performed well in most cases, with an imputation accuracy over 0.9; on the whole, GLIMPSE, QUILT, and STITCH performed better than the other methods. For species with no reference panel available, STITCH followed by Beagle would be an optimal strategy, whereas for species with a reference panel available, QUILT would be the method of choice. Overall, this study illustrated the promising potential of LCS for genomic analysis in livestock.


Subjects
High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide , Animals , Cattle/genetics , Genomics/methods , Genotype , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/veterinary , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/veterinary
19.
Int J Mol Sci ; 23(21), 2022 Nov 01.
Article in English | MEDLINE | ID: mdl-36362120

ABSTRACT

Total number born (TNB), number of stillborn (NSB), and gestation length (GL) are economically important traits in pig production, and disentangling the molecular mechanisms associated with these traits can provide valuable insights into their genetic structure. Genotype imputation can be used as a practical tool to improve the marker density of single-nucleotide polymorphism (SNP) chips based on sequence data, thereby dramatically improving the power of genome-wide association studies (GWAS). In this study, we applied Beagle software to impute the 50K chip data to the whole-genome sequencing (WGS) data with an average imputation accuracy (R²) of 0.876. The target pigs, 2655 Large White pigs introduced from Canadian and French lines, were genotyped with a GeneSeek Porcine 50K chip. The 30 Large White reference pigs were the key ancestral individuals sequenced by whole-genome resequencing. To avoid population stratification, we identified genetic variants associated with reproductive traits by performing within-population GWAS and cross-population meta-analyses with data before and after imputation. Finally, several genes were detected and regarded as potential candidate genes for each of the traits: for the TNB trait: NOTCH2, KLF3, PLXDC2, NDUFV1, TLR10, CDC14A, EPC2, ORC4, ACVR2A, and GSC; for the NSB trait: NUB1, TGFBR3, ZDHHC14, FGF14, BAIAP2L1, EVI5, TAF1B, and BCAR3; for the GL trait: PPP2R2B, AMBP, MALRD1, HOXA11, and BICC1. In conclusion, expanding the size of the reference population and finding an optimal imputation strategy to ensure that more loci are obtained for GWAS under high imputation accuracy will contribute to the identification of causal mutations in pig breeding.
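The abstract above reports an average imputation accuracy as an R-squared value but does not say how it was computed. One common definition, assumed here purely for illustration, is the squared Pearson correlation between true genotypes (0, 1, or 2 copies of the alternate allele) and imputed dosages at a variant. The function name and the toy data below are hypothetical.

```python
import math


def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns NaN when either vector has zero variance (for example, a
    monomorphic variant), because the correlation is undefined there.
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    # Sums of squares and cross-products around the means.
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return math.nan
    return (cov * cov) / (var_t * var_d)


# Toy data for one variant across five individuals (hypothetical values).
true_genotypes = [0, 1, 2, 1, 0]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.2]
r2 = imputation_r2(true_genotypes, imputed_dosages)
print(round(r2, 3))
```

Averaging this per-variant value over all imputed variants would give a single summary figure of the kind quoted in the abstract, under the stated assumption about the definition.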


Subjects
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Animals , Canada , Genotype , Oligonucleotide Array Sequence Analysis , Phenotype , Swine/genetics
20.
Genet Epidemiol ; 44(6): 537-549, 2020 09.
Article in English | MEDLINE | ID: mdl-32519380

ABSTRACT

A key aim for current genome-wide association studies (GWAS) is to interrogate the full spectrum of genetic variation underlying human traits, including rare variants, across populations. Deep whole-genome sequencing is the gold standard to fully capture genetic variation, but remains prohibitively expensive for large sample sizes. Array genotyping interrogates a sparser set of variants, which can be used as a scaffold for genotype imputation to capture a wider set of variants. However, imputation quality depends crucially on reference panel size and genetic distance from the target population. Here, we consider sequencing a subset of GWAS participants and imputing the rest using a reference panel that includes both sequenced GWAS participants and an external reference panel. We investigate how imputation quality and GWAS power are affected by the number of participants sequenced for admixed populations (African and Latino Americans) and European population isolates (Sardinians and Finns), and identify powerful, cost-effective GWAS designs given current sequencing and array costs. For populations that are well-represented in existing reference panels, we find that array genotyping alone is cost-effective and well-powered to detect common- and rare-variant associations. For poorly represented populations, sequencing a subset of participants is often most cost-effective, and can substantially increase imputation quality and GWAS power.


Subjects
Genome, Human , Genome-Wide Association Study , Whole Genome Sequencing , Cost-Benefit Analysis , Gene Frequency/genetics , Genome-Wide Association Study/economics , Genotype , Humans , Phenotype , Polymorphism, Single Nucleotide/genetics , Whole Genome Sequencing/economics