Search | BVS Enfermagem Research Portal

1.

Improving variant calling using population data and deep learning.

Chen, Nae-Chyun; Kolesnikov, Alexey; Goel, Sidharth; Yun, Taedong; Chang, Pi-Chuan; Carroll, Andrew.

BMC Bioinformatics ; 24(1): 197, 2023 May 12.

Article in English | MEDLINE | ID: mdl-37173615

ABSTRACT

Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. This model reduces variant calling errors, improving both precision and recall in single samples, and reduces rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels, finding the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.

Subjects

Deep Learning , Humans , Gene Frequency , Whole Genome Sequencing , Genome-Wide Association Study , Human Genome , Single Nucleotide Polymorphism , High-Throughput Nucleotide Sequencing
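The trade-off mentioned in the abstract above, where filtering gains precision at the cost of recall, can be made concrete with the standard definitions of the two metrics. The sketch below is illustrative only: the function name and all counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: an unfiltered call set versus a filtered one.
# Filtering removes false positives (fp drops) but also discards some
# true variants (tp drops, fn rises).
unfiltered = precision_recall_f1(tp=980, fp=40, fn=20)
filtered = precision_recall_f1(tp=950, fp=10, fn=50)
```

With these made-up counts the filtered set has higher precision and lower recall, which is the pattern the abstract describes for filter-only approaches.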

2.

Accurate, scalable cohort variant calls using DeepVariant and GLnexus.

Yun, Taedong; Li, Helen; Chang, Pi-Chuan; Lin, Michael F; Carroll, Andrew; McLean, Cory Y.

Bioinformatics ; 36(24): 5582-5589, 2021 Apr 05.

Article in English | MEDLINE | ID: mdl-33399819

ABSTRACT

MOTIVATION: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. RESULTS: We introduce an open-source cohort-calling method that uses the highly accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimize the method across a range of cohort sizes, sequencing methods and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently generated GATK Best Practices pipeline. AVAILABILITY AND IMPLEMENTATION: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-source, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
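One of the quality metrics named in the abstract above is Mendelian consistency in father-mother-child trios. A minimal sketch of such a check follows, assuming diploid autosomal sites and unphased genotypes written as tuples of allele indices (0 for the reference allele). The function names are hypothetical, and the sketch ignores missing calls, sex chromosomes and de novo mutations; it is not the pipeline's own implementation.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """True if the child's unphased genotype can be formed by taking
    one allele from the father and one from the mother."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f, m))) == child_sorted
        for f, m in product(father, mother)
    )


def mendelian_violation_rate(trios):
    """Fraction of (father, mother, child) genotype triples that violate
    Mendelian inheritance; returns 0.0 for an empty input."""
    if not trios:
        return 0.0
    violations = sum(
        1 for f, m, c in trios if not is_mendelian_consistent(f, m, c)
    )
    return violations / len(trios)


# Hypothetical genotypes at three sites of one trio.
example_trios = [
    ((0, 1), (0, 0), (0, 1)),  # consistent
    ((0, 0), (0, 0), (0, 1)),  # violation: allele 1 absent in both parents
    ((1, 1), (0, 0), (1, 1)),  # violation: mother cannot transmit allele 1
]
rate = mendelian_violation_rate(example_trios)
```

A lower violation rate across many sites suggests fewer genotyping errors, which is why the metric is useful when no truth set exists for the samples.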

3.

Utilizing multimodal AI to improve genetic analyses of cardiovascular traits.

Zhou, Yuchen; Cosentino, Justin; Yun, Taedong; Biradar, Mahantesh I; Shreibati, Jacqueline; Lai, Dongbing; Schwantes-An, Tae-Hwi; Luben, Robert; McCaw, Zachary; Engmann, Jorgen; Providencia, Rui; Schmidt, Amand Floriaan; Munroe, Patricia; Yang, Howard; Carroll, Andrew; Khawaja, Anthony P; McLean, Cory Y; Behsaz, Babak; Hormozdiari, Farhad.

medRxiv ; 2024 Mar 20.

Article in English | MEDLINE | ID: mdl-38562791

ABSTRACT

Electronic health records, biobanks, and wearable biosensors contain multiple high-dimensional clinical data (HDCD) modalities (e.g., ECG, photoplethysmography (PPG), and MRI) for each individual. Access to multimodal HDCD provides a unique opportunity for genetic studies of complex traits because different modalities relevant to a single physiological system (e.g., circulatory system) encode complementary and overlapping information. We propose a novel multimodal deep learning method, M-REGLE, for discovering genetic associations from a joint representation of multiple complementary HDCD modalities. We showcase the effectiveness of this model by applying it to several cardiovascular modalities. M-REGLE jointly learns a lower-dimensional representation (i.e., latent factors) of multimodal HDCD using a convolutional variational autoencoder, performs genome-wide association studies (GWAS) on each latent factor, then combines the results to study the genetics of the underlying system. To validate the advantages of M-REGLE and multimodal learning, we apply it to common cardiovascular modalities (PPG and ECG), and compare its results to unimodal learning methods in which representations are learned from each data modality separately, but the downstream genetic analyses are performed on the combined unimodal representations. M-REGLE identifies 19.3% more loci on the 12-lead ECG dataset and 13.0% more loci on the ECG lead I + PPG dataset, and its genetic risk score significantly outperforms the unimodal risk score at predicting cardiac phenotypes, such as atrial fibrillation (Afib), in multiple biobanks.

4.

Deep Learning Utilizing Suboptimal Spirometry Data to Improve Lung Function and Mortality Prediction in the UK Biobank.

Hill, Davin; Torop, Max; Masoomi, Aria; Castaldi, Peter J; Silverman, Edwin K; Bodduluri, Sandeep; Bhatt, Surya P; Yun, Taedong; McLean, Cory Y; Hormozdiari, Farhad; Dy, Jennifer; Cho, Michael H; Hobbs, Brian D.

medRxiv ; 2023 Apr 29.

Article in English | MEDLINE | ID: mdl-37162978

ABSTRACT

Background: Spirometry measures lung function by selecting the best of multiple efforts meeting pre-specified quality control (QC), and reporting two key metrics: forced expiratory volume in 1 second (FEV1) and forced vital capacity (FVC). We hypothesize that discarded submaximal and QC-failing data meaningfully contribute to the prediction of airflow obstruction and all-cause mortality. Methods: We evaluated volume-time spirometry data from the UK Biobank. We identified "best" spirometry efforts as those passing QC with the maximum FVC. "Discarded" efforts were either submaximal or failed QC. To create a combined representation of lung function we implemented a contrastive learning approach, Spirogram-based Contrastive Learning Framework (Spiro-CLF), which utilized all recorded volume-time curves per participant and applied different transformations (e.g. flow-volume, flow-time). In a held-out 20% testing subset we applied the Spiro-CLF representation of a participant's overall lung function to 1) binary predictions of FEV1/FVC < 0.7 and FEV1 Percent Predicted (FEV1PP) < 80%, indicative of airflow obstruction, and 2) Cox regression for all-cause mortality. Findings: We included 940,705 volume-time curves from 352,684 UK Biobank participants with 2-3 spirometry efforts per individual (66.7% with 3 efforts) and at least one QC-passing spirometry effort. Of all spirometry efforts, 24.1% failed QC and 37.5% were submaximal. Spiro-CLF prediction of FEV1/FVC < 0.7 utilizing discarded spirometry efforts had an Area under the Receiver Operating Characteristics (AUROC) of 0.981 (0.863 for FEV1PP prediction). Incorporating discarded spirometry efforts in all-cause mortality prediction was associated with a concordance index (c-index) of 0.654, which exceeded the c-indices from FEV1 (0.590), FVC (0.559), or FEV1/FVC (0.599) from each participant's single best effort. 
Interpretation: A contrastive learning model using raw spirometry curves can accurately predict lung function using submaximal and QC-failing efforts. This model also has superior prediction of all-cause mortality compared to standard lung function measurements. Funding: MHC is supported by NIH R01HL137927, R01HL135142, HL147148, and HL089856. BDH is supported by NIH K08HL136928, U01 HL089856, and an Alpha-1 Foundation Research Grant. DH is supported by NIH 2T32HL007427-41. EKS is supported by NIH R01 HL152728, R01 HL147148, U01 HL089856, R01 HL133135, P01 HL132825, and P01 HL114501. PJC is supported by NIH R01HL124233 and R01HL147326. SPB is supported by NIH R01HL151421 and UH3HL155806. TY, FH, and CYM are employees of Google LLC.
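The concordance index (c-index) reported in the findings above measures how well a risk score orders participants by survival time. A minimal sketch of Harrell's c-index is given below, assuming right-censored data; the function name and the toy data are hypothetical, pairs with tied times are simply skipped, and the study's own computation may differ in its tie handling.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored survival data.

    times       -- follow-up time per participant
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when participant i had an observed
            # event strictly before participant j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third participant is censored at time 3.
toy_times = [1, 2, 3]
toy_events = [1, 1, 0]
perfect = concordance_index(toy_times, toy_events, [3, 2, 1])
reversed_order = concordance_index(toy_times, toy_events, [1, 2, 3])
uninformative = concordance_index(toy_times, toy_events, [1, 1, 1])
```

A value of 0.5 corresponds to an uninformative score and 1.0 to perfect ordering, which gives context for the reported 0.654 versus 0.559 to 0.599 for the single-effort measurements.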

5.

DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer.

Baid, Gunjan; Cook, Daniel E; Shafin, Kishwar; Yun, Taedong; Llinares-López, Felipe; Berthet, Quentin; Belyaeva, Anastasiya; Töpfer, Armin; Wenger, Aaron M; Rowell, William J; Yang, Howard; Kolesnikov, Alexey; Ammar, Waleed; Vert, Jean-Philippe; Vaswani, Ashish; McLean, Cory Y; Nattestad, Maria; Chang, Pi-Chuan; Carroll, Andrew.

Nat Biotechnol ; 41(2): 232-238, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36050551

ABSTRACT

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10-25 kilobases), accurate 'HiFi' reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer-encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.

Subjects

High-Throughput Nucleotide Sequencing , DNA Sequence Analysis
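The Q20, Q30, Q40, Q43 and Q45 figures in the abstract above are Phred-scaled quality values, where Q = -10 * log10(p) and p is the error probability. The short sketch below converts between the two scales; the function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred-scaled quality value Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality value for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Error probabilities for the read-quality thresholds named in the abstract.
thresholds = {q: phred_to_error_prob(q) for q in (20, 30, 40)}
```

Under this scale Q20 corresponds to roughly 1 error in 100 bases, Q30 to 1 in 1,000 and Q40 to 1 in 10,000, so each step of 10 is a tenfold reduction in error probability.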

6.

Unsupervised representation learning improves genomic discovery and risk prediction for respiratory and circulatory functions and diseases.

Yun, Taedong; Cosentino, Justin; Behsaz, Babak; McCaw, Zachary R; Hill, Davin; Luben, Robert; Lai, Dongbing; Bates, John; Yang, Howard; Schwantes-An, Tae-Hwi; Zhou, Yuchen; Khawaja, Anthony P; Carroll, Andrew; Hobbs, Brian D; Cho, Michael H; McLean, Cory Y; Hormozdiari, Farhad.

medRxiv ; 2023 Aug 29.

Article in English | MEDLINE | ID: mdl-37163049

ABSTRACT

High-dimensional clinical data are becoming more accessible in biobank-scale datasets. However, effectively utilizing high-dimensional clinical data for genetic discovery remains challenging. Here we introduce a general deep learning-based framework, REpresentation learning for Genetic discovery on Low-dimensional Embeddings (REGLE), for discovering associations between genetic variants and high-dimensional clinical data. REGLE uses convolutional variational autoencoders to compute a non-linear, low-dimensional, disentangled embedding of the data with highly heritable individual components. REGLE can incorporate expert-defined or clinical features and provides a framework to create accurate disease-specific polygenic risk scores (PRS) in datasets which have minimal expert phenotyping. We apply REGLE to both respiratory and circulatory systems: spirograms which measure lung function and photoplethysmograms (PPG) which measure blood volume changes. Genome-wide association studies on REGLE embeddings identify more genome-wide significant loci than existing methods and replicate known loci for both spirograms and PPG, demonstrating the generality of the framework. Furthermore, these embeddings are associated with overall survival. Finally, we construct a set of PRSs that improve predictive performance of asthma, chronic obstructive pulmonary disease, hypertension, and systolic blood pressure in multiple biobanks. Thus, REGLE embeddings can quantify clinically relevant features that are not currently captured in a standardized or automated way.
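The abstract above refers to polygenic risk scores (PRS) without describing how they are built. As background only, the sketch below shows the generic additive form of a PRS: a weighted sum of allele dosages, with weights typically taken from GWAS effect sizes. The function name, dosages and weights are hypothetical, and REGLE's own PRS construction may differ.

```python
def polygenic_risk_score(dosages, weights):
    """Generic additive PRS: sum over variants of dosage * effect weight.

    dosages -- count of effect alleles per variant (0, 1 or 2, or a
               fractional imputed dosage)
    weights -- per-variant effect sizes, in the same variant order
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical individual with three variants.
score = polygenic_risk_score([0, 1, 2], [0.5, -0.2, 0.1])
```

In practice such scores are usually standardized within a cohort before being compared across individuals.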

7.

DeepNull models non-linear covariate effects to improve phenotypic prediction and association power.

McCaw, Zachary R; Colthurst, Thomas; Yun, Taedong; Furlotte, Nicholas A; Carroll, Andrew; Alipanahi, Babak; McLean, Cory Y; Hormozdiari, Farhad.

Nat Commun ; 13(1): 241, 2022 Jan 11.

Article in English | MEDLINE | ID: mdl-35017556

ABSTRACT

Genome-wide association studies (GWASs) examine the association between genotype and phenotype while adjusting for a set of covariates. Although the covariates may have non-linear or interactive effects, due to the challenge of specifying the model, GWAS often neglect such terms. Here we introduce DeepNull, a method that identifies and adjusts for non-linear and interactive covariate effects using a deep neural network. In analyses of simulated and real data, we demonstrate that DeepNull maintains tight control of the type I error while increasing statistical power by up to 20% in the presence of non-linear and interactive effects. Moreover, in the absence of such effects, DeepNull incurs no loss of power. When applied to 10 phenotypes from the UK Biobank (n = 370K), DeepNull discovered more hits (+6%) and loci (+7%), on average, than conventional association analyses, many of which are biologically plausible or have previously been reported. Finally, DeepNull improves upon linear modeling for phenotypic prediction (+23% on average).

Subjects

Genome-Wide Association Study/methods , Phenotype , Computer Simulation , Linear Models , Research Design

8.

A population-specific reference panel for improved genotype imputation in African Americans.

O'Connell, Jared; Yun, Taedong; Moreno, Meghan; Li, Helen; Litterman, Nadia; Kolesnikov, Alexey; Noblin, Elizabeth; Chang, Pi-Chuan; Shastri, Anjali; Dorfman, Elizabeth H; Shringarpure, Suyash; Auton, Adam; Carroll, Andrew; McLean, Cory Y.

Commun Biol ; 4(1): 1269, 2021 Nov 05.

Article in English | MEDLINE | ID: mdl-34741098

ABSTRACT

There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.

Subjects

Human Genome , Genotype , Adult , Black or African American , Aged , Aged 80 and over , Humans , Middle Aged , United States , Whole Genome Sequencing , Young Adult
