Search | BVS (Virtual Health Library) - Ministry of Health

Deep Learning Sequence Models for Transcriptional Regulation.

Sokolova, Ksenia; Chen, Kathleen M; Hao, Yun; Zhou, Jian; Troyanskaya, Olga G.

Annu Rev Genomics Hum Genet ; 2024 Apr 09.

Article in English | MEDLINE | ID: mdl-38594933

ABSTRACT

Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.

A sequence-based global map of regulatory activity for deciphering human genetics.

Chen, Kathleen M; Wong, Aaron K; Troyanskaya, Olga G; Zhou, Jian.

Nat Genet ; 54(7): 940-949, 2022 07.

Article in English | MEDLINE | ID: mdl-35817977

ABSTRACT

Epigenomic profiling has enabled large-scale identification of regulatory elements, yet we still lack a systematic mapping from any sequence or variant to regulatory activities. We address this challenge with Sei, a framework for integrating human genetics data with sequence information to discover the regulatory basis of traits and diseases. Sei learns a vocabulary of regulatory activities, called sequence classes, using a deep learning model that predicts 21,907 chromatin profiles across >1,300 cell lines and tissues. Sequence classes provide a global classification and quantification of sequence and variant effects based on diverse regulatory activities, such as cell type-specific enhancer functions. These predictions are supported by tissue-specific expression, expression quantitative trait loci and evolutionary constraint data. Furthermore, sequence classes enable characterization of the tissue-specific, regulatory architecture of complex traits and generate mechanistic hypotheses for individual regulatory pathogenic mutations. We provide Sei as a resource to elucidate the regulatory basis of human health and disease.

Subjects

Quantitative Trait Loci; Regulatory Sequences, Nucleic Acid; Chromatin/genetics; Epigenomics; Human Genetics; Humans; Quantitative Trait Loci/genetics; Regulatory Sequences, Nucleic Acid/genetics

Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk.

Park, Christopher Y; Zhou, Jian; Wong, Aaron K; Chen, Kathleen M; Theesfeld, Chandra L; Darnell, Robert B; Troyanskaya, Olga G.

Nat Genet ; 53(2): 166-173, 2021 02.

Article in English | MEDLINE | ID: mdl-33462483

ABSTRACT

Despite the strong genetic basis of psychiatric disorders, the underlying molecular mechanisms are largely unmapped. RNA-binding proteins (RBPs) are responsible for most post-transcriptional regulation, from splicing to translation to localization. RBPs thus act as key gatekeepers of cellular homeostasis, especially in the brain. However, quantifying the pathogenic contribution of noncoding variants impacting RBP target sites is challenging. Here, we leverage a deep learning approach that can accurately predict the RBP target site dysregulation effects of mutations and discover that RBP dysregulation is a principal contributor to psychiatric disorder risk. RBP dysregulation explains a substantial amount of heritability not captured by large-scale molecular quantitative trait loci studies and has a stronger impact than common coding region variants. We share the genome-wide profiles of RBP dysregulation, which we use to identify DDHD2 as a candidate schizophrenia risk gene. This resource provides a new analytical framework to connect the full range of RNA regulation to complex disease.

Subjects

Mental Disorders/genetics; Phospholipases/genetics; RNA-Binding Proteins/genetics; 3' Untranslated Regions; Deep Learning; Gene Expression Regulation; Gene Frequency; Genetic Predisposition to Disease; Genome-Wide Association Study; Humans; Mutation; Nuclear Factor 90 Proteins/genetics; Peptide Elongation Factors/genetics; Polymorphism, Single Nucleotide; Quantitative Trait Loci; RNA Helicases/genetics; RNA Processing, Post-Transcriptional; Ribonucleoprotein, U5 Small Nuclear/genetics; Schizophrenia/genetics; Trans-Activators/genetics

Genomic analyses implicate noncoding de novo variants in congenital heart disease.

Richter, Felix; Morton, Sarah U; Kim, Seong Won; Kitaygorodsky, Alexander; Wasson, Lauren K; Chen, Kathleen M; Zhou, Jian; Qi, Hongjian; Patel, Nihir; DePalma, Steven R; Parfenov, Michael; Homsy, Jason; Gorham, Joshua M; Manheimer, Kathryn B; Velinder, Matthew; Farrell, Andrew; Marth, Gabor; Schadt, Eric E; Kaltman, Jonathan R; Newburger, Jane W; Giardini, Alessandro; Goldmuntz, Elizabeth; Brueckner, Martina; Kim, Richard; Porter, George A; Bernstein, Daniel; Chung, Wendy K; Srivastava, Deepak; Tristani-Firouzi, Martin; Troyanskaya, Olga G; Dickel, Diane E; Shen, Yufeng; Seidman, Jonathan G; Seidman, Christine E; Gelb, Bruce D.

Nat Genet ; 52(8): 769-777, 2020 08.

Article in English | MEDLINE | ID: mdl-32601476

ABSTRACT

A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; P = 8.7 × 10⁻⁴). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, P = 1 × 10⁻⁵). We observed significant overlap between these transcription-based approaches (odds ratio (OR) = 2.5, 95% confidence interval (CI) 1.1-5.0, P = 5.4 × 10⁻³). CHD DNVs altered transcription levels in 5 of 31 enhancers assayed. Finally, we observed a DNV burden in RNA-binding-protein regulatory sites (OR = 1.13, 95% CI 1.1-1.2, P = 8.8 × 10⁻⁵). Our findings demonstrate an enrichment of potentially disruptive regulatory noncoding DNVs in a fraction of CHD at least as high as that observed for damaging coding DNVs.

Subjects

Genetic Variation/genetics; Heart Defects, Congenital/genetics; RNA, Untranslated/genetics; Adolescent; Adult; Animals; Female; Genetic Predisposition to Disease/genetics; Genomics; Heart/physiology; Humans; Male; Mice; Middle Aged; Open Reading Frames/genetics; RNA-Binding Proteins/genetics; Transcription, Genetic/genetics; Young Adult
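
The congenital heart disease abstract above reports odds ratios with 95% confidence intervals. As a reading aid, the sketch below shows one common way such an interval can be computed from a 2x2 table (the Woolf, or log-Wald, method). The abstract does not state which method the authors used, so this is an assumption, and the counts in the example are hypothetical values chosen only for illustration, not the study's data.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-Wald) confidence interval for a 2x2 table.

    a: cases with the feature      b: cases without it
    c: controls with the feature   d: controls without it
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Woolf interval")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical counts for illustration only (not taken from the study).
estimate, lower, upper = odds_ratio_with_ci(30, 70, 15, 85)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

An interval whose lower bound stays above 1 is what supports a statement of enrichment; with small cell counts an exact method would usually be preferred over this approximation.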

Selene: a PyTorch-based deep learning library for sequence data.

Chen, Kathleen M; Cofer, Evan M; Zhou, Jian; Troyanskaya, Olga G.

Nat Methods ; 16(4): 315-318, 2019 04.

Article in English | MEDLINE | ID: mdl-30923381

ABSTRACT

To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequence data. We demonstrate on DNA sequences how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Subjects

Computational Biology/methods; Deep Learning; Neural Networks, Computer; Sequence Analysis, DNA; Algorithms; Alzheimer Disease/metabolism; Area Under Curve; Gene Library; Genomics; Humans; Models, Statistical; Mutagenesis; Mutation; Normal Distribution; Programming Languages; Software
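
Sequence-based deep learning models of the kind described in the Selene abstract above generally take DNA as a numeric matrix rather than as text. The sketch below shows the usual one-hot encoding in plain Python. It is not Selene's API: the function name, the A/C/G/T column order and the all-zero handling of unknown bases are assumptions made for illustration, and a real PyTorch workflow would convert the result to a tensor.

```python
# Column order is an assumption; libraries differ in how they order the bases.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(sequence):
    """Return a len(sequence) x 4 matrix of floats.

    Each row has a single 1.0 in the column of its base. Characters outside
    A/C/G/T (for example N) are encoded as an all-zero row.
    """
    encoded = []
    for base in sequence.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        index = BASE_INDEX.get(base)
        if index is not None:
            row[index] = 1.0
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", matrix):
    print(base, row)
```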

PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia.

Chen, Kathleen M; Tan, Jie; Way, Gregory P; Doing, Georgia; Hogan, Deborah A; Greene, Casey S.

BioData Min ; 11: 14, 2018.

Article in English | MEDLINE | ID: mdl-29988723

ABSTRACT

BACKGROUND: Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies. RESULTS: We developed the PathCORE-T framework by implementing existing methods to identify pathway-pathway transcriptional relationships evident across a broad data compendium. PathCORE-T is applied to the output of feature construction algorithms; it identifies pairs of pathways observed in features more than expected by chance as functionally co-occurring. We demonstrate PathCORE-T by analyzing an existing eADAGE model of a microbial compendium and building and analyzing NMF features from the TCGA dataset of 33 cancer types. The PathCORE-T framework includes a demonstration web interface, with source code, that users can launch to (1) visualize the network and (2) review the expression levels of associated genes in the original data. PathCORE-T creates and displays the network of globally co-occurring pathways based on features observed in a machine learning analysis of gene expression data. CONCLUSIONS: The PathCORE-T framework identifies transcriptionally co-occurring pathways from the results of unsupervised analysis of gene expression data and visualizes the relationships between pathways as a network. PathCORE-T recapitulated previously described pathway-pathway relationships and suggested experimentally testable additional hypotheses that remain to be explored.
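
The abstract above describes flagging pathway pairs that appear together in features "more than expected by chance". One generic way to make that idea concrete is a permutation test, sketched below. This is not PathCORE-T's implementation: the null model (reassigning each pathway to random features while preserving how many features it appears in), the function names and the toy input are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def pair_counts(features):
    """Count how many features contain each unordered pathway pair."""
    counts = Counter()
    for pathways in features:
        for pair in combinations(sorted(pathways), 2):
            counts[pair] += 1
    return counts


def cooccurrence_pvalues(features, n_permutations=1000, seed=0):
    """Empirical p-value for each observed pathway pair.

    Null model (an assumption): each pathway keeps its number of features
    but is placed in randomly chosen features.
    """
    features = [set(f) for f in features]
    rng = random.Random(seed)
    observed = pair_counts(features)
    n_features = len(features)
    frequency = Counter(p for pathways in features for p in pathways)
    exceed = Counter()
    for _ in range(n_permutations):
        shuffled = [set() for _ in range(n_features)]
        # Sorted iteration keeps results reproducible for a given seed.
        for pathway in sorted(frequency):
            for idx in rng.sample(range(n_features), frequency[pathway]):
                shuffled[idx].add(pathway)
        null = pair_counts(shuffled)
        for pair, count in observed.items():
            if null[pair] >= count:
                exceed[pair] += 1
    # Add-one correction so that no p-value is exactly zero.
    return {pair: (1 + exceed[pair]) / (1 + n_permutations) for pair in observed}


# Toy input: each set lists the hypothetical pathways annotated to one feature.
toy_features = [{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"C"}, {"C", "D"}, {"D"}]
for pair, p in sorted(cooccurrence_pvalues(toy_features).items()):
    print(pair, round(p, 3))
```

Pairs with small p-values would become edges in a co-occurrence network; with many pairs tested, a multiple-testing correction would normally follow.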

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk.

Zhou, Jian; Theesfeld, Chandra L; Yao, Kevin; Chen, Kathleen M; Wong, Aaron K; Troyanskaya, Olga G.

Nat Genet ; 50(8): 1171-1179, 2018 08.

Article in English | MEDLINE | ID: mdl-30013180

ABSTRACT

Key challenges for human genetics, precision medicine and evolutionary biology include deciphering the regulatory code of gene expression and understanding the transcriptional effects of genome variation. However, this is extremely difficult because of the enormous scale of the noncoding mutation space. We developed a deep learning-based framework, ExPecto, that can accurately predict, ab initio from a DNA sequence, the tissue-specific transcriptional effects of mutations, including those that are rare or that have not been observed. We prioritized causal variants within disease- or trait-associated loci from all publicly available genome-wide association studies and experimentally validated predictions for four immune-related diseases. By exploiting the scalability of ExPecto, we characterized the regulatory mutation space for human RNA polymerase II-transcribed genes by in silico saturation mutagenesis and profiled >140 million promoter-proximal mutations. This enables probing of evolutionary constraints on gene expression and ab initio prediction of mutation disease effects, making ExPecto an end-to-end computational framework for the in silico prediction of expression and disease risk.

Subjects

Deep Learning; Genetic Predisposition to Disease; Genome-Wide Association Study/methods; Mutation; Algorithms; Computer Simulation; Gene Expression; Humans; Models, Genetic; Polymorphism, Single Nucleotide; Promoter Regions, Genetic; Quantitative Trait Loci/genetics
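
The ExPecto abstract above mentions in silico saturation mutagenesis: scoring every possible single-nucleotide substitution in a sequence with a predictive model. The sketch below shows the general loop in plain Python. It is not ExPecto's code; `toy_score` is a hypothetical stand-in for a trained model, and the effect definition (mutant score minus reference score) is an assumption made for illustration.

```python
BASES = "ACGT"


def saturation_mutagenesis(sequence, score):
    """Score every single-nucleotide substitution of `sequence`.

    `score` is any callable mapping a DNA string to a number; it stands in
    for a trained model's prediction. Returns a list of
    (position, ref_base, alt_base, effect) tuples, where
    effect = score(mutant) - score(reference).
    """
    sequence = sequence.upper()
    reference_score = score(sequence)
    effects = []
    for pos, ref in enumerate(sequence):
        for alt in BASES:
            if alt == ref:
                continue
            mutant = sequence[:pos] + alt + sequence[pos + 1:]
            effects.append((pos, ref, alt, score(mutant) - reference_score))
    return effects


def toy_score(seq):
    """Hypothetical stand-in model: counts occurrences of the motif TATA."""
    return sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))


effects = saturation_mutagenesis("GGTATAGG", toy_score)
for pos, ref, alt, effect in effects:
    if effect != 0:
        print(f"{ref}{pos}{alt}: {effect:+d}")
```

A sequence of length n yields 3n substitutions, which is why a fast model is needed to profile mutations at genome scale.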

Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks.

Tan, Jie; Doing, Georgia; Lewis, Kimberley A; Price, Courtney E; Chen, Kathleen M; Cady, Kyle C; Perchuk, Barret; Laub, Michael T; Hogan, Deborah A; Greene, Casey S.

Cell Syst ; 5(1): 63-71.e6, 2017 07 26.

Article in English | MEDLINE | ID: mdl-28711280

ABSTRACT

Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with denoising autoencoder neural networks, can identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a compendium of Pseudomonas aeruginosa gene expression profiling experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB in media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for this PhoB activation. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.

Subjects

Bacterial Proteins/metabolism; Neural Networks, Computer; Pseudomonas aeruginosa/physiology; Gene Expression Profiling; Gene Expression Regulation; Health Knowledge, Attitudes, Practice; Humans; Information Storage and Retrieval/trends; Public Sector; Starvation; Systems Integration; Transcriptome
