ABSTRACT
As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
Subjects
Deep Learning; Genomics/methods; Models, Genetic; Neural Networks, Computer; Base Sequence; Computer Simulation; Humans; Supervised Machine Learning; Unsupervised Machine Learning
ABSTRACT
How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.
Subjects
DNA/genetics; Databases, Genetic; Epigenesis, Genetic; Gene Expression Regulation; Machine Learning; Nerve Net; Animals; Cell Line; Genome; Genomics/methods; Humans; Mice; Quantitative Trait Loci
ABSTRACT
The 5' untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5'UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants.
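The abstract does not spell out how frame pooling works, so the following is only a minimal sketch of one plausible reading: per-position feature vectors (for example, convolution outputs) are grouped by reading frame, and each group is summarized by a global maximum and a global mean, which gives a fixed-size vector for any sequence length. The function name `frame_pool`, the choice to count frames from the 3' end, and the max-plus-mean summary are all assumptions made for illustration, not the published implementation.

```python
def frame_pool(features):
    """Pool per-position features into a fixed-length vector.

    `features` is a non-empty list over sequence positions (5' to 3'),
    each entry a list of per-channel activations. Positions are grouped
    into three frames by their distance to the 3' end (an assumption),
    and each frame is summarized per channel by its max and its mean.
    """
    length = len(features)
    n_channels = len(features[0])
    pooled = []
    for frame in range(3):
        # Positions whose distance to the last position is `frame` modulo 3.
        rows = [features[i] for i in range(length)
                if (length - 1 - i) % 3 == frame]
        for c in range(n_channels):
            column = [row[c] for row in rows]
            # Frames with no positions (very short inputs) contribute zeros.
            pooled.append(max(column) if column else 0.0)
            pooled.append(sum(column) / len(column) if column else 0.0)
    return pooled


# Two inputs of different length yield vectors of identical size.
short = [[0.1, 0.5]] * 4
long_ = [[0.2, 0.3]] * 25
print(len(frame_pool(short)), len(frame_pool(long_)))
```

The output always has 3 frames x 2 statistics x `n_channels` entries, which is what lets a downstream dense layer accept 5'UTRs of arbitrary length.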
Subjects
5' Untranslated Regions; Deep Learning; Ribosomes/metabolism; Humans; RNA, Messenger/genetics
ABSTRACT
RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.
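The statistical step described in this abstract (negative binomial expectations, significance-based outlier calls, false-discovery-rate adjustment) can be illustrated with a small sketch. It is not OUTRIDER's code: the expected count `mu` is simply passed in (the abstract obtains it from an autoencoder), the dispersion `theta` is a made-up value, and the two-sided p-value form and the Benjamini-Hochberg procedure are assumptions chosen for illustration. All function names are hypothetical.

```python
import math


def nb_logpmf(k, mu, theta):
    """Log-probability of count k under a negative binomial distribution
    with mean mu > 0 and dispersion (size) parameter theta > 0."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def nb_two_sided_pvalue(k, mu, theta):
    """Two-sided p-value 2 * min(0.5, P(X <= k), P(X >= k)) (assumed form)."""
    below = sum(math.exp(nb_logpmf(i, mu, theta)) for i in range(k))  # P(X < k)
    lower = below + math.exp(nb_logpmf(k, mu, theta))                 # P(X <= k)
    upper = max(0.0, 1.0 - below)                                     # P(X >= k)
    return 2.0 * min(0.5, lower, upper)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene: expected count 100, dispersion 10, observed counts per sample.
observed = [95, 110, 0, 130, 88]
pvals = [nb_two_sided_pvalue(k, mu=100.0, theta=10.0) for k in observed]
padj = benjamini_hochberg(pvals)
outliers = [k for k, q in zip(observed, padj) if q < 0.05]
print(outliers)
```

Calling outliers on adjusted p-values rather than on a raw fold-change cutoff is the point the abstract makes about significance-based thresholds; the 0.05 cutoff here is only an example.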
Subjects
Gene Expression/genetics; Genetic Variation/genetics; RNA/metabolism; Sequence Analysis, RNA/methods; Algorithms; Gene Expression Profiling/methods; High-Throughput Nucleotide Sequencing/methods; Humans
ABSTRACT
MDH2 encodes mitochondrial malate dehydrogenase (MDH), which is essential for the conversion of malate to oxaloacetate as part of the proper functioning of the Krebs cycle. We report bi-allelic pathogenic mutations in MDH2 in three unrelated subjects presenting with early-onset generalized hypotonia, psychomotor delay, refractory epilepsy, and elevated lactate in the blood and cerebrospinal fluid. Functional studies in fibroblasts from affected subjects showed both an apparently complete loss of MDH2 levels and MDH2 enzymatic activity close to null. Metabolomics analyses demonstrated a significant concomitant accumulation of the MDH substrate, malate, and fumarate, its immediate precursor in the Krebs cycle, in affected subjects' fibroblasts. Lentiviral complementation with wild-type MDH2 cDNA restored MDH2 levels and mitochondrial MDH activity. Additionally, introduction of the three missense mutations from the affected subjects into Saccharomyces cerevisiae provided functional evidence to support their pathogenicity. Disruption of the Krebs cycle is a hallmark of cancer, and MDH2 has been recently identified as a novel pheochromocytoma and paraganglioma susceptibility gene. We show that loss-of-function mutations in MDH2 are also associated with severe neurological clinical presentations in children.
Subjects
Brain Diseases/genetics; Citric Acid Cycle; Malate Dehydrogenase/genetics; Mutation; Age of Onset; Alleles; Amino Acid Sequence; Child; Child, Preschool; Citric Acid Cycle/genetics; Fibroblasts/enzymology; Fibroblasts/metabolism; Fumarates/metabolism; Genetic Complementation Test; Humans; Infant; Infant, Newborn; Malate Dehydrogenase/chemistry; Malate Dehydrogenase/metabolism; Malates/metabolism; Male; Metabolomics; Models, Molecular
ABSTRACT
Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex-seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.
Subjects
Computational Biology/methods; Genetic Variation; RNA Splicing; Congresses as Topic; Exons; Genetic Predisposition to Disease; Humans; Introns; Models, Genetic; Software
ABSTRACT
Precision medicine and sequence-based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex-seq and MaPSy) involved prediction of the effect of variants, primarily single-nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high-throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.
Subjects
Alternative Splicing; Computational Biology/methods; Mutation; Proteins/genetics; Animals; Congresses as Topic; Genetic Fitness; Humans; Models, Genetic; Sequence Homology, Nucleic Acid
ABSTRACT
The stability of mRNA is one of the major determinants of gene expression. Although a wealth of sequence elements regulating mRNA stability has been described, their quantitative contributions to half-life are unknown. Here, we built a quantitative model for Saccharomyces cerevisiae based on functional mRNA sequence features that explains 59% of the half-life variation between genes and predicts half-life at a median relative error of 30%. The model revealed a new destabilizing 3' UTR motif, ATATTC, which we functionally validated. Codon usage proves to be the major determinant of mRNA stability. Nonetheless, single-nucleotide variations have the largest effect when occurring on 3' UTR motifs or upstream AUGs. Analyzing mRNA half-life data of 34 knockout strains showed that the effect of codon usage not only requires functional decapping and deadenylation, but also the 5'-to-3' exonuclease Xrn1, the nonsense-mediated decay genes, but not no-go decay. Altogether, this study quantitatively delineates the contributions of mRNA sequence features on stability in yeast, reveals their functional dependencies on degradation pathways, and allows accurate prediction of half-life from mRNA sequence.
Subjects
RNA Stability/genetics; RNA, Fungal/genetics; RNA, Fungal/metabolism; RNA, Messenger/genetics; RNA, Messenger/metabolism; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae/metabolism; 3' Untranslated Regions/genetics; Base Sequence; Codon/genetics; Codon/metabolism; Gene Knockout Techniques; Genes, Fungal; Half-Life; Models, Biological; Nonsense Mediated mRNA Decay/genetics; Peptide Chain Initiation, Translational; Regulatory Elements, Transcriptional; Schizosaccharomyces/genetics; Schizosaccharomyces/metabolism
ABSTRACT
Motivation: Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start site, exon boundaries or polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its strength to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results: Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation: Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact: avsec@in.tum.de or gagneur@in.tum.de. Supplementary information: Supplementary data are available at Bioinformatics online.
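To make the idea of a spline transformation concrete, here is a minimal sketch that assumes a B-spline basis: a scalar distance is expanded into smooth basis functions, and a weighted sum of those functions (weights that a network would learn) gives the transformed value. This does not reproduce the Keras layer in CONCISE; the function names, the clamped uniform knot placement, and the example weights are assumptions made for illustration.

```python
def clamped_knots(start, end, n_basis, degree):
    """Uniform knot vector with repeated boundary knots (needs n_basis > degree)."""
    n_inner = n_basis - degree - 1
    step = (end - start) / (n_inner + 1)
    inner = [start + step * (j + 1) for j in range(n_inner)]
    return [start] * (degree + 1) + inner + [end] * (degree + 1)


def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion.

    Returns len(knots) - degree - 1 values; all zeros if x lies outside the knots.
    """
    n_intervals = len(knots) - 1
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    if x == knots[-1]:
        # Include the right boundary in the last non-empty interval.
        for i in range(n_intervals - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis


def spline_transform(x, weights, knots, degree=3):
    """Weighted sum of B-spline basis functions evaluated at x."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))


# Hypothetical example: distance to a landmark between 0 and 100 nt, 8 basis functions.
knots = clamped_knots(0.0, 100.0, n_basis=8, degree=3)
weights = [0.0, 0.2, 1.0, 0.6, 0.1, 0.0, 0.0, 0.0]  # stand-ins for learned weights
print(spline_transform(25.0, weights, knots))
```

Because neighbouring basis functions overlap smoothly, small changes in the weights produce smooth changes in the curve, which is one way to read the abstract's claim of more robust training than a piecewise linear fit.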
Subjects
Genomics/methods; Models, Genetic; Neural Networks, Computer; Regulatory Sequences, Nucleic Acid; DNA; Hep G2 Cells; Humans; K562 Cells; Machine Learning; Protein Binding; Proteins/metabolism; RNA; Sequence Analysis, DNA/methods; Sequence Analysis, RNA/methods; Software
ABSTRACT
The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.
Subjects
Amino Acid Substitution; Disease; Mutation, Missense; Proteome; Sequence Alignment; Humans; Amino Acid Substitution/genetics; Benchmarking; Conserved Sequence; Databases, Genetic; Disease/genetics; Genome, Human; Protein Conformation; Proteome/genetics; Sequence Alignment/methods; Machine Learning
ABSTRACT
Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease [1-6]. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium [7]. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.
ABSTRACT
Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.
Subjects
Datasets as Topic/standards; Deep Learning; Organ Specificity; Single-Cell Analysis/standards; Animals; COVID-19/pathology; Humans; Mice; Reference Standards; SARS-CoV-2/pathogenicity
ABSTRACT
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
Subjects
Computational Biology/methods; Nucleotide Motifs; Transcription Factors/metabolism; Animals; Binding Sites; Chromatin Immunoprecipitation; Clustered Regularly Interspaced Short Palindromic Repeats; Deep Learning; Mice; Mouse Embryonic Stem Cells/physiology; Nanog Homeobox Protein/metabolism; Neural Networks, Computer; Octamer Transcription Factor-3/metabolism; Reproducibility of Results; SOXB1 Transcription Factors/metabolism
ABSTRACT
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files.