1.
Nat Rev Genet ; 20(7): 389-403, 2019 07.
Article in English | MEDLINE | ID: mdl-30971806

ABSTRACT

As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.


Subject(s)
Deep Learning; Genomics/methods; Models, Genetic; Neural Networks, Computer; Base Sequence; Computer Simulation; Humans; Supervised Machine Learning; Unsupervised Machine Learning
2.
Nat Methods ; 18(10): 1196-1203, 2021 10.
Article in English | MEDLINE | ID: mdl-34608324

ABSTRACT

How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.


Subject(s)
DNA/genetics; Databases, Genetic; Epigenesis, Genetic; Gene Expression Regulation; Machine Learning; Nerve Net; Animals; Cell Line; Genome; Genomics/methods; Humans; Mice; Quantitative Trait Loci
3.
PLoS Comput Biol ; 17(5): e1008982, 2021 05.
Article in English | MEDLINE | ID: mdl-33970899

ABSTRACT

The 5' untranslated region (5'UTR) plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5'UTR regulatory sequences should provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to the sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed-length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants.


Subject(s)
5' Untranslated Regions; Deep Learning; Ribosomes/metabolism; Humans; RNA, Messenger/genetics
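
Illustrative note on record 3: the abstract names frame pooling but does not define it. The sketch below shows one plausible reading, assumed here rather than taken from the abstract: per-position activations are grouped by reading frame, counted from the 3' end of the 5'UTR, and each frame is summarized by its maximum and mean. The output then has a fixed size regardless of sequence length. The function name `frame_pool` and the single-channel input are hypothetical simplifications.

```python
from statistics import mean


def frame_pool(activations):
    """Pool per-position activations into a fixed-size vector by reading frame.

    Assumption: the frame of a position is its distance to the 3' end of the
    5'UTR (where the start codon would follow), taken modulo 3.
    """
    n = len(activations)
    frames = [[], [], []]
    for i, value in enumerate(activations):
        frame = (n - 1 - i) % 3
        frames[frame].append(value)

    pooled = []
    for values in frames:
        if values:
            # Global max and global average within this frame
            pooled.extend([max(values), mean(values)])
        else:
            # Sequences shorter than 3 positions leave some frames empty
            pooled.extend([0.0, 0.0])
    return pooled


short_utr = frame_pool([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
long_utr = frame_pool([0.5] * 50)
# Both results have six entries, independent of the input length.
```

Because the pooled vector always has six entries per channel, any dense layers placed after it do not depend on the length of the input sequence, which is the property the abstract attributes to the operation.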
4.
Am J Hum Genet ; 103(6): 907-917, 2018 12 06.
Article in English | MEDLINE | ID: mdl-30503520

ABSTRACT

RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.


Subject(s)
Gene Expression/genetics; Genetic Variation/genetics; RNA/metabolism; Sequence Analysis, RNA/methods; Algorithms; Gene Expression Profiling/methods; High-Throughput Nucleotide Sequencing/methods; Humans
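
Illustrative note on record 4: the abstract describes testing read counts against a negative binomial distribution with a gene-specific dispersion, followed by false-discovery-rate adjustment. The sketch below is a generic version of that idea, not the OUTRIDER implementation. It assumes the expected count `mu` is already available (in the paper it comes from the autoencoder), parametrizes the dispersion as a size parameter `theta`, and uses Benjamini-Hochberg as one common FDR procedure, since the abstract does not say which one is applied. All function names are hypothetical.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with mean mu > 0
    and size theta > 0 (variance = mu + mu**2 / theta)."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def nb_two_sided_pvalue(k, mu, theta):
    """Two-sided p-value: 2 * min(P(X <= k), P(X >= k)), capped at 1."""
    lower = sum(math.exp(nb_log_pmf(i, mu, theta)) for i in range(k + 1))
    # P(X >= k) = 1 - P(X <= k) + P(X = k); clamp tiny negative rounding errors
    upper = max(0.0, 1.0 - lower + math.exp(nb_log_pmf(k, mu, theta)))
    return min(1.0, 2.0 * min(lower, upper))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts for one gene across four samples, expected count 100
counts = [95, 110, 1000, 0]
pvals = [nb_two_sided_pvalue(k, mu=100.0, theta=10.0) for k in counts]
padj = benjamini_hochberg(pvals)
outliers = [k for k, p in zip(counts, padj) if p < 0.05]
```

Thresholding on the adjusted p-value rather than on a fold change or Z-score is what gives the cutoff a statistical interpretation, the point the abstract makes about arbitrary cutoffs in earlier methods.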
5.
Am J Hum Genet ; 100(1): 151-159, 2017 Jan 05.
Article in English | MEDLINE | ID: mdl-27989324

ABSTRACT

MDH2 encodes mitochondrial malate dehydrogenase (MDH), which is essential for the conversion of malate to oxaloacetate as part of the proper functioning of the Krebs cycle. We report bi-allelic pathogenic mutations in MDH2 in three unrelated subjects presenting with early-onset generalized hypotonia, psychomotor delay, refractory epilepsy, and elevated lactate in the blood and cerebrospinal fluid. Functional studies in fibroblasts from affected subjects showed both an apparently complete loss of MDH2 levels and MDH2 enzymatic activity close to null. Metabolomics analyses demonstrated a significant concomitant accumulation of the MDH substrate, malate, and fumarate, its immediate precursor in the Krebs cycle, in affected subjects' fibroblasts. Lentiviral complementation with wild-type MDH2 cDNA restored MDH2 levels and mitochondrial MDH activity. Additionally, introduction of the three missense mutations from the affected subjects into Saccharomyces cerevisiae provided functional evidence to support their pathogenicity. Disruption of the Krebs cycle is a hallmark of cancer, and MDH2 has been recently identified as a novel pheochromocytoma and paraganglioma susceptibility gene. We show that loss-of-function mutations in MDH2 are also associated with severe neurological clinical presentations in children.


Subject(s)
Brain Diseases/genetics; Citric Acid Cycle; Malate Dehydrogenase/genetics; Mutation; Age of Onset; Alleles; Amino Acid Sequence; Child; Child, Preschool; Citric Acid Cycle/genetics; Fibroblasts/enzymology; Fibroblasts/metabolism; Fumarates/metabolism; Genetic Complementation Test; Humans; Infant; Infant, Newborn; Malate Dehydrogenase/chemistry; Malate Dehydrogenase/metabolism; Malates/metabolism; Male; Metabolomics; Models, Molecular
6.
Hum Mutat ; 40(9): 1243-1251, 2019 09.
Article in English | MEDLINE | ID: mdl-31070280

ABSTRACT

Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex-seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.


Subject(s)
Computational Biology/methods; Genetic Variation; RNA Splicing; Congresses as Topic; Exons; Genetic Predisposition to Disease; Humans; Introns; Models, Genetic; Software
7.
Hum Mutat ; 40(9): 1215-1224, 2019 09.
Article in English | MEDLINE | ID: mdl-31301154

ABSTRACT

Precision medicine and sequence-based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex-seq and MaPSy) involved prediction of the effect of variants, primarily single-nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high-throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.


Subject(s)
Alternative Splicing; Computational Biology/methods; Mutation; Proteins/genetics; Animals; Congresses as Topic; Genetic Fitness; Humans; Models, Genetic; Sequence Homology, Nucleic Acid
8.
RNA ; 23(11): 1648-1659, 2017 11.
Article in English | MEDLINE | ID: mdl-28802259

ABSTRACT

The stability of mRNA is one of the major determinants of gene expression. Although a wealth of sequence elements regulating mRNA stability has been described, their quantitative contributions to half-life are unknown. Here, we built a quantitative model for Saccharomyces cerevisiae based on functional mRNA sequence features that explains 59% of the half-life variation between genes and predicts half-life at a median relative error of 30%. The model revealed a new destabilizing 3' UTR motif, ATATTC, which we functionally validated. Codon usage proves to be the major determinant of mRNA stability. Nonetheless, single-nucleotide variations have the largest effect when occurring on 3' UTR motifs or upstream AUGs. Analyzing mRNA half-life data of 34 knockout strains showed that the effect of codon usage requires not only functional decapping and deadenylation but also the 5'-to-3' exonuclease Xrn1 and the nonsense-mediated decay genes, though not no-go decay. Altogether, this study quantitatively delineates the contributions of mRNA sequence features to stability in yeast, reveals their functional dependencies on degradation pathways, and allows accurate prediction of half-life from mRNA sequence.


Subject(s)
RNA Stability/genetics; RNA, Fungal/genetics; RNA, Fungal/metabolism; RNA, Messenger/genetics; RNA, Messenger/metabolism; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae/metabolism; 3' Untranslated Regions/genetics; Base Sequence; Codon/genetics; Codon/metabolism; Gene Knockout Techniques; Genes, Fungal; Half-Life; Models, Biological; Nonsense Mediated mRNA Decay/genetics; Peptide Chain Initiation, Translational; Regulatory Elements, Transcriptional; Schizosaccharomyces/genetics; Schizosaccharomyces/metabolism
9.
Bioinformatics ; 34(8): 1261-1269, 2018 04 15.
Article in English | MEDLINE | ID: mdl-29155928

ABSTRACT

Motivation: Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as the transcription start site, exon boundaries or the polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its ability to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results: Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation: Spline transformation is implemented as a Keras layer in the CONCISE Python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact: avsec@in.tum.de or gagneur@in.tum.de. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics/methods; Models, Genetic; Neural Networks, Computer; Regulatory Sequences, Nucleic Acid; DNA; Hep G2 Cells; Humans; K562 Cells; Machine Learning; Protein Binding; Proteins/metabolism; RNA; Sequence Analysis, DNA/methods; Sequence Analysis, RNA/methods; Software
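
Illustrative note on record 9: the abstract describes spline transformation as a spline-based module that maps a scalar such as a distance to a genomic landmark into a smooth learned function. The sketch below shows the general idea with a B-spline basis evaluated by the Cox-de Boor recursion, followed by a weighted sum whose weights would be the trainable parameters. It is a generic illustration in plain Python and does not reproduce the CONCISE Keras layer. The choice of B-splines, the clamped knot layout and all function names are assumptions.

```python
def clamped_knots(lo, hi, n_interior, degree):
    """Knot vector with repeated boundary knots and evenly spaced interior knots."""
    step = (hi - lo) / (n_interior + 1)
    interior = [lo + step * (i + 1) for i in range(n_interior)]
    return [lo] * (degree + 1) + interior + [hi] * (degree + 1)


def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    # Degree 0: indicator functions on half-open knot intervals
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    if x == knots[-1]:
        # Include the right boundary in the last non-empty interval
        for i in range(len(knots) - 2, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    for d in range(1, degree + 1):
        new_basis = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            new_basis.append(left + right)
        basis = new_basis
    return basis


def spline_transform(x, weights, knots, degree):
    """Smooth scalar function of x: weighted sum of B-spline basis functions."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))


# Hypothetical setup: distances between 0 and 1000 nt, cubic splines
knots = clamped_knots(0.0, 1000.0, n_interior=3, degree=3)
weights = [0.0, 0.2, 0.9, 1.0, 0.4, 0.1, 0.0]  # would be learned in practice
score = spline_transform(123.4, weights, knots, degree=3)
```

With a handful of basis functions the learned curve stays smooth by construction, whereas a stack of rectified linear units has to approximate the same curve with piecewise linear segments, which is the comparison the abstract draws.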
10.
Science ; 381(6664): eadg7492, 2023 09 22.
Article in English | MEDLINE | ID: mdl-37733863

ABSTRACT

The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.


Subject(s)
Amino Acid Substitution; Disease; Mutation, Missense; Proteome; Sequence Alignment; Humans; Amino Acid Substitution/genetics; Benchmarking; Conserved Sequence; Databases, Genetic; Disease/genetics; Genome, Human; Protein Conformation; Proteome/genetics; Sequence Alignment/methods; Machine Learning
11.
bioRxiv ; 2023 Nov 13.
Article in English | MEDLINE | ID: mdl-38014075

ABSTRACT

Identifying transcriptional enhancers and their target genes is essential for understanding gene regulation and the impact of human genetic variation on disease. Here we create and evaluate a resource of >13 million enhancer-gene regulatory interactions across 352 cell types and tissues, by integrating predictive models, measurements of chromatin state and 3D contacts, and large-scale genetic perturbations generated by the ENCODE Consortium. We first create a systematic benchmarking pipeline to compare predictive models, assembling a dataset of 10,411 element-gene pairs measured in CRISPR perturbation experiments, >30,000 fine-mapped eQTLs, and 569 fine-mapped GWAS variants linked to a likely causal gene. Using this framework, we develop a new predictive model, ENCODE-rE2G, that achieves state-of-the-art performance across multiple prediction tasks, demonstrating a strategy involving iterative perturbations and supervised machine learning to build increasingly accurate predictive models of enhancer regulation. Using the ENCODE-rE2G model, we build an encyclopedia of enhancer-gene regulatory interactions in the human genome, which reveals global properties of enhancer networks, identifies differences in the functions of genes that have more or less complex regulatory landscapes, and improves analyses to link noncoding variants to target genes and cell types for common, complex diseases. By interpreting the model, we find evidence that, beyond enhancer activity and 3D enhancer-promoter contacts, additional features guide enhancer-promoter communication including promoter class and enhancer-enhancer synergy. Altogether, these genome-wide maps of enhancer-gene regulatory interactions, benchmarking software, predictive models, and insights about enhancer function provide a valuable resource for future studies of gene regulation and human genetics.

12.
Nat Biotechnol ; 40(1): 121-130, 2022 01.
Article in English | MEDLINE | ID: mdl-34462589

ABSTRACT

Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.


Subject(s)
Datasets as Topic/standards; Deep Learning; Organ Specificity; Single-Cell Analysis/standards; Animals; COVID-19/pathology; Humans; Mice; Reference Standards; SARS-CoV-2/pathogenicity
13.
Nat Genet ; 53(3): 354-366, 2021 03.
Article in English | MEDLINE | ID: mdl-33603233

ABSTRACT

The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.


Subject(s)
Computational Biology/methods; Nucleotide Motifs; Transcription Factors/metabolism; Animals; Binding Sites; Chromatin Immunoprecipitation; Clustered Regularly Interspaced Short Palindromic Repeats; Deep Learning; Mice; Mouse Embryonic Stem Cells/physiology; Nanog Homeobox Protein/metabolism; Neural Networks, Computer; Octamer Transcription Factor-3/metabolism; Reproducibility of Results; SOXB1 Transcription Factors/metabolism
14.
Genome Biol ; 20(1): 48, 2019 03 01.
Article in English | MEDLINE | ID: mdl-30823901

ABSTRACT

Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files.


Subject(s)
Alternative Splicing; Genetic Variation; Models, Genetic; Neural Networks, Computer; Genetic Diseases, Inborn