Results 1 - 20 of 96
1.
Cell ; 155(5): 1075-87, 2013 Nov 21.
Article in English | MEDLINE | ID: mdl-24210918

ABSTRACT

Pervasive transcription of eukaryotic genomes stems to a large extent from bidirectional promoters that synthesize mRNA and divergent noncoding RNA (ncRNA). Here, we show that ncRNA transcription in the yeast S. cerevisiae is globally restricted by early termination that relies on the essential RNA-binding factor Nrd1. Depletion of Nrd1 from the nucleus results in 1,526 Nrd1-unterminated transcripts (NUTs) that originate from nucleosome-depleted regions (NDRs) and can deregulate mRNA synthesis by antisense repression and transcription interference. Transcriptome-wide Nrd1-binding maps reveal divergent NUTs at most promoters and antisense NUTs in most 3' regions of genes. Nrd1 and its partner Nab3 preferentially bind RNA motifs that are depleted in mRNAs and enriched in ncRNAs and some mRNAs whose synthesis is controlled by transcription attenuation. These results define a global mechanism for transcriptome surveillance that selectively terminates ncRNA synthesis to provide promoter directionality and to suppress antisense transcription.


Subject(s)
RNA, Fungal/genetics , RNA, Untranslated/genetics , RNA-Binding Proteins/metabolism , Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/metabolism , Transcription Termination, Genetic , Transcriptome , Down-Regulation , Nuclear Proteins/metabolism , Promoter Regions, Genetic , RNA, Antisense/metabolism , Saccharomyces cerevisiae/genetics
2.
Nat Methods ; 21(1): 28-31, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38049697

ABSTRACT

Single-cell ATAC sequencing coverage in regulatory regions is typically binarized as an indicator of open chromatin. Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration. Fragment counts, but not read counts, should instead be modeled, which preserves quantitative regulatory information. These results have immediate implications for single-cell ATAC sequencing analysis.


Subject(s)
Chromatin Immunoprecipitation Sequencing , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Chromatin/genetics , Single-Cell Analysis
3.
Am J Hum Genet ; 110(12): 2056-2067, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38006880

ABSTRACT

Detection of aberrantly spliced genes is an important step in RNA-seq-based rare-disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method that outperformed alternative methods of detecting aberrant splicing. However, because FRASER's three splice metrics are partially redundant and tend to be sensitive to sequencing depth, we introduce here a more robust intron-excision metric, the intron Jaccard index, that combines the alternative donor, alternative acceptor, and intron-retention signal into a single value. Moreover, we optimized model parameters and filter cutoffs by using candidate rare-splice-disrupting variants as independent evidence. On 16,213 GTEx samples, our improved algorithm, FRASER 2.0, typically called 10 times fewer splicing outliers while increasing the proportion of candidate rare-splice-disrupting variants by 10-fold and substantially decreasing the effect of sequencing depth on the number of reported outliers. To lower the multiple-testing correction burden, we introduce an option to select the genes to be tested for each sample instead of a transcriptome-wide approach. This option can be particularly useful when prior information, such as candidate variants or genes, is available. Application on 303 rare-disease samples confirmed the relative reduction in the number of outlier calls for a slight loss of sensitivity; FRASER 2.0 recovered 22 out of 26 previously identified pathogenic splicing cases with default cutoffs and 24 when multiple-testing correction was limited to OMIM genes containing rare variants. Altogether, these methodological improvements contribute to more effective RNA-seq-based rare-disease diagnostics by drastically reducing the number of splicing outlier calls per sample at minimal loss of sensitivity.


Subject(s)
Alternative Splicing , RNA Splicing , Humans , Alternative Splicing/genetics , Introns/genetics , RNA Splicing/genetics , RNA-Seq , Algorithms
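
The intron Jaccard index named in the abstract above can be illustrated with a minimal sketch. The abstract does not spell out the formula, so the version below is an assumption: it treats the index as the Jaccard index between the reads that use an intron's donor site and the reads that use its acceptor site, where split reads of the intron itself form the intersection and non-split reads at either site (the intron-retention signal) enlarge the union. The function name, the argument names, and the read counts are hypothetical and are not taken from the FRASER software.

```python
def intron_jaccard(split_counts, nonsplit_donor, nonsplit_acceptor, donor, acceptor):
    """Jaccard-style index for the intron (donor, acceptor).

    split_counts: dict mapping (donor, acceptor) -> number of split reads
    nonsplit_donor / nonsplit_acceptor: reads crossing the site unspliced
    """
    # Intersection: split reads supporting exactly this intron.
    k = split_counts.get((donor, acceptor), 0)
    # All split reads sharing the donor (alternative acceptors included).
    donor_reads = sum(n for (d, _), n in split_counts.items() if d == donor)
    # All split reads sharing the acceptor (alternative donors included).
    acceptor_reads = sum(n for (_, a), n in split_counts.items() if a == acceptor)
    # Union: reads at either site, counting the shared split reads once.
    union = donor_reads + acceptor_reads - k + nonsplit_donor + nonsplit_acceptor
    return k / union if union else float("nan")


# Hypothetical counts: one main intron, one alternative acceptor,
# one alternative donor, and a few unspliced (retention) reads.
example_counts = {(100, 200): 80, (100, 250): 10, (150, 200): 5}
example_index = intron_jaccard(example_counts, 3, 2, donor=100, acceptor=200)
```

A single value per intron drops when any of the three competing events (alternative donor, alternative acceptor, retention) gains reads, which is the property the abstract attributes to the metric.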
4.
Mol Syst Biol ; 20(5): 506-520, 2024 May.
Article in English | MEDLINE | ID: mdl-38491213

ABSTRACT

Codon optimality is a major determinant of mRNA translation and degradation rates. However, whether and through which mechanisms its effects are regulated remains poorly understood. Here we show that codon optimality associates with up to 2-fold change in mRNA stability variations between human tissues, and that its effect is attenuated in tissues with high energy metabolism and amplifies with age. Mathematical modeling and perturbation data through oxygen deprivation and ATP synthesis inhibition reveal that cellular energy variations non-uniformly alter the effect of codon usage. This new mode of codon effect regulation, independent of tRNA regulation, provides a fundamental mechanistic link between cellular energy metabolism and eukaryotic gene expression.


Subject(s)
Codon , Energy Metabolism , RNA Stability , RNA, Messenger , Humans , Energy Metabolism/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism , Codon/genetics , Codon Usage , Protein Biosynthesis , RNA, Transfer/genetics , RNA, Transfer/metabolism , Adenosine Triphosphate/metabolism , Gene Expression Regulation
5.
Nat Rev Genet ; 20(7): 389-403, 2019 07.
Article in English | MEDLINE | ID: mdl-30971806

ABSTRACT

As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.


Subject(s)
Deep Learning , Genomics/methods , Models, Genetic , Neural Networks, Computer , Base Sequence , Computer Simulation , Humans , Supervised Machine Learning , Unsupervised Machine Learning
6.
Nucleic Acids Res ; 51(4): e21, 2023 02 28.
Article in English | MEDLINE | ID: mdl-36617985

ABSTRACT

Transposon screens are powerful in vivo assays used to identify loci driving carcinogenesis. These loci are identified as Common Insertion Sites (CISs), i.e. regions with more transposon insertions than expected by chance. However, the identification of CISs is affected by biases in the insertion behaviour of transposon systems. Here, we introduce Transmicron, a novel method that differs from previous methods by (i) modelling neutral insertion rates based on chromatin accessibility, transcriptional activity and sequence context and (ii) estimating oncogenic selection for each genomic region using Poisson regression to model insertion counts while controlling for neutral insertion rates. To assess the benefits of our approach, we generated a dataset applying two different transposon systems under comparable conditions. Benchmarking for enrichment of known cancer genes showed improved performance of Transmicron against state-of-the-art methods. Modelling neutral insertion rates allowed for better control of false positives and stronger agreement of the results between transposon systems. Moreover, using Poisson regression to consider intra-sample and inter-sample information proved beneficial in small and moderately-sized datasets. Transmicron is open-source and freely available. Overall, this study contributes to the understanding of transposon biology and introduces a novel approach to use this knowledge for discovering cancer driver genes.


Subject(s)
DNA Transposable Elements , Neoplasms , Software , Humans , Base Sequence , Carcinogenesis , Mutagenesis, Insertional , Oncogenes , Neoplasms/genetics
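
The idea of testing insertion counts against a neutral expectation, as described in the abstract above, can be sketched in a much simplified form. The code below is not Transmicron's Poisson regression; it swaps in a plain one-sided Poisson tail test per region, assuming the expected neutral insertion count of each region is already known. The function name and the example numbers are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed from the lower tail."""
    if k <= 0:
        return 1.0
    below = 0.0
    term = math.exp(-lam)  # P(X = 0)
    for i in range(k):
        below += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - below)


# Hypothetical region: 12 observed insertions where a neutral model expects 2.
p_enriched = poisson_upper_tail(12, 2.0)
# Hypothetical region: 3 observed insertions where a neutral model expects 2.
p_neutral = poisson_upper_tail(3, 2.0)
```

A region accessible to the transposon has a high expected count and therefore needs more insertions to reach the same p-value, which is how controlling for neutral insertion rates reduces false positives.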
7.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36708003

ABSTRACT

MOTIVATION: Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model which scores enhancer-gene interactions based on enhancer activity and the contact frequency of an enhancer to its target gene. However, it describes regulatory interactions entirely from a gene's perspective, and does not account for all the candidate target genes of an enhancer. In addition, the ABC model requires two types of assays to measure enhancer activity, which limits the applicability. Moreover, no available implementation allows integration with transcription factor (TF) binding information or efficient analysis of single-cell data. RESULTS: We demonstrate that the ABC score can yield a higher accuracy by adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes and also by considering all annotated transcription start sites of a gene. Further, we show that the model is comparably accurate with only one assay to measure enhancer activity. We combined our generalized ABC model with TF binding information and illustrated an analysis of a single-cell ATAC-seq dataset of the human heart, where we were able to characterize cell type-specific regulatory interactions and predict gene expression based on TF affinities. All executed processing steps are incorporated into our new computational pipeline STARE. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/schulzlab/STARE. CONTACT: marcel.schulz@em.uni-frankfurt.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Regulation , Transcription Factors , Humans , Transcription Factors/metabolism , Regulatory Sequences, Nucleic Acid , Software , Protein Binding
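
The scoring scheme described in the abstract above can be sketched as follows. The classic ABC score is taken here as activity times contact, normalized over all candidate enhancers of a gene. The `adapt_activity` option is one plausible reading of "adapting the enhancer activity according to the number of contacts", in which an enhancer's activity is split across its candidate genes in proportion to contact; the exact formula used by STARE is not given in the abstract. All names and numbers are hypothetical, and this is not the STARE implementation.

```python
def abc_scores(activity, contact, adapt_activity=False):
    """activity: {enhancer: A}; contact: {(enhancer, gene): C}.

    Returns {(enhancer, gene): score}; the scores of each gene sum to 1.
    """
    if adapt_activity:
        # Total contact of each enhancer over all of its candidate genes.
        total_contact = {}
        for (e, _), c in contact.items():
            total_contact[e] = total_contact.get(e, 0.0) + c
        # Assumed adaptation: share the activity in proportion to contact.
        weight = {(e, g): activity[e] * c / total_contact[e]
                  for (e, g), c in contact.items() if total_contact[e] > 0}
    else:
        weight = {(e, g): activity[e] for (e, g) in contact}

    raw = {pair: weight[pair] * contact[pair] for pair in weight}
    per_gene = {}
    for (_, g), value in raw.items():
        per_gene[g] = per_gene.get(g, 0.0) + value
    return {(e, g): value / per_gene[g]
            for (e, g), value in raw.items() if per_gene[g] > 0}


# Hypothetical input: enhancer e1 contacts two genes, e2 contacts only gA.
activity = {"e1": 10.0, "e2": 5.0}
contact = {("e1", "gA"): 0.5, ("e2", "gA"): 0.5, ("e1", "gB"): 0.5}
classic = abc_scores(activity, contact)
adapted = abc_scores(activity, contact, adapt_activity=True)
```

In this toy input the classic score favors e1 for gene gA because of its higher activity, while the adapted variant accounts for e1 also contacting gB and rates both enhancers equally for gA.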
8.
Mol Genet Metab ; 142(3): 108511, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38878498

ABSTRACT

The diagnosis of Mendelian disorders has notably advanced with integration of whole exome and genome sequencing (WES and WGS) in clinical practice. However, challenges in variant interpretation and variants not covered by WES still leave a substantial percentage of patients undiagnosed. In this context, integrating RNA sequencing (RNA-seq) improves diagnostic workflows, particularly for WES inconclusive cases. Additionally, functional studies are often necessary to elucidate the impact of prioritized variants on gene expression and protein function. Our study focused on three unrelated male patients (P1-P3) with ATP6AP1-CDG (congenital disorder of glycosylation), presenting with intellectual disability and varying degrees of hepatopathy, glycosylation defects, and an initially inconclusive diagnosis through WES. Subsequent RNA-seq was pivotal in identifying the underlying genetic causes in P1 and P2, detecting ATP6AP1 underexpression and aberrant splicing. Molecular studies in fibroblasts confirmed these findings and identified the rare intronic variants c.289-233C > T and c.289-289G > A in P1 and P2, respectively. Trio-WGS also revealed the variant c.289-289G > A in P3, which was a de novo change in both patients. Functional assays expressing the mutant alleles in HAP1 cells demonstrated the pathogenic impact of these variants by reproducing the splicing alterations observed in patients. Our study underscores the role of RNA-seq and WGS in enhancing diagnostic rates for genetic diseases such as CDG, providing new insights into ATP6AP1-CDG molecular bases by identifying the first two deep intronic variants in this X-linked gene. Additionally, our study highlights the need to integrate RNA-seq and WGS, followed by functional validation, in routine diagnostics for a comprehensive evaluation of patients with an unidentified molecular etiology.


Subject(s)
Introns , RNA, Messenger , Humans , Male , Introns/genetics , RNA, Messenger/genetics , Vacuolar Proton-Translocating ATPases/genetics , Congenital Disorders of Glycosylation/genetics , Congenital Disorders of Glycosylation/diagnosis , Congenital Disorders of Glycosylation/pathology , Mutation , Whole Genome Sequencing , Exome Sequencing , Sequence Analysis, RNA , Intellectual Disability/genetics , Intellectual Disability/diagnosis , Intellectual Disability/pathology , Child , RNA Splicing/genetics , Child, Preschool
9.
Int J Mol Sci ; 25(14)2024 Jul 16.
Article in English | MEDLINE | ID: mdl-39063034

ABSTRACT

Duchenne and Becker muscular dystrophies, caused by pathogenic variants in DMD, are the most common inherited neuromuscular conditions in childhood. These diseases follow an X-linked recessive inheritance pattern, and mainly males are affected. The most prevalent pathogenic variants in the DMD gene are copy number variants (CNVs), and most patients achieve their genetic diagnosis through Multiplex Ligation-dependent Probe Amplification (MLPA) or exome sequencing. Here, we investigated a female patient presenting with muscular dystrophy who remained genetically undiagnosed after MLPA and exome sequencing. RNA sequencing (RNAseq) from the patient's muscle biopsy identified an 85% reduction in DMD expression compared to 116 muscle samples included in the cohort. A de novo balanced translocation between chromosome 17 and the X chromosome (t(X;17)(p21.1;q23.2)) disrupting the DMD and BCAS3 genes was identified through trio whole genome sequencing (WGS). The combined analysis of RNAseq and WGS played a crucial role in the detection and characterisation of the disease-causing variant in this patient, who had been undiagnosed for over two decades. This case illustrates the diagnostic odyssey of female DMD patients with complex structural variants that are not detected by current panel or exome sequencing analysis.


Subject(s)
Chromosomes, Human, X , Dystrophin , Genomics , Muscular Dystrophy, Duchenne , Translocation, Genetic , Humans , Muscular Dystrophy, Duchenne/genetics , Muscular Dystrophy, Duchenne/diagnosis , Female , Dystrophin/genetics , Chromosomes, Human, X/genetics , Genomics/methods , DNA Copy Number Variations , Exome Sequencing , Transcriptome/genetics , Chromosomes, Human, Pair 17/genetics
10.
Basic Res Cardiol ; 117(1): 6, 2022 02 17.
Article in English | MEDLINE | ID: mdl-35175464

ABSTRACT

The majority of risk loci identified by genome-wide association studies (GWAS) are in non-coding regions, hampering their functional interpretation. Instead, transcriptome-wide association studies (TWAS) identify gene-trait associations, which can be used to prioritize candidate genes in disease-relevant tissue(s). Here, we aimed to systematically identify susceptibility genes for coronary artery disease (CAD) by TWAS. We trained prediction models of nine CAD-relevant tissues using EpiXcan based on two genetics-of-gene-expression panels, the Stockholm-Tartu Atherosclerosis Reverse Network Engineering Task (STARNET) and the Genotype-Tissue Expression (GTEx). Based on these prediction models, we imputed gene expression of respective tissues from individual-level genotype data on 37,997 CAD cases and 42,854 controls for the subsequent gene-trait association analysis. Transcriptome-wide significant association (i.e. P < 3.85e-6) was observed for 114 genes. Of these, 96 resided within previously identified GWAS risk loci and 18 were novel. Stepwise analyses were performed to study their plausibility, biological function, and pathogenicity in CAD, including analyses for colocalization, damaging mutations, pathway enrichment, phenome-wide associations with human data and expression-traits correlations using mouse data. Finally, CRISPR/Cas9-based gene knockdown of two newly identified TWAS genes, RGS19 and KPTN, in a human hepatocyte cell line resulted in reduced secretion of APOB100 and lipids in the cell culture medium. Our CAD TWAS work (i) prioritized candidate causal genes at known GWAS loci, (ii) identified 18 novel genes to be associated with CAD, and (iii) suggested potential tissues and pathways of action for these TWAS CAD genes.


Subject(s)
Coronary Artery Disease , Genome-Wide Association Study , Animals , Coronary Artery Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Mice , Polymorphism, Single Nucleotide , Transcriptome
11.
PLoS Comput Biol ; 17(5): e1008982, 2021 05.
Article in English | MEDLINE | ID: mdl-33970899

ABSTRACT

The 5' untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5'UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants.


Subject(s)
5' Untranslated Regions , Deep Learning , Ribosomes/metabolism , Humans , RNA, Messenger/genetics
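
How a pooling step can turn a variable-length 5'UTR into a fixed-length vector, as the abstract above describes for frame pooling, can be sketched in plain Python. The details are assumptions, since the abstract does not specify them: positions are assigned to one of three reading frames by their distance from the 3' end of the sequence, and each frame is summarized per channel by its maximum and its mean. The function name and the toy inputs are hypothetical.

```python
def frame_pool(features):
    """features: list of L positions (L >= 1), each a list of C channel values.

    Returns a list of 3 * 2 * C values: for each of the three frames and each
    channel, the max and the mean over the positions in that frame.
    """
    if not features:
        raise ValueError("frame_pool needs at least one position")
    length = len(features)
    channels = len(features[0])
    pooled = []
    for frame in range(3):
        # Frame is counted from the 3' end, where the coding sequence starts.
        rows = [features[i] for i in range(length)
                if (length - 1 - i) % 3 == frame]
        for c in range(channels):
            column = [row[c] for row in rows]
            # Frames without any position (very short inputs) pool to zero.
            pooled.append(max(column) if column else 0.0)
            pooled.append(sum(column) / len(column) if column else 0.0)
    return pooled


# Two hypothetical single-channel feature maps of different lengths.
short_map = [[1.0], [2.0], [3.0], [4.0]]
long_map = [[0.5], [1.5], [2.5], [3.5], [4.5], [5.5], [6.5]]
short_vec = frame_pool(short_map)
long_vec = frame_pool(long_map)
```

Both outputs have the same length regardless of input length, so a dense layer can follow the pooling step while the frame of each feature is still preserved.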
12.
Int J Mol Sci ; 23(20)2022 Oct 15.
Article in English | MEDLINE | ID: mdl-36293220

ABSTRACT

Peroxisomal biogenesis disorders (PBDs) are a heterogeneous group of genetic diseases. Multiple peroxisomal pathways are impaired, and very long chain fatty acids (VLCFA) are the first line biomarkers for the diagnosis. The clinical presentation of PBDs may range from severe, lethal multisystemic disorders to milder, late-onset disease. The vast majority of PBDs belong to Zellweger Spectrum Disorders (ZSDs) and represent a continuum of overlapping clinical symptoms, with Zellweger syndrome being the most severe and Heimler syndrome the least severe disease. Mild clinical conditions frequently present normal or slight biochemical alterations, making the diagnosis of these patients challenging. In the present study we used a combined WES and RNA-seq strategy to diagnose a patient presenting with retinal dystrophy as the main clinical symptom. Results showed the patient was compound heterozygous for mutations in PEX1. VLCFA were normal, but retrospective analysis showed that lysophosphatidylcholine (LPC) species containing C22:0-C26:0 were altered. This simple test could avoid the diagnostic odyssey of patients with mild phenotype, such as the individual described here, who was diagnosed very late in adult life. We provide functional data in cell line models that may explain the mild phenotype of the patient by demonstrating the hypomorphic nature of a deep intronic variant altering PEX1 mRNA processing.


Subject(s)
Deafness , Hearing Loss, Sensorineural , Zellweger Syndrome , Humans , ATPases Associated with Diverse Cellular Activities/metabolism , RNA-Seq , Retrospective Studies , Membrane Proteins/genetics , Membrane Proteins/metabolism , Zellweger Syndrome/diagnosis , Zellweger Syndrome/genetics , Hearing Loss, Sensorineural/genetics , Biomarkers , RNA, Messenger , Fatty Acids
13.
Am J Hum Genet ; 103(6): 907-917, 2018 12 06.
Article in English | MEDLINE | ID: mdl-30503520

ABSTRACT

RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.


Subject(s)
Gene Expression/genetics , Genetic Variation/genetics , RNA/metabolism , Sequence Analysis, RNA/methods , Algorithms , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Humans
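
The outlier test described in the abstract above can be sketched with a negative binomial distribution parameterized by an expected read count and a gene-specific dispersion. In OUTRIDER the expectation comes from an autoencoder; here it is simply passed in as a number. The two-sided p-value below (twice the smaller tail, capped at 1) is an assumed convention, and all function names and example values are hypothetical.

```python
import math


def nb_pmf(k, mu, theta):
    """P(X = k) for a negative binomial with mean mu > 0 and dispersion theta > 0."""
    log_p = (math.lgamma(k + theta) - math.lgamma(k + 1) - math.lgamma(theta)
             + theta * math.log(theta / (theta + mu))
             + k * math.log(mu / (theta + mu)))
    return math.exp(log_p)


def two_sided_pvalue(k, mu, theta):
    """Two-sided p-value of an observed read count k under NB(mu, theta)."""
    lower = sum(nb_pmf(i, mu, theta) for i in range(k + 1))  # P(X <= k)
    upper = 1.0 - (lower - nb_pmf(k, mu, theta))             # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))


# Hypothetical gene: expected count 100, dispersion 10.
p_typical = two_sided_pvalue(100, 100.0, 10.0)   # count near the expectation
p_dropout = two_sided_pvalue(0, 100.0, 10.0)     # no reads at all
```

In practice such p-values would be computed for every gene and sample and then adjusted for multiple testing, as the abstract indicates with its false-discovery-rate-adjusted p values.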
14.
Am J Hum Genet ; 100(1): 151-159, 2017 Jan 05.
Article in English | MEDLINE | ID: mdl-27989324

ABSTRACT

MDH2 encodes mitochondrial malate dehydrogenase (MDH), which is essential for the conversion of malate to oxaloacetate as part of the proper functioning of the Krebs cycle. We report bi-allelic pathogenic mutations in MDH2 in three unrelated subjects presenting with early-onset generalized hypotonia, psychomotor delay, refractory epilepsy, and elevated lactate in the blood and cerebrospinal fluid. Functional studies in fibroblasts from affected subjects showed both an apparently complete loss of MDH2 levels and MDH2 enzymatic activity close to null. Metabolomics analyses demonstrated a significant concomitant accumulation of the MDH substrate, malate, and fumarate, its immediate precursor in the Krebs cycle, in affected subjects' fibroblasts. Lentiviral complementation with wild-type MDH2 cDNA restored MDH2 levels and mitochondrial MDH activity. Additionally, introduction of the three missense mutations from the affected subjects into Saccharomyces cerevisiae provided functional evidence to support their pathogenicity. Disruption of the Krebs cycle is a hallmark of cancer, and MDH2 has been recently identified as a novel pheochromocytoma and paraganglioma susceptibility gene. We show that loss-of-function mutations in MDH2 are also associated with severe neurological clinical presentations in children.


Subject(s)
Brain Diseases/genetics , Citric Acid Cycle , Malate Dehydrogenase/genetics , Mutation , Age of Onset , Alleles , Amino Acid Sequence , Child , Child, Preschool , Citric Acid Cycle/genetics , Fibroblasts/enzymology , Fibroblasts/metabolism , Fumarates/metabolism , Genetic Complementation Test , Humans , Infant , Infant, Newborn , Malate Dehydrogenase/chemistry , Malate Dehydrogenase/metabolism , Malates/metabolism , Male , Metabolomics , Models, Molecular
15.
Mol Syst Biol ; 15(2): e8513, 2019 02 18.
Article in English | MEDLINE | ID: mdl-30777893

ABSTRACT

Despite their importance in determining protein abundance, a comprehensive catalogue of sequence features controlling protein-to-mRNA (PTR) ratios and a quantification of their effects are still lacking. Here, we quantified PTR ratios for 11,575 proteins across 29 human tissues using matched transcriptomes and proteomes. We estimated by regression the contribution of known sequence determinants of protein synthesis and degradation in addition to 45 mRNA and 3 protein sequence motifs that we found by association testing. While PTR ratios span more than 2 orders of magnitude, our integrative model predicts PTR ratios at a median precision of 3.2-fold. A reporter assay provided functional support for two novel UTR motifs, and an immobilized mRNA affinity competition-binding assay identified motif-specific bound proteins for one motif. Moreover, our integrative model led to a new metric of codon optimality that captures the effects of codon frequency on protein synthesis and degradation. Altogether, this study shows that a large fraction of PTR ratio variation in human tissues can be predicted from sequence, and it identifies many new candidate post-transcriptional regulatory elements.


Subject(s)
Proteins/genetics , Proteome/genetics , Tissue Distribution/genetics , Transcriptome/genetics , Gene Expression Regulation/genetics , Genome, Human/genetics , Humans , Mass Spectrometry/methods , Proteomics/methods , RNA, Messenger/genetics , Sequence Analysis, RNA/methods
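
The quantities in the abstract above can be made concrete with a short sketch. It computes a protein-to-mRNA (PTR) ratio on a log10 scale and summarizes prediction accuracy as a median fold error; treating "median precision of 3.2-fold" as the median of the absolute log10 errors converted back to a fold change is an assumption. The function names and abundance values are hypothetical.

```python
import math
import statistics


def log10_ptr(protein, mrna):
    """log10 of the protein-to-mRNA ratio for one gene in one tissue."""
    return math.log10(protein / mrna)


def median_fold_error(observed_log10, predicted_log10):
    """Median absolute log10 error, expressed as a fold change."""
    errors = [abs(o - p) for o, p in zip(observed_log10, predicted_log10)]
    return 10 ** statistics.median(errors)


# Hypothetical abundances (arbitrary units) for three genes.
observed = [log10_ptr(1000.0, 10.0), log10_ptr(50.0, 50.0), log10_ptr(10.0, 100.0)]
predicted = [1.5, 0.0, -0.5]
fold_error = median_fold_error(observed, predicted)
```

Working on the log scale keeps over- and under-prediction symmetric, so a 3-fold error means the prediction is off by a factor of 3 in either direction.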
16.
Mol Syst Biol ; 15(2): e8503, 2019 02 18.
Article in English | MEDLINE | ID: mdl-30777892

ABSTRACT

Genome-, transcriptome- and proteome-wide measurements provide insights into how biological systems are regulated. However, fundamental aspects relating to which human proteins exist, where they are expressed and in which quantities are not fully understood. Therefore, we generated a quantitative proteome and transcriptome abundance atlas of 29 paired healthy human tissues from the Human Protein Atlas project representing human genes by 18,072 transcripts and 13,640 proteins including 37 without prior protein-level evidence. The analysis revealed that hundreds of proteins, particularly in testis, could not be detected even for highly expressed mRNAs, that few proteins show tissue-specific expression, that strong differences between mRNA and protein quantities within and across tissues exist and that protein expression is often more stable across tissues than that of transcripts. Only 238 of 9,848 amino acid variants found by exome sequencing could be confidently detected at the protein level showing that proteogenomics remains challenging, needs better computational methods and requires rigorous validation. Many uses of this resource can be envisaged including the study of gene/protein expression regulation and biomarker specificity evaluation.


Subject(s)
Genome, Human/genetics , Proteome/genetics , Tissue Distribution/genetics , Transcriptome/genetics , Gene Expression Regulation/genetics , Humans , Mass Spectrometry/methods , Proteomics/methods , RNA, Messenger/genetics , Sequence Analysis, RNA/methods
17.
Hum Mutat ; 40(9): 1243-1251, 2019 09.
Article in English | MEDLINE | ID: mdl-31070280

ABSTRACT

Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex-seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.


Subject(s)
Computational Biology/methods , Genetic Variation , RNA Splicing , Congresses as Topic , Exons , Genetic Predisposition to Disease , Humans , Introns , Models, Genetic , Software
18.
Hum Mutat ; 40(9): 1215-1224, 2019 09.
Article in English | MEDLINE | ID: mdl-31301154

ABSTRACT

Precision medicine and sequence-based clinical diagnostics seek to predict disease risk or to identify causative variants from sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. In the past, few CAGI challenges have addressed the impact of sequence variants on splicing. In CAGI5, two challenges (Vex-seq and MaPSy) involved prediction of the effect of variants, primarily single-nucleotide changes, on splicing. Although there are significant differences between these two challenges, both involved prediction of results from high-throughput exon inclusion assays. Here, we discuss the methods used to predict the impact of these variants on splicing, their performance, strengths, and weaknesses, and prospects for predicting the impact of sequence variation on splicing and disease phenotypes.


Subject(s)
Alternative Splicing , Computational Biology/methods , Mutation , Proteins/genetics , Animals , Congresses as Topic , Genetic Fitness , Humans , Models, Genetic , Sequence Homology, Nucleic Acid
19.
RNA ; 23(11): 1648-1659, 2017 11.
Article in English | MEDLINE | ID: mdl-28802259

ABSTRACT

The stability of mRNA is one of the major determinants of gene expression. Although a wealth of sequence elements regulating mRNA stability has been described, their quantitative contributions to half-life are unknown. Here, we built a quantitative model for Saccharomyces cerevisiae based on functional mRNA sequence features that explains 59% of the half-life variation between genes and predicts half-life at a median relative error of 30%. The model revealed a new destabilizing 3' UTR motif, ATATTC, which we functionally validated. Codon usage proves to be the major determinant of mRNA stability. Nonetheless, single-nucleotide variations have the largest effect when occurring on 3' UTR motifs or upstream AUGs. Analyzing mRNA half-life data of 34 knockout strains showed that the effect of codon usage requires not only functional decapping and deadenylation but also the 5'-to-3' exonuclease Xrn1 and the nonsense-mediated decay genes, though not no-go decay. Altogether, this study quantitatively delineates the contributions of mRNA sequence features on stability in yeast, reveals their functional dependencies on degradation pathways, and allows accurate prediction of half-life from mRNA sequence.


Subject(s)
RNA Stability/genetics , RNA, Fungal/genetics , RNA, Fungal/metabolism , RNA, Messenger/genetics , RNA, Messenger/metabolism , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , 3' Untranslated Regions/genetics , Base Sequence , Codon/genetics , Codon/metabolism , Gene Knockout Techniques , Genes, Fungal , Half-Life , Models, Biological , Nonsense Mediated mRNA Decay/genetics , Peptide Chain Initiation, Translational , Regulatory Elements, Transcriptional , Schizosaccharomyces/genetics , Schizosaccharomyces/metabolism
20.
Bioinformatics ; 34(8): 1261-1269, 2018 04 15.
Article in English | MEDLINE | ID: mdl-29155928

ABSTRACT

Motivation: Regulatory sequences are not solely defined by their nucleic acid sequence but also by their relative distances to genomic landmarks such as transcription start site, exon boundaries or polyadenylation site. Deep learning has become the approach of choice for modeling regulatory sequences because of its strength to learn complex sequence features. However, modeling relative distances to genomic landmarks in deep neural networks has not been addressed. Results: Here we developed spline transformation, a neural network module based on splines to flexibly and robustly model distances. Modeling distances to various genomic landmarks with spline transformations significantly increased state-of-the-art prediction accuracy of in vivo RNA-binding protein binding sites for 120 out of 123 proteins. We also developed a deep neural network for human splice branchpoint based on spline transformations that outperformed the current best, already distance-based, machine learning model. Compared to piecewise linear transformation, as obtained by composition of rectified linear units, spline transformation yields higher prediction accuracy as well as faster and more robust training. As spline transformation can be applied to further quantities beyond distances, such as methylation or conservation, we foresee it as a versatile component in the genomics deep learning toolbox. Availability and implementation: Spline transformation is implemented as a Keras layer in the CONCISE python package: https://github.com/gagneurlab/concise. Analysis code is available at https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017. Contact: avsec@in.tum.de or gagneur@in.tum.de. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics/methods , Models, Genetic , Neural Networks, Computer , Regulatory Sequences, Nucleic Acid , DNA , Hep G2 Cells , Humans , K562 Cells , Machine Learning , Protein Binding , Proteins/metabolism , RNA , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Software
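
The spline-based modeling of distances described in the abstract above can be sketched without any deep learning framework. The code evaluates B-spline basis functions with the Cox-de Boor recursion and combines them with a weight vector; in a network those weights would be learned, while here they are fixed placeholders. This is not the CONCISE Keras layer, and the function names, knot vector, and weights are hypothetical.

```python
def bspline_basis(x, knots, degree):
    """Values at x of all B-spline basis functions for the given knot vector.

    Uses the Cox-de Boor recursion with half-open knot intervals, so x must
    lie in [knots[0], knots[-1]) for the values to sum to 1.
    """
    n = len(knots) - 1
    # Degree 0: indicator of the knot interval that contains x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n)]
    for d in range(1, degree + 1):
        raised = []
        for i in range(n - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            raised.append(left + right)
        basis = raised
    return basis


def spline_transform(x, knots, degree, weights):
    """Smooth function of x: weighted sum of the B-spline basis values."""
    return sum(w * b for w, b in zip(weights, bspline_basis(x, knots, degree)))


# Cubic splines on a clamped knot vector covering distances in [0, 4).
example_knots = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4]
example_basis = bspline_basis(2.5, example_knots, 3)
flat_output = spline_transform(2.5, example_knots, 3, [1.0] * 7)
```

Because the output is linear in the weights and each basis function is smooth and local, a model can learn a flexible distance effect from few parameters, in contrast to composing many rectified linear units.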