ABSTRACT
Detection of aberrantly spliced genes is an important step in RNA-seq-based rare-disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method that outperformed alternative methods of detecting aberrant splicing. However, because FRASER's three splice metrics are partially redundant and tend to be sensitive to sequencing depth, we introduce here a more robust intron-excision metric, the intron Jaccard index, that combines the alternative donor, alternative acceptor, and intron-retention signals into a single value. Moreover, we optimized model parameters and filter cutoffs by using candidate rare-splice-disrupting variants as independent evidence. On 16,213 GTEx samples, our improved algorithm, FRASER 2.0, typically called 10 times fewer splicing outliers while increasing the proportion of candidate rare-splice-disrupting variants 10-fold and substantially decreasing the effect of sequencing depth on the number of reported outliers. To lower the multiple-testing correction burden, we introduce an option to select the genes to be tested for each sample instead of a transcriptome-wide approach. This option can be particularly useful when prior information, such as candidate variants or genes, is available. Application to 303 rare-disease samples confirmed the relative reduction in the number of outlier calls for a slight loss of sensitivity; FRASER 2.0 recovered 22 out of 26 previously identified pathogenic splicing cases with default cutoffs and 24 when multiple-testing correction was limited to OMIM genes containing rare variants. Altogether, these methodological improvements contribute to more effective RNA-seq-based rare-disease diagnostics by drastically reducing the number of splicing outlier calls per sample at minimal loss of sensitivity.
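As a rough illustration of how one intron-level value can fold the alternative donor, alternative acceptor, and intron-retention signals together, the sketch below computes a Jaccard-style ratio: split reads supporting the intron, divided by all reads (split or unspliced) that touch either of its splice sites. The exact formula, the function name `intron_jaccard_index`, and the argument layout are assumptions made for this example; the abstract itself does not spell out the definition.

```python
def intron_jaccard_index(k_intron, donor_split_total, acceptor_split_total,
                         donor_nonsplit, acceptor_nonsplit):
    """Jaccard-style intron-excision ratio (illustrative assumption).

    k_intron             -- split reads spanning exactly this donor-acceptor pair
    donor_split_total    -- all split reads using the donor site (includes k_intron)
    acceptor_split_total -- all split reads using the acceptor site (includes k_intron)
    donor_nonsplit       -- unspliced reads overlapping the donor site
    acceptor_nonsplit    -- unspliced reads overlapping the acceptor site
    """
    # Union of reads seen at either splice site. k_intron is contained in
    # both split totals, so it is subtracted once to avoid double counting.
    union = (donor_split_total + acceptor_split_total
             + donor_nonsplit + acceptor_nonsplit - k_intron)
    if union == 0:
        return float("nan")  # no coverage at either site: metric undefined
    return k_intron / union
```

Under this reading, competing donors or acceptors enlarge the split totals and retained introns enlarge the unspliced counts, so any of the three aberrations lowers the same value.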
Subject(s)
Alternative Splicing, RNA Splicing, Humans, Alternative Splicing/genetics, Introns/genetics, RNA Splicing/genetics, RNA-Seq, Algorithms
ABSTRACT
RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.
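The statistical step described above (negative binomial counts with a gene-specific dispersion, followed by false-discovery-rate adjustment) can be sketched in a few lines. This is a minimal illustration, not OUTRIDER's implementation: the expected count `mu` is assumed to be supplied by some upstream model, the two-sided p-value convention is one common choice, and Benjamini-Hochberg stands in for whichever FDR procedure the tool uses. All function names are hypothetical.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial
    with mean mu (> 0) and dispersion (size) theta (> 0)."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def nb_two_sided_pvalue(k, mu, theta):
    """Two-sided p-value: twice the smaller tail probability, capped at 1.

    The CDF is summed term by term, which is fine for a sketch but
    numerically crude for very large counts or extreme upper tails.
    """
    lower = sum(math.exp(nb_log_pmf(i, mu, theta)) for i in range(k + 1))  # P(X <= k)
    upper = max(0.0, 1.0 - lower + math.exp(nb_log_pmf(k, mu, theta)))     # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, `nb_two_sided_pvalue(0, mu=100, theta=10)` is vanishingly small because a zero count is far below an expectation of 100 reads, whereas a count close to 100 yields a p-value near 1.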
Subject(s)
Gene Expression/genetics, Genetic Variation/genetics, RNA/metabolism, RNA Sequence Analysis/methods, Algorithms, Gene Expression Profiling/methods, High-Throughput Nucleotide Sequencing/methods, Humans
ABSTRACT
MDH2 encodes mitochondrial malate dehydrogenase (MDH), which is essential for the conversion of malate to oxaloacetate as part of the proper functioning of the Krebs cycle. We report bi-allelic pathogenic mutations in MDH2 in three unrelated subjects presenting with early-onset generalized hypotonia, psychomotor delay, refractory epilepsy, and elevated lactate in the blood and cerebrospinal fluid. Functional studies in fibroblasts from affected subjects showed both an apparently complete loss of MDH2 levels and MDH2 enzymatic activity close to null. Metabolomics analyses demonstrated a significant concomitant accumulation of the MDH substrate, malate, and fumarate, its immediate precursor in the Krebs cycle, in affected subjects' fibroblasts. Lentiviral complementation with wild-type MDH2 cDNA restored MDH2 levels and mitochondrial MDH activity. Additionally, introduction of the three missense mutations from the affected subjects into Saccharomyces cerevisiae provided functional evidence to support their pathogenicity. Disruption of the Krebs cycle is a hallmark of cancer, and MDH2 has been recently identified as a novel pheochromocytoma and paraganglioma susceptibility gene. We show that loss-of-function mutations in MDH2 are also associated with severe neurological clinical presentations in children.
Subject(s)
Brain Diseases/genetics, Citric Acid Cycle, Malate Dehydrogenase/genetics, Mutation, Age of Onset, Alleles, Amino Acid Sequence, Child, Preschool Child, Citric Acid Cycle/genetics, Fibroblasts/enzymology, Fibroblasts/metabolism, Fumarates/metabolism, Genetic Complementation Test, Humans, Infant, Newborn Infant, Malate Dehydrogenase/chemistry, Malate Dehydrogenase/metabolism, Malates/metabolism, Male, Metabolomics, Molecular Models
ABSTRACT
Abnormalities in alternative splicing are a hallmark of cancer formation. In this study, we investigated the role of the splicing factor PHD finger protein 5A (PHF5A) in melanoma. Malignant melanoma is the deadliest form of skin cancer, and patients with high PHF5A expression show poor overall survival. Our data revealed that siRNA-mediated downregulation of PHF5A in different melanoma cell lines leads to massive splicing defects in different tumour-relevant genes. The loss of PHF5A results in an increased rate of apoptosis by triggering Fas- and unfolded protein response (UPR)-mediated apoptosis pathways in melanoma cells. These findings are tumour-specific because we did not observe this regulation in fibroblasts. Our study identifies a crucial role of PHF5A as a driver of melanoma malignancy, and the underlying splicing network described here provides an interesting basis for the development of new therapeutic targets for this aggressive form of skin cancer.
ABSTRACT
Detection of aberrantly spliced genes is an important step in RNA-seq-based rare disease diagnostics. We recently developed FRASER, a denoising autoencoder-based method for aberrant splicing detection that outperformed alternative approaches. However, as FRASER's three splice metrics are partially redundant and tend to be sensitive to sequencing depth, we introduce here a more robust intron excision metric, the Intron Jaccard Index, that combines alternative donor, alternative acceptor, and intron retention signals into a single value. Moreover, we optimized model parameters and filter cutoffs using candidate rare splice-disrupting variants as independent evidence. On 16,213 GTEx samples, our improved algorithm typically called 10 times fewer splicing outliers while increasing the proportion of candidate rare splice-disrupting variants 10-fold and substantially decreasing the effect of sequencing depth on the number of reported outliers. Application to 303 rare disease samples confirmed the fold reduction in the number of outlier calls for a slight loss of sensitivity (only 2 out of 22 previously identified pathogenic splicing cases not recovered). Altogether, these methodological improvements contribute to more effective RNA-seq-based rare disease diagnostics by drastically reducing the number of splicing outlier calls per sample at minimal loss of sensitivity.
ABSTRACT
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
ABSTRACT
Aberrant splicing is a major cause of genetic disorders, but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for affecting splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset, spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve a maximum precision of 12%. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision threefold at the same recall. Integrating RNA-sequencing data of clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to noncoding loss-of-function variant identification and to genetic diagnostics design and analytics.
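The comparison above is phrased as precision at a fixed recall. A minimal sketch of that evaluation follows; the function name `precision_at_recall` and the toy inputs are hypothetical, ties between scores are not handled specially, and nothing here reflects how the benchmark itself was computed.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the first score cutoff that reaches target_recall.

    scores -- predicted scores, higher meaning more likely aberrant
    labels -- 1 for a true aberrant event, 0 otherwise
    """
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive labels")
    # Rank predictions from highest to lowest score and lower the cutoff
    # one prediction at a time until the requested recall is reached.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    for n_called, (_, label) in enumerate(ranked, start=1):
        tp += label
        if tp / total_pos >= target_recall:
            return tp / n_called
    return tp / len(ranked)  # not reached for a valid target_recall
```

With scores `[0.9, 0.8, 0.7, 0.6, 0.5]` and labels `[1, 0, 1, 0, 0]`, a target recall of 0.5 is met after the top prediction (precision 1.0), while full recall requires the top three predictions (precision 2/3).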
Subject(s)
Alternative Splicing, RNA Splicing, Humans, RNA Splicing/genetics, Alternative Splicing/genetics, RNA Sequence Analysis/methods, Transcriptome, Protein Isoforms
ABSTRACT
We present the results of the human genomic small variant calling benchmarking initiative of the German Research Foundation (DFG) funded Next Generation Sequencing Competence Network (NGS-CN) and the German Human Genome-Phenome Archive (GHGA). In this effort, we developed NCBench, a continuous benchmarking platform for the evaluation of small genomic variant callsets in terms of recall, precision, and false positive/negative error patterns. NCBench is implemented as a continuously re-evaluated open-source repository. We show that it is possible to entirely rely on public free infrastructure (GitHub, GitHub Actions, Zenodo) in combination with established open-source tools. NCBench is agnostic to the dataset used and can evaluate an arbitrary number of given callsets, while reporting the results in a visual and interactive way. We used NCBench to evaluate over 40 callsets generated by various variant calling pipelines available in the participating groups that were run on three exome datasets from different enrichment kits and at different coverages. While all pipelines achieve high overall quality, subtle systematic differences between callers and datasets exist and are made apparent by NCBench. These insights are useful to improve existing pipelines and develop new workflows. NCBench is meant to be open for the contribution of any given callset. Most importantly, for authors, it will enable the omission of repeated re-implementation of paper-specific variant calling benchmarks for the publication of new tools or pipelines, while readers will benefit from being able to (continuously) observe the performance of tools and pipelines at the time of reading instead of at the time of writing.
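To make the reported quantities concrete, the sketch below scores a callset against a truth set by exact matching of `(chrom, pos, ref, alt)` tuples. This is a deliberate simplification and an assumption of this example: it ignores variant normalization, genotype, and confident-region filtering, and it is not a description of how NCBench performs the comparison. The function name `callset_metrics` is hypothetical.

```python
def callset_metrics(truth, calls):
    """Recall, precision, and error counts for one callset.

    truth, calls -- iterables of (chrom, pos, ref, alt) tuples
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if calls else float("nan")
    recall = tp / (tp + fn) if truth else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}
```

Keeping the false-positive and false-negative sets (rather than only their sizes) would be the natural next step for inspecting error patterns per caller or per dataset.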
Subject(s)
Benchmarking, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis, Humans, Benchmarking/methods, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards, Software, Human Genome, Genetic Variation, Reproducibility of Results, Genomics/methods
ABSTRACT
BACKGROUND: Lack of functional evidence hampers variant interpretation, leaving a large proportion of individuals with a suspected Mendelian disorder without genetic diagnosis after whole genome or whole exome sequencing (WES). Research studies advocate additionally sequencing transcriptomes to directly and systematically probe gene expression defects. However, collection of additional biopsies and establishment of lab workflows, analytical pipelines, and defined concepts in clinical interpretation of aberrant gene expression are still needed for adopting RNA sequencing (RNA-seq) in routine diagnostics. METHODS: We implemented an automated RNA-seq protocol and a computational workflow with which we analyzed skin fibroblasts of 303 individuals with a suspected mitochondrial disease who had previously undergone WES. We also assessed through simulations how aberrant expression and mono-allelic expression tests depend on RNA-seq coverage. RESULTS: We detected on average 12,500 genes per sample, including around 60% of all disease genes, a coverage substantially higher than with whole blood, supporting the use of skin biopsies. We prioritized genes demonstrating aberrant expression, aberrant splicing, or mono-allelic expression. The pipeline required less than 1 week from sample preparation to result reporting and provided a median of eight disease-associated genes per patient for inspection. A genetic diagnosis was established for 16% of the 205 WES-inconclusive cases. Detection of aberrant expression was a major contributor to diagnosis, including instances of 50% reduction, which, together with mono-allelic expression, allowed for the diagnosis of dominant disorders caused by haploinsufficiency. Moreover, calling aberrant splicing and variants from RNA-seq data enabled detecting and validating splice-disrupting variants, of which the majority fell outside WES-covered regions.
CONCLUSION: Together, these results show that streamlined experimental and computational processes can accelerate the implementation of RNA-seq in routine diagnostics.
Subject(s)
RNA, Transcriptome, Alleles, Humans, RNA Sequence Analysis/methods, Exome Sequencing
ABSTRACT
Aberrant splicing is a major cause of rare diseases. However, its prediction from genome sequence alone remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we develop FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing methods, FRASER captures not only alternative splicing but also intron retention events. This typically doubles the number of detected aberrant events and identified a pathogenic intron retention in MCOLN1 causing mucolipidosis. FRASER automatically controls for latent confounders, which are widespread and affect sensitivity substantially. Moreover, FRASER is based on a count distribution and multiple testing correction, thus reducing the number of calls by two orders of magnitude over commonly applied z score cutoffs, with a minor loss of sensitivity. Applying FRASER to rare disease diagnostics is demonstrated by reprioritizing a pathogenic aberrant exon truncation in TAZ from a published dataset. FRASER is easy to use and freely available.
Subject(s)
Algorithms, Alternative Splicing, Computational Biology/methods, RNA-Seq/methods, RNA Sequence Analysis/methods, Internet, Introns/genetics, Software
ABSTRACT
RNA sequencing (RNA-seq) has emerged as a powerful approach to discover disease-causing gene regulatory defects in individuals affected by genetically undiagnosed rare disorders. Pioneering studies have shown that RNA-seq could increase the diagnosis rates over DNA sequencing alone by 8-36%, depending on the disease entity and tissue probed. To accelerate adoption of RNA-seq by human genetics centers, detailed analysis protocols are now needed. We present a step-by-step protocol that details how to robustly detect aberrant expression levels, aberrant splicing and mono-allelic expression in RNA-seq data using dedicated statistical methods. We describe how to generate and assess quality control plots and interpret the analysis results. The protocol is based on the detection of RNA outliers pipeline (DROP), a modular computational workflow that integrates all the analysis steps, can leverage parallel computing infrastructures and generates browsable web page reports.
Subject(s)
Base Sequence/genetics, Gene Expression/genetics, RNA Sequence Analysis/methods, Diagnosis, Diagnostic Techniques and Procedures, Disease/genetics, Gene Expression Profiling/methods, High-Throughput Nucleotide Sequencing/methods, Humans, RNA/genetics, Software, Workflow
ABSTRACT
In hyper-IgE syndromes (HIES), a group of primary immunodeficiencies clinically overlapping with atopic dermatitis, early diagnosis is crucial to initiate appropriate therapy and prevent irreversible complications. Identification of underlying gene defects such as in DOCK8 and STAT3 and corresponding molecular testing has improved diagnosis. Yet, in a child and her newborn sibling with an HIES phenotype, molecular diagnosis was misleading. Extensive analyses driven by the clinical phenotype identified an intronic homozygous DOCK8 variant, c.4626+76A>G, creating a novel splice site, as disease-causing. While the affected newborn carrying the homozygous variant had no expression of DOCK8 protein, in the index patient molecular diagnosis was compromised due to expression of altered and wildtype DOCK8 transcripts and DOCK8 protein as well as defective STAT3 signaling. Sanger sequencing of lymphocyte subsets revealed that somatic alterations and reversions revoked the predominance of the novel over the canonical splice site in the index patient, explaining DOCK8 protein expression, whereas defective STAT3 responses in the index patient were explained by a T cell phenotype skewed towards central and effector memory T cells. Hence, somatic alterations and skewed immune cell phenotypes due to selective pressure may compromise molecular diagnosis and need to be considered when clinical and molecular findings are unexpected.
Subject(s)
Guanine Nucleotide Exchange Factors/genetics, Introns/genetics, Job Syndrome/genetics, Mutation, RNA Splice Sites/genetics, Base Sequence, Preschool Child, Computational Biology, Female, Gene Expression Regulation/genetics, Humans, Infant, Job Syndrome/pathology, Molecular Diagnostic Techniques, Pregnancy, STAT3 Transcription Factor/metabolism, Signal Transduction/genetics
ABSTRACT
Across a variety of Mendelian disorders, ~50-75% of patients do not receive a genetic diagnosis by exome sequencing indicating disease-causing variants in non-coding regions. Although genome sequencing in principle reveals all genetic variants, their sizeable number and poorer annotation make prioritization challenging. Here, we demonstrate the power of transcriptome sequencing to molecularly diagnose 10% (5 of 48) of mitochondriopathy patients and identify candidate genes for the remainder. We find a median of one aberrantly expressed gene, five aberrant splicing events and six mono-allelically expressed rare variants in patient-derived fibroblasts and establish disease-causing roles for each kind. Private exons often arise from cryptic splice sites providing an important clue for variant prioritization. One such event is found in the complex I assembly factor TIMMDC1 establishing a novel disease-associated gene. In conclusion, our study expands the diagnostic tools for detecting non-exonic variants and provides examples of intronic loss-of-function variants with pathological relevance.
Subject(s)
Gene Expression Profiling, Mitochondrial Diseases/genetics, RNA Sequence Analysis, Diagnostic Techniques and Procedures, Humans, RNA Splicing
ABSTRACT
We identify SMARCD2 (SWI/SNF-related, matrix-associated, actin-dependent regulator of chromatin, subfamily D, member 2), also known as BAF60b (BRG1/Brahma-associated factor 60b), as a critical regulator of myeloid differentiation in humans, mice, and zebrafish. Studying patients from three unrelated pedigrees characterized by neutropenia, specific granule deficiency, myelodysplasia with excess of blast cells, and various developmental aberrations, we identified three homozygous loss-of-function mutations in SMARCD2. Using mice and zebrafish as model systems, we showed that SMARCD2 controls early steps in the differentiation of myeloid-erythroid progenitor cells. In vitro, SMARCD2 interacts with the transcription factor CEBPε and controls expression of neutrophil proteins stored in specific granules. Defective expression of SMARCD2 leads to transcriptional and chromatin changes in acute myeloid leukemia (AML) human promyelocytic cells. In summary, SMARCD2 is a key factor controlling myelopoiesis and is a potential tumor suppressor in leukemia.
Subject(s)
Cell Differentiation/genetics, Gene Regulatory Networks, Neutrophils/metabolism, Transcription Factors/genetics, Animals, Genetically Modified Animals, Base Sequence, Tumor Cell Line, Chromatin Assembly and Disassembly, Non-Histone Chromosomal Proteins, DNA Mutational Analysis, Family Health, Female, Humans, Acute Promyelocytic Leukemia/genetics, Acute Promyelocytic Leukemia/pathology, Male, Inbred C57BL Mice, Knockout Mice, Pedigree, Zebrafish
ABSTRACT
We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.