ABSTRACT
We introduce SPLASH2, a fast, scalable implementation of SPLASH based on an efficient k-mer counting approach for regulated sequence variation detection in massive datasets from a wide range of sequencing technologies and biological contexts. We demonstrate biological discovery by SPLASH2 in single-cell RNA sequencing (RNA-seq) data and in bulk RNA-seq data from the Cancer Cell Line Encyclopedia, including unannotated alternative splicing in cancer transcriptomes and sensitive detection of circular RNA.
ABSTRACT
Most plant genomes and their regulation remain unknown. We used SPLASH, a new reference-genome-free algorithm for detecting sequence variation, to analyze transcriptional and post-transcriptional regulation from RNA-seq data. We discovered differential homolog expression during maize pollen development and imbibition-dependent cryptic splicing in Arabidopsis seeds. SPLASH enables discovery of novel regulatory mechanisms, including differential regulation of genes from hybrid parental haplotypes, without the use of alignment to a reference genome.
ABSTRACT
Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. However, existing statistical tests are insufficient, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's $X^2$ test consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's $X^2$ test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
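The idea of a statistic that is linear in the centered, normalized table, with a closed-form bound from a concentration inequality, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `linear_table_bound` is hypothetical, the row scores `f` (in [0, 1]) and column weights `c` (in [-1, 1]) are assumed to be chosen independently of the table being tested, and the bound is derived here directly from Hoeffding's inequality with column totals treated as fixed, so it may differ in detail from the bound stated in the paper.

```python
import math

def linear_table_bound(table, f, c):
    """Linear statistic on a counts table plus a Hoeffding-style p-value bound.

    table: list of I rows, each a list of J non-negative counts.
    f:     I row scores in [0, 1], fixed independently of the table (assumption).
    c:     J column weights in [-1, 1], fixed independently of the table (assumption).
    """
    n_rows, n_cols = len(table), len(table[0])
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    if any(n == 0 for n in col_totals):
        raise ValueError("drop empty columns before testing")
    total = sum(col_totals)

    # Mean of f within each column, and pooled over the whole table.
    col_means = [
        sum(f[i] * table[i][j] for i in range(n_rows)) / col_totals[j]
        for j in range(n_cols)
    ]
    pooled = sum(col_means[j] * col_totals[j] for j in range(n_cols)) / total

    # Statistic: weighted sum of sqrt(n_j)-scaled deviations from the pooled mean.
    stat = sum(
        c[j] * math.sqrt(col_totals[j]) * (col_means[j] - pooled)
        for j in range(n_cols)
    )

    # Under the null (all columns share one row distribution), stat is a sum of
    # independent, bounded, zero-mean terms; Hoeffding's inequality then gives
    # P(|stat| >= t) <= 2 * exp(-2 t^2 / (||c||^2 - <c, sqrt(n)>^2 / M)).
    c_norm_sq = sum(x * x for x in c)
    overlap = sum(c[j] * math.sqrt(col_totals[j]) for j in range(n_cols))
    variance_proxy = c_norm_sq - overlap ** 2 / total
    if variance_proxy <= 1e-12:
        return stat, 1.0  # c is proportional to sqrt(n): the statistic carries no signal
    return stat, min(1.0, 2.0 * math.exp(-2.0 * stat ** 2 / variance_proxy))
```

For example, `linear_table_bound([[50, 0], [0, 50]], [1, 0], [1, -1])` scores a table whose two columns have disjoint row support, while a table with identical column proportions yields a statistic of zero and a bound of 1.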
Subject(s)
Genome, Genomics, Chromosome Mapping
ABSTRACT
Early stages of deadly respiratory diseases including COVID-19 are challenging to elucidate in humans. Here, we define cellular tropism and transcriptomic effects of SARS-CoV-2 virus by productively infecting healthy human lung tissue and using scRNA-seq to reconstruct the transcriptional program in "infection pseudotime" for individual lung cell types. SARS-CoV-2 predominantly infected activated interstitial macrophages (IMs), which can accumulate thousands of viral RNA molecules, taking over 60% of the cell transcriptome and forming dense viral RNA bodies while inducing host profibrotic (TGFB1, SPP1) and inflammatory (early interferon response, CCL2/7/8/13, CXCL10, and IL6/10) programs and destroying host cell architecture. Infected alveolar macrophages (AMs) showed none of these extreme responses. Spike-dependent viral entry into AMs used ACE2 and Sialoadhesin/CD169, whereas IM entry used DC-SIGN/CD209. These results identify activated IMs as a prominent site of viral takeover, the focus of inflammation and fibrosis, and suggest targeting CD209 to prevent early pathology in COVID-19 pneumonia. This approach can be generalized to any human lung infection and to evaluate therapeutics.
Subject(s)
COVID-19, Humans, SARS-CoV-2, Macrophages, Inflammation, Viral RNA, Lung
ABSTRACT
SPLASH is an unsupervised, reference-free, and unifying algorithm that discovers regulated sequence variation through statistical analysis of k-mer composition, subsuming many application-specific methods. Here, we introduce SPLASH2, a fast, scalable implementation of SPLASH based on an efficient k-mer counting approach. SPLASH2 enables rapid analysis of massive datasets from a wide range of sequencing technologies and biological contexts, delivering unparalleled scale and speed. The SPLASH2 algorithm unveils new biology (without tuning) in single-cell RNA-sequencing data from human muscle cells, as well as bulk RNA-seq from the entire Cancer Cell Line Encyclopedia (CCLE), including substantial unannotated alternative splicing in cancer transcriptomes. The same untuned SPLASH2 algorithm recovers the BCR-ABL gene fusion, and detects circRNA sensitively and specifically, underscoring SPLASH2's unmatched precision and scalability across diverse RNA-seq detection tasks.
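As a rough illustration of what "statistical analysis of k-mer composition" operates on, the sketch below tallies, for each fixed k-mer (an anchor), the k-mers that follow it (targets) per sample, giving one target-by-sample count table per anchor. It is a toy sketch under stated assumptions: the function name `anchor_target_counts`, the default `k`, and the `gap` parameter are hypothetical choices for illustration, and the code does not reproduce SPLASH2's optimized k-mer counting.

```python
def anchor_target_counts(reads_by_sample, k=4, gap=0):
    """Count (anchor, target) k-mer pairs per sample.

    reads_by_sample: dict mapping sample name -> list of read strings.
    k:   k-mer length (illustrative default, not a recommended setting).
    gap: number of bases skipped between anchor and target (hypothetical knob).

    Returns {anchor: {target: {sample: count}}}, i.e. one contingency
    table (targets x samples) per anchor.
    """
    tables = {}
    span = 2 * k + gap  # bases covered by anchor + gap + target
    for sample, reads in reads_by_sample.items():
        for read in reads:
            for start in range(len(read) - span + 1):
                anchor = read[start:start + k]
                target = read[start + k + gap:start + span]
                per_target = tables.setdefault(anchor, {}).setdefault(target, {})
                per_target[sample] = per_target.get(sample, 0) + 1
    return tables


# Two samples share the anchor ACGT but differ in what follows it.
example = anchor_target_counts({"s1": ["ACGTAAAA"], "s2": ["ACGTCCCC"]})
```

An anchor whose targets are distributed very differently across samples, as `ACGT` is in this toy input, is the kind of sample-specific sequence variation a contingency-table test would then flag.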
ABSTRACT
Today's genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), which directly analyzes raw sequencing data, using a statistical test to detect a signature of regulation: sample-specific sequence variation. SPLASH detects many types of variation and can be efficiently run at scale. We show that SPLASH identifies complex mutation patterns in SARS-CoV-2, discovers regulated RNA isoforms at the single-cell level, detects the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a unifying approach to genomic analysis that enables expansive discovery without metadata or references.
Subject(s)
Algorithms, Genomics, Genome, RNA Sequence Analysis, Humans, HLA Antigens/genetics, Single-Cell Analysis
ABSTRACT
Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. However, existing statistical tests are insufficient, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference (1), we develop OASIS (Optimized Adaptive Statistic for Inferring Structure), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form p-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's p-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. The same method based on OASIS significance calls detects SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which cannot be achieved with current approaches. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS still provides good control of the false discovery rate, while Pearson's $X^2$ test consistently rejects the null. Additionally, we show on synthetic data that OASIS is more powerful than Pearson's $X^2$ test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
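For reference, the Pearson baseline mentioned above can be written in a few lines. This is a minimal sketch of the textbook statistic, sum of (observed - expected)^2 / expected with expected counts taken from the row and column totals; the function name `pearson_x2` is a hypothetical choice, and the sketch returns only the statistic and its degrees of freedom rather than a p-value.

```python
def pearson_x2(table):
    """Pearson's X^2 statistic and degrees of freedom for a counts table.

    table: list of rows, each a list of non-negative counts.
    Rows or columns that sum to zero must be removed first, because
    their expected counts would be zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("remove empty rows and columns before testing")

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```

The usual p-value compares the statistic to a chi-squared distribution with `dof` degrees of freedom (for instance via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available). That reference distribution is asymptotic, which is the finite-sample limitation the abstract points to.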
ABSTRACT
Diversity-generating and mobile genetic elements are key to microbial and viral evolution and can result in evolutionary leaps. State-of-the-art algorithms to detect these elements have limitations. Here, we introduce DIVE, a new reference-free approach to overcome these limitations using information contained in sequencing reads alone. We show that DIVE has improved detection power compared to existing reference-based methods using simulations and real data. We use DIVE to rediscover and characterize the activity of known and novel elements and generate new biological hypotheses about the mobilome. Building on DIVE, we develop a reference-free framework capable of de novo discovery of mobile genetic elements.
Subject(s)
Horizontal Gene Transfer, Interspersed Repetitive Sequences, DNA Transposable Elements
ABSTRACT
The detection of circular RNA molecules (circRNAs) is typically based on short-read RNA sequencing data processed using computational tools. Numerous such tools have been developed, but a systematic comparison with orthogonal validation is missing. Here, we set up a circRNA detection tool benchmarking study, in which 16 tools detected more than 315,000 unique circRNAs in three deeply sequenced human cell types. Next, 1,516 predicted circRNAs were validated using three orthogonal methods. Generally, tool-specific precision is high and similar (median of 98.8%, 96.3% and 95.5% for qPCR, RNase R and amplicon sequencing, respectively) whereas the sensitivity and number of predicted circRNAs (ranging from 1,372 to 58,032) are the most significant differentiators. Of note, precision values are lower when evaluating low-abundance circRNAs. We also show that the tools can be used complementarily to increase detection sensitivity. Finally, we offer recommendations for future circRNA detection and validation.
Subject(s)
Benchmarking, Circular RNA, Humans, Circular RNA/genetics, RNA/genetics, RNA/metabolism, RNA Sequence Analysis/methods
ABSTRACT
The authors have withdrawn this manuscript due to a duplicate posting of manuscript number BIORXIV/2022/497555. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author. The correct preprint can be found at doi: https://doi.org/10.1101/2022.06.24.497555.
ABSTRACT
Technical advances have led to an explosion in the amount of biological data available in recent years, especially in the field of RNA sequencing. Specifically, spatial transcriptomics (ST) datasets, which allow each RNA molecule to be mapped to the 2D location it originated from within a tissue, have become readily available. Due to computational challenges, ST data has rarely been used to study RNA processing such as splicing or differential UTR usage. We apply the ReadZS and the SpliZ, methods developed to analyze RNA processing in scRNA-seq data, to analyze spatial localization of RNA processing directly from ST data for the first time. Using Moran's I metric for spatial autocorrelation, we identify genes with spatially regulated RNA processing in the mouse brain and kidney, rediscovering known spatial regulation in Myl6 and identifying previously unknown spatial regulation in genes such as Rps24, Gng13, Slc8a1, Gpm6a, Gpx3, ActB, Rps8, and S100A9. The rich set of discoveries made here from commonly used reference datasets provides a small taste of what can be learned by applying this technique more broadly to the large quantity of Visium data currently being created.
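Moran's I, the spatial autocorrelation metric named above, has a standard closed form: I = (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights. The sketch below is a minimal illustration, assuming one score per spot (for example, a per-spot RNA-processing score) and a user-supplied neighbor weighting; the helper `chain_weights` and the toy inputs are hypothetical and are not taken from the study's pipeline.

```python
def morans_i(values, weights):
    """Moran's I for per-spot values under a spatial weight map.

    values:  list of numeric scores, one per spot.
    weights: dict {(i, j): w} of spatial weights between distinct spots
             (e.g. 1.0 for adjacent spots); missing pairs count as 0.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    total_w = sum(weights.values())
    if denom == 0 or total_w == 0:
        raise ValueError("need non-constant values and at least one nonzero weight")
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / total_w) * (num / denom)


def chain_weights(n):
    """Toy neighborhood (hypothetical helper): spots on a line, adjacent spots linked."""
    w = {}
    for i in range(n - 1):
        w[(i, i + 1)] = 1.0
        w[(i + 1, i)] = 1.0
    return w


clustered = morans_i([1, 1, 0, 0], chain_weights(4))    # similar neighbors: positive
alternating = morans_i([1, 0, 1, 0], chain_weights(4))  # dissimilar neighbors: negative
```

Positive values indicate that neighboring spots carry similar scores, and negative values indicate alternation. In practice the weight map would come from the spot layout of the spatial assay rather than a one-dimensional chain.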
ABSTRACT
Today's genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a new unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), an approach that directly analyzes raw sequencing data to detect a signature of regulation: sample-specific sequence variation. The approach, which includes a new statistical test, is computationally efficient and can be run at scale. SPLASH unifies detection of myriad forms of sequence variation. We demonstrate that SPLASH identifies complex mutation patterns in SARS-CoV-2 strains, discovers regulated RNA isoforms at the single-cell level, documents the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a new unifying approach to genomic analysis that enables an expansive scope of discovery without metadata or references.
ABSTRACT
RNA processing, including splicing and alternative polyadenylation, is crucial to gene function and regulation, but methods to detect RNA processing from single-cell RNA sequencing data are limited by reliance on pre-existing annotations, peak calling heuristics, and collapsing measurements by cell type. We introduce ReadZS, an annotation-free statistical approach to identify regulated RNA processing in single cells. ReadZS discovers cell type-specific RNA processing in human lung and conserved, developmentally regulated RNA processing in mammalian spermatogenesis, including global 3' UTR shortening in human spermatogenesis. ReadZS also discovers global 3' UTR lengthening in Arabidopsis development, highlighting the usefulness of this method in under-annotated transcriptomes.
Subject(s)
Polyadenylation, Transcriptome, Animals, Humans, 3' Untranslated Regions, RNA-Seq, RNA Sequence Analysis/methods, Mammals/genetics
ABSTRACT
Trimethylguanosine synthase 1 (TGS1) is a highly conserved enzyme that converts the 5'-monomethylguanosine cap of small nuclear RNAs (snRNAs) to a trimethylguanosine cap. Here, we show that loss of TGS1 in Caenorhabditis elegans, Drosophila melanogaster and Danio rerio results in neurological phenotypes similar to those caused by survival motor neuron (SMN) deficiency. Importantly, expression of human TGS1 ameliorates the SMN-dependent neurological phenotypes in both flies and worms, revealing that TGS1 can partly counteract the effects of SMN deficiency. TGS1 loss in HeLa cells leads to the accumulation of immature U2 and U4atac snRNAs with long 3' tails that are often uridylated. snRNAs with defective 3' terminations also accumulate in Drosophila Tgs1 mutants. Consistent with defective snRNA maturation, TGS1 and SMN mutant cells also exhibit partially overlapping transcriptome alterations that include aberrantly spliced and readthrough transcripts. Together, these results identify a neuroprotective function for TGS1 and reinforce the view that defective snRNA maturation affects neuronal viability and function.
Subject(s)
Methyltransferases, Motor Neurons, Small Nuclear RNA, Animals, Humans, Caenorhabditis elegans/genetics, Caenorhabditis elegans/metabolism, Drosophila/genetics, Drosophila melanogaster/genetics, Drosophila melanogaster/metabolism, HeLa Cells, Motor Neurons/metabolism, Motor Neurons/pathology, Phenotype, Small Nuclear RNA/metabolism, Methyltransferases/metabolism
ABSTRACT
Molecular characterization of cell types using single-cell transcriptome sequencing is revolutionizing cell biology and enabling new insights into the physiology of human organs. We created a human reference atlas comprising nearly 500,000 cells from 24 different tissues and organs, many from the same donor. This atlas enabled molecular characterization of more than 400 cell types, their distribution across tissues, and tissue-specific variation in gene expression. Using multiple tissues from a single donor enabled identification of the clonal distribution of T cells between tissues, identification of the tissue-specific mutation rate in B cells, and analysis of the cell cycle state and proliferative potential of shared cell types across tissues. Cell type-specific RNA splicing was discovered and analyzed across tissues within an individual.
Subject(s)
Atlases as Topic, Cells, Organ Specificity, RNA Splicing, Single-Cell Analysis, Transcriptome, B-Lymphocytes/metabolism, Cells/metabolism, Humans, Organ Specificity/genetics, T-Lymphocytes/metabolism
ABSTRACT
Detecting single-cell-regulated splicing from droplet-based technologies is challenging. Here, we introduce the splicing Z score (SpliZ), an annotation-free statistical method to detect regulated splicing in single-cell RNA sequencing. We applied the SpliZ to human lung cells, discovering hundreds of genes with cell-type-specific splicing patterns including ones with potential implications for basic and translational biology.
Subject(s)
Alternative Splicing, RNA Splicing, Humans
ABSTRACT
The extent to which splicing is regulated at single-cell resolution has remained controversial because of limitations in both the available data and the methods used to interpret them. We apply the SpliZ, a new statistical approach, to detect cell-type-specific splicing in >110K cells from 12 human tissues. Using 10X Chromium data for discovery, 9.1% of genes with computable SpliZ scores are cell-type-specifically spliced, including ubiquitously expressed genes MYL6 and RPS24. These results are validated with RNA FISH, single-cell PCR, and Smart-seq2. SpliZ analysis reveals 170 genes with regulated splicing during human spermatogenesis, including examples conserved in mouse and mouse lemur. The SpliZ allows model-based identification of subpopulations indistinguishable based on gene expression, illustrated by subpopulation-specific splicing of classical monocytes involving an ultraconserved exon in SAT1. Together, this analysis of differential splicing across multiple organs establishes that splicing is regulated cell-type-specifically.
Subject(s)
Cheirogaleidae/genetics, Mice/genetics, RNA Splicing, Single-Cell Analysis, Animals
ABSTRACT
Precise splice junction calls are currently unavailable in scRNA-seq pipelines such as the 10x Chromium platform but are critical for understanding single-cell biology. Here, we introduce SICILIAN, a new method that assigns statistical confidence to splice junctions from a spliced aligner to improve precision. SICILIAN is a general method that can be applied to bulk or single-cell data, but it has particular utility for single-cell analysis because such data pose unique challenges and offer unique opportunities for discovery. SICILIAN's precise splice detection achieves high accuracy on simulated data, improves concordance between matched single-cell and bulk datasets, and increases agreement between biological replicates. SICILIAN detects unannotated splicing in single cells, enabling the discovery of novel splicing regulation through single-cell analysis workflows.