ABSTRACT
Today's genomics workflows typically require alignment to a reference sequence, which limits discovery. We introduce a unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), which directly analyzes raw sequencing data, using a statistical test to detect a signature of regulation: sample-specific sequence variation. SPLASH detects many types of variation and can be efficiently run at scale. We show that SPLASH identifies complex mutation patterns in SARS-CoV-2, discovers regulated RNA isoforms at the single-cell level, detects the vast sequence diversity of adaptive immune receptors, and uncovers biology in non-model organisms undocumented in their reference genomes: geographic and seasonal variation and diatom association in eelgrass, an oceanic plant impacted by climate change, and tissue-specific transcripts in octopus. SPLASH is a unifying approach to genomic analysis that enables expansive discovery without metadata or references.
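As an illustration of the kind of raw-read bookkeeping an alignment-free test could operate on, the following Python sketch tabulates, for each k-mer (here called an anchor), the k-mers that follow it (here called targets) in each sample. The anchor/target terminology, the function name `anchor_target_counts`, and the parameters `k` and `gap` are assumptions made for illustration; the abstract does not specify them, and this is not the SPLASH implementation.

```python
from collections import Counter, defaultdict


def anchor_target_counts(reads_by_sample, k=3, gap=0):
    """Count, per sample, which k-mer (target) follows each k-mer (anchor).

    reads_by_sample: dict mapping a sample name to a list of read strings.
    Returns a nested mapping: tables[anchor][target][sample] -> count.
    Hypothetical helper; names and defaults are illustrative only.
    """
    tables = defaultdict(lambda: defaultdict(Counter))
    for sample, reads in reads_by_sample.items():
        for read in reads:
            # Slide along the read; stop when anchor + gap + target no longer fit.
            for i in range(len(read) - 2 * k - gap + 1):
                anchor = read[i:i + k]
                start = i + k + gap
                target = read[start:start + k]
                tables[anchor][target][sample] += 1
    return tables


# Toy input: two samples share the anchor "ACG" but differ in what follows it.
toy = {"s1": ["ACGTTT"], "s2": ["ACGAAA"]}
tables = anchor_target_counts(toy, k=3, gap=0)
```

Each `tables[anchor]` is a target-by-sample count table. An anchor whose target distribution differs between samples is the kind of sample-specific sequence variation the abstract describes.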
Subjects
Algorithms; Genomics; Genome; Sequence Analysis, RNA; Humans; HLA Antigens/genetics; Single-Cell Analysis
ABSTRACT
Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's chi-squared (χ²) test consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's chi-squared (χ²) test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
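To make the idea of a linear statistic with a closed-form bound concrete, here is a minimal Python sketch. It applies Hoeffding's inequality under a simplifying assumption that is not in the abstract: the null row distribution `p` is treated as known rather than estimated from the table. The function name, the vectors `f` (row weights in [0, 1]) and `c` (column weights in [-1, 1]), and the exact normalization are illustrative choices, not the published OASIS definition.

```python
import math


def linear_stat_pvalue_bound(table, f, c, p):
    """Linear statistic on a centered count table with a Hoeffding bound.

    table[i][j]: count in row i, column j (columns are samples).
    f[i] in [0, 1]: row weights; c[j] in [-1, 1]: column weights.
    p[i]: row probabilities under the null, ASSUMED KNOWN here.
    Returns (statistic, upper bound on the two-sided P-value).
    """
    n_rows, n_cols = len(table), len(table[0])
    mu = sum(p_i * f_i for p_i, f_i in zip(p, f))  # null mean of f per count
    stat = 0.0
    for j in range(n_cols):
        n_j = sum(table[i][j] for i in range(n_rows))
        if n_j == 0:
            continue  # an empty column contributes nothing
        f_sum = sum(f[i] * table[i][j] for i in range(n_rows))
        stat += c[j] * (f_sum - n_j * mu) / math.sqrt(n_j)
    c_norm_sq = sum(c_j * c_j for c_j in c)
    if c_norm_sq == 0:
        return stat, 1.0
    # Each of the n_j observations in column j moves the statistic within a
    # range of width at most |c_j| / sqrt(n_j), so the squared widths sum to
    # at most ||c||^2 and Hoeffding gives P(|S| >= s) <= 2 exp(-2 s^2 / ||c||^2).
    bound = min(1.0, 2.0 * math.exp(-2.0 * stat ** 2 / c_norm_sq))
    return stat, bound


# Two samples with opposite row usage versus two identical samples.
split_stat, split_bound = linear_stat_pvalue_bound(
    [[50, 0], [0, 50]], f=[1, 0], c=[1, -1], p=[0.5, 0.5])
flat_stat, flat_bound = linear_stat_pvalue_bound(
    [[25, 25], [25, 25]], f=[1, 0], c=[1, -1], p=[0.5, 0.5])
```

Because the bound depends only on the statistic and the norm of `c`, it costs no more than one pass over the table. Handling an estimated `p`, as a practical test must, requires a correction that this sketch does not attempt.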
Subjects
Genome; Genomics; Chromosome Mapping
ABSTRACT
The detection of circular RNA molecules (circRNAs) is typically based on short-read RNA sequencing data processed using computational tools. Numerous such tools have been developed, but a systematic comparison with orthogonal validation is missing. Here, we set up a circRNA detection tool benchmarking study, in which 16 tools detected more than 315,000 unique circRNAs in three deeply sequenced human cell types. Next, 1,516 predicted circRNAs were validated using three orthogonal methods. Generally, tool-specific precision is high and similar (median of 98.8%, 96.3% and 95.5% for qPCR, RNase R and amplicon sequencing, respectively) whereas the sensitivity and number of predicted circRNAs (ranging from 1,372 to 58,032) are the most significant differentiators. Of note, precision values are lower when evaluating low-abundance circRNAs. We also show that the tools can be used complementarily to increase detection sensitivity. Finally, we offer recommendations for future circRNA detection and validation.
Subjects
Benchmarking; RNA, Circular; Humans; RNA, Circular/genetics; RNA/genetics; RNA/metabolism; Sequence Analysis, RNA/methods
ABSTRACT
Detecting single-cell-regulated splicing from droplet-based technologies is challenging. Here, we introduce the splicing Z score (SpliZ), an annotation-free statistical method to detect regulated splicing in single-cell RNA sequencing. We applied the SpliZ to human lung cells, discovering hundreds of genes with cell-type-specific splicing patterns including ones with potential implications for basic and translational biology.
Subjects
Alternative Splicing; RNA Splicing; Humans
ABSTRACT
Trimethylguanosine synthase 1 (TGS1) is a highly conserved enzyme that converts the 5'-monomethylguanosine cap of small nuclear RNAs (snRNAs) to a trimethylguanosine cap. Here, we show that loss of TGS1 in Caenorhabditis elegans, Drosophila melanogaster and Danio rerio results in neurological phenotypes similar to those caused by survival motor neuron (SMN) deficiency. Importantly, expression of human TGS1 ameliorates the SMN-dependent neurological phenotypes in both flies and worms, revealing that TGS1 can partly counteract the effects of SMN deficiency. TGS1 loss in HeLa cells leads to the accumulation of immature U2 and U4atac snRNAs with long 3' tails that are often uridylated. snRNAs with defective 3' terminations also accumulate in Drosophila Tgs1 mutants. Consistent with defective snRNA maturation, TGS1 and SMN mutant cells also exhibit partially overlapping transcriptome alterations that include aberrantly spliced and readthrough transcripts. Together, these results identify a neuroprotective function for TGS1 and reinforce the view that defective snRNA maturation affects neuronal viability and function.
Subjects
Methyltransferases; Motor Neurons; RNA, Small Nuclear; Animals; Humans; Caenorhabditis elegans/genetics; Caenorhabditis elegans/metabolism; Drosophila/genetics; Drosophila melanogaster/genetics; Drosophila melanogaster/metabolism; HeLa Cells; Motor Neurons/metabolism; Motor Neurons/pathology; Phenotype; RNA, Small Nuclear/metabolism; Methyltransferases/metabolism
ABSTRACT
The pervasive expression of circular RNAs (circRNAs) is a recently discovered feature of gene expression in highly diverged eukaryotes. Numerous algorithms that are used to detect genome-wide circRNA expression from RNA sequencing (RNA-seq) data have been developed in the past few years, but there is little overlap in their predictions and no clear gold-standard method to assess the accuracy of these algorithms. We review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches to address these biases. We conclude with a discussion of the current experimental progress on the topic.
Subjects
Computational Biology/methods; Molecular Sequence Annotation/statistics & numerical data; RNA/metabolism; Sequence Analysis, RNA/methods; Databases, Nucleic Acid; Humans; Molecular Sequence Annotation/methods; RNA/chemistry; RNA, Circular; Software
ABSTRACT
The extent to which gene fusions function as drivers of cancer remains a critical open question. Current algorithms do not sufficiently identify false-positive fusions arising during library preparation, sequencing, and alignment. Here, we introduce Data-Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), an algorithm that uses statistical modeling to minimize false-positives while increasing the sensitivity of fusion detection. In 9,946 tumor RNA-sequencing datasets from The Cancer Genome Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 fusions, 30% more than identified by other methods, while calling 10-fold fewer false-positive fusions in nontransformed human tissues. We leverage the increased precision of DEEPEST to discover fundamental cancer biology. Namely, 888 candidate oncogenes are identified based on overrepresentation in DEEPEST calls, and 1,078 previously unreported fusions involving long intergenic noncoding RNAs, demonstrating a previously unappreciated prevalence and potential for function. DEEPEST also reveals a high enrichment for fusions involving oncogenes in cancers, including ovarian cancer, which has had minimal treatment advances in recent decades, finding that more than 50% of tumors harbor gene fusions predicted to be oncogenic. Specific protein domains are enriched in DEEPEST calls, indicating a global selection for fusion functionality: kinase domains are nearly 2-fold more enriched in DEEPEST calls than expected by chance, as are domains involved in (anaerobic) metabolism and DNA binding. The statistical algorithms, population-level analytic framework, and the biological conclusions of DEEPEST call for increased attention to gene fusions as drivers of cancer and for future research into using fusions for targeted therapy.
Subjects
Gene Fusion; Neoplasms/genetics; Oncogenes; RNA, Neoplasm/genetics; Statistics as Topic; Algorithms; Base Sequence; Databases, Genetic; Genomic Instability; Humans; Proteome/metabolism; RNA, Long Noncoding/genetics; RNA, Long Noncoding/metabolism
ABSTRACT
MOTIVATION: Identification of splice sites is critical to gene annotation and to determining which sequences control circRNA biogenesis. Full-length RNA transcripts could in principle complete annotations of introns and exons in genomes without external ontologies, i.e., ab initio. However, whether it is possible to reconstruct genomic positions where splicing occurs from full-length transcripts, even if sampled in the absence of noise, depends on the genome sequence composition. If it is not, there exist provable limits on the use of RNA-Seq to define splice locations (linear or circular) in the genome. RESULTS: We provide a formal definition of splice site ambiguity due to the genomic sequence by introducing the equivalent junction, which is the set of local genomic positions resulting in the same RNA sequence when joined through RNA splicing. We show that equivalent junctions are prevalent in diverse eukaryotic genomes and occur in 88.64% and 78.64% of annotated human splice sites in linear and circRNA junctions, respectively. The observed fractions of equivalent junctions and the frequency of many individual motifs are statistically significant when compared against the null distribution computed via simulation or in closed form. The frequency of equivalent junctions establishes a fundamental limit on the possibility of ab initio reconstruction of RNA transcripts without appealing to the ontology of "GT-AG" boundaries defining introns. Said differently, completely ab initio reconstruction is impossible in the vast majority of splice sites in annotated circRNAs and linear transcripts. AVAILABILITY AND IMPLEMENTATION: Two Python scripts generating an equivalent junction sequence per junction are available at: https://github.com/salzmanlab/Equivalent-Junctions. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
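The definition of an equivalent junction can be sketched in a few lines of Python. The example below is an illustrative reading of that definition, not the authors' scripts from the repository above: given a genome string, the index of the last upstream exonic base (`donor`) and the index of the first downstream exonic base (`acceptor`), it counts how far both positions can be shifted together without changing the spliced sequence. The function names and the 0-based coordinate convention are assumptions.

```python
def spliced(genome, donor, acceptor):
    """RNA obtained by joining genome[..donor] directly to genome[acceptor..]."""
    return genome[:donor + 1] + genome[acceptor:]


def equivalent_junction_shifts(genome, donor, acceptor):
    """Return (left, right): how far the junction can slide in each direction
    while producing exactly the same spliced sequence.

    donor: 0-based index of the last base kept from the upstream exon.
    acceptor: 0-based index of the first base kept from the downstream exon.
    """
    # Shifting right by one adds genome[donor + 1] to the upstream side and
    # removes genome[acceptor] from the downstream side; the RNA is unchanged
    # only if those two bases are identical. Repeat while that holds.
    right = 0
    while (donor + right + 1 < len(genome)
           and acceptor + right < len(genome)
           and genome[donor + right + 1] == genome[acceptor + right]):
        right += 1
    # Shifting left by one removes genome[donor] from the upstream side and
    # adds genome[acceptor - 1] to the downstream side.
    left = 0
    while (donor - left >= 0
           and acceptor - left - 1 >= 0
           and genome[donor - left] == genome[acceptor - left - 1]):
        left += 1
    return left, right


# Toy locus: exon "CCAG", intron "GTAAAG", exon "GTCC".
toy_genome = "CCAGGTAAAGGTCC"
shifts = equivalent_junction_shifts(toy_genome, donor=3, acceptor=10)
```

In this toy locus the junction is ambiguous in both directions, so reads spanning it cannot by themselves pin down where the intron starts and ends. This is the kind of limit the abstract quantifies.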
Subjects
Genome, Human; Alternative Splicing; Exons; Humans; Introns; RNA Splice Sites; RNA Splicing; RNA, Circular
ABSTRACT
Next-generation sequencing is a cutting-edge technology, but quantifying a dynamic range of abundances for different RNA or DNA species requires increasing sampling depth to levels that can be prohibitively expensive due to physical limits on the molecular throughput of sequencers. To overcome this problem, we introduce a new general sampling theory which uses biophysical principles to functionally encode the abundance of a species before sampling, SeQUential depletIon and enriCHment (SQUICH). In theory and simulation, SQUICH enables sampling at a logarithmic rate to achieve the same precision as attained with conventional sequencing. A simple proof-of-principle experimental implementation of SQUICH in a controlled complex system of ~262,000 oligonucleotides already reduces sequencing depth by a factor of 10. SQUICH lays the groundwork for a general solution to a fundamental problem in molecular sampling and enables a new generation of efficient, precise molecular measurement at logarithmic or better sampling depth.
Subjects
High-Throughput Nucleotide Sequencing/methods; Base Sequence; Computational Biology; Computer Simulation; DNA/genetics; High-Throughput Nucleotide Sequencing/statistics & numerical data; Proof of Concept Study; RNA/genetics; Sampling Studies; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/statistics & numerical data; Sequence Analysis, RNA/methods; Sequence Analysis, RNA/statistics & numerical data; Species Specificity
ABSTRACT
ciRS-7 is an intensely studied, highly expressed and conserved circRNA. Essentially nothing is known about its biogenesis, including the location of its promoter. A prevailing assumption has been that ciRS-7 is an exceptional circRNA because it is transcribed from a locus lacking any mature linear RNA transcripts of the same sense. To study the biogenesis of ciRS-7, we developed an algorithm to define its promoter and predicted that the human ciRS-7 promoter coincides with that of the long non-coding RNA, LINC00632. We validated this prediction using multiple orthogonal experimental assays. We also used computational approaches and experimental validation to establish that ciRS-7 exonic sequence is embedded in linear transcripts that are flanked by cryptic exons in both human and mouse. Together, this experimental and computational evidence generates a new model for regulation of this locus: (a) ciRS-7 is like other circRNAs, as it is spliced into linear transcripts; (b) expression of ciRS-7 is primarily determined by the chromatin state of LINC00632 promoters; (c) transcription and splicing factors sufficient for ciRS-7 biogenesis are expressed in cells that lack detectable ciRS-7 expression. These findings have significant implications for the study of the regulation and function of ciRS-7, and the analytic framework we developed to jointly analyze RNA-seq and ChIP-seq data reveals the potential for genome-wide discovery of important biological regulation missed in current reference annotations.
Subjects
RNA/biosynthesis; RNA/genetics; Algorithms; Alternative Splicing; Animals; Brain Chemistry; Exons; Female; HEK293 Cells; Humans; Mice; Pregnancy; RNA Splicing; RNA, Circular; RNA, Long Noncoding/genetics; Sequence Analysis, RNA/methods
ABSTRACT
In 2012, a new feature of eukaryotic gene expression emerged: ubiquitous expression of circular RNA (circRNA) from genes traditionally thought to express messenger or linear noncoding (nc)RNA only. CircRNAs are covalently closed, circular RNA molecules that typically comprise exonic sequences and are spliced at canonical splice sites. This feature of gene expression was first recognized in humans and mouse, but it quickly emerged that it was common across essentially all eukaryotes studied by molecular biologists. CircRNA abundance, and even which alternatively spliced circRNA isoforms are expressed, varies by cell type and can exceed the abundance of the traditional linear mRNA or ncRNA transcript. CircRNAs are enriched in the brain and increase in abundance during fetal development. Together, these features raise fundamental questions regarding the regulation of circRNA in cis and in trans, and its function.
Subjects
RNA Splicing/genetics; RNA/genetics; Gene Expression Regulation; Humans; RNA/biosynthesis; RNA, Circular
ABSTRACT
Just a few years ago, it had been assumed that the dominant RNA isoforms produced from eukaryotic genes were variants of messenger RNA, functioning as intermediates in gene expression. In early 2012, however, a surprising discovery was made: circular RNA (circRNA) was shown to be a transcriptional product in thousands of human and mouse genes and in hundreds of cases constituted the dominant RNA isoform. Subsequent studies revealed that the expression of circRNAs is developmentally regulated, tissue and cell-type specific, and shared across the eukaryotic tree of life. These features suggest important functions for these molecules. Here, we describe major advances in the field of circRNA biology, focusing on the regulation of and functional roles played by these molecules.
Subjects
Gene Expression Regulation; RNA/genetics; Animals; Computational Biology; Humans; Models, Genetic; RNA/metabolism; RNA Splicing/genetics; RNA, Circular
ABSTRACT
Gene fusions are known to play critical roles in tumor pathogenesis. Yet, sensitive and specific algorithms to detect gene fusions in cancer do not currently exist. In this paper, we present a new statistical algorithm, MACHETE (Mismatched Alignment CHimEra Tracking Engine), which achieves highly sensitive and specific detection of gene fusions from RNA-Seq data, including the highest Positive Predictive Value (PPV) compared to the current state-of-the-art, as assessed in simulated data. We show that the best performing published algorithms either find large numbers of fusions in negative control data or suffer from low sensitivity detecting known driving fusions in gold standard settings, such as EWSR1-FLI1. As proof of principle that MACHETE discovers novel gene fusions with high accuracy in vivo, we mined public data to discover and subsequently PCR validate novel gene fusions missed by other algorithms in the ovarian cancer cell line OVCAR3. These results highlight the gains in accuracy achieved by introducing statistical models into fusion detection, and pave the way for unbiased discovery of potentially driving and druggable gene fusions in primary tumors.
Subjects
Algorithms; Gene Fusion; Biomarkers, Tumor/genetics; Cell Line, Tumor; Computer Simulation; Databases, Nucleic Acid; Female; Fusion Proteins, bcr-abl/genetics; Genes, abl; High-Throughput Nucleotide Sequencing; Humans; Neoplasms/genetics; Oncogene Fusion; Oncogene Proteins, Fusion/genetics; Ovarian Neoplasms/genetics; Sequence Alignment; Sequence Analysis, RNA
ABSTRACT
Thousands of loci in the human and mouse genomes give rise to circular RNA transcripts; at many of these loci, the predominant RNA isoform is a circle. Using an improved computational approach for circular RNA identification, we found widespread circular RNA expression in Drosophila melanogaster and estimate that in humans, circular RNA may account for 1% as many molecules as poly(A) RNA. Analysis of data from the ENCODE consortium revealed that the repertoire of genes expressing circular RNA, the ratio of circular to linear transcripts for each gene, and even the pattern of splice isoforms of circular RNAs from each gene were cell-type specific. These results suggest that biogenesis of circular RNA is an integral, conserved, and regulated feature of the gene expression program.
Subjects
Gene Expression Regulation; RNA, Messenger/genetics; RNA/genetics; Transcription, Genetic; Animals; Cell Lineage; Drosophila melanogaster/genetics; Exons/genetics; Humans; Mice; Poly A/genetics; Protein Isoforms/genetics; RNA/biosynthesis; RNA Splicing/genetics; RNA, Circular; Sequence Analysis, RNA
ABSTRACT
Viruses may play an important role in the evolution of human microbial communities. Clustered regularly interspaced short palindromic repeats (CRISPRs) provide bacteria and archaea with adaptive immunity to previously encountered viruses. Little is known about CRISPR composition in members of human microbial communities, the relative rate of CRISPR locus change, or how CRISPR loci differ between the microbiota of different individuals. We collected saliva from four periodontally healthy human subjects over an 11- to 17-month period and analyzed CRISPR sequences with corresponding streptococcal repeats in order to improve our understanding of the predominant features of oral streptococcal adaptive immune repertoires. We analyzed a total of 6,859 CRISPR-bearing reads and 427,917 bacterial 16S rRNA gene sequences. We found a core (ranging from 7% to 22%) of shared CRISPR spacers that remained stable over time within each subject, but nearly a third of CRISPR spacers varied between time points. We document high spacer diversity within each subject, suggesting constant addition of new CRISPR spacers. No more than 2% of CRISPR spacers were shared between subjects, suggesting that each individual was exposed to different virus populations. We detect changes in CRISPR spacer sequence diversity over time that may be attributable to locus diversification or to changes in streptococcal population structure, yet the composition of the populations within subjects remained relatively stable. The individual-specific and traceable character of CRISPR spacer complements could potentially open the way for expansion of the domain of personalized medicine to the oral microbiome, where lineages may be tracked as a function of health and other factors.
Subjects
Genetic Variation; Inverted Repeat Sequences/genetics; Saliva/microbiology; Streptococcus/classification; Streptococcus/genetics; DNA, Bacterial/genetics; DNA, Intergenic/genetics; Ecosystem; Humans; Phylogeny; RNA, Ribosomal, 16S/genetics; Sequence Analysis, DNA; Streptococcus/isolation & purification
ABSTRACT
BACKGROUND: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) are utilized by bacteria to resist encounters with their viruses. Human body surfaces have numerous bacteria that harbor CRISPRs, and their content can provide clues as to the types and features of viruses they may have encountered. RESULTS: We investigated the conservation of CRISPR content from streptococci on skin and saliva of human subjects over 8 weeks to determine whether similarities existed in the CRISPR spacer profiles and whether CRISPR spacers were a stable component of each biogeographic site. Most of the CRISPR sequences identified were unique, but a small proportion of spacers from the skin and saliva of each subject matched spacers derived from previously sequenced loci of S. thermophilus and other streptococci. There were significant proportions of CRISPR spacers conserved over the entire 8-week study period for all subjects, and salivary CRISPR spacers sampled in the mornings showed significantly higher levels of conservation than those sampled at any other time of day. We also found substantial similarities in the spacer repertoires of the skin and saliva of each subject. Many skin-derived spacers matched salivary viruses, supporting the idea that bacteria of the skin may encounter viruses with similar sequences to those found in the mouth. Despite the similarities between skin and salivary spacer repertoires, the variation present was distinct based on each subject and body site. CONCLUSIONS: The conservation of CRISPR spacers in the saliva and the skin of human subjects over the time period studied suggests a relative conservation of the bacteria harboring them.
Subjects
Clustered Regularly Interspaced Short Palindromic Repeats; Conserved Sequence; Saliva/microbiology; Skin/microbiology; Streptococcus/classification; Streptococcus/genetics; Carrier State/microbiology; Humans; Streptococcal Infections/microbiology; Streptococcus/isolation & purification
ABSTRACT
Every year, ovarian cancer kills approximately 14,000 women in the United States and more than 140,000 women worldwide. Most of these deaths are caused by tumors of the serous histological type, which is rarely diagnosed before it has disseminated. By deep paired-end sequencing of mRNA from serous ovarian cancers, followed by deep sequencing of the corresponding genomic region, we identified a recurrent fusion transcript. The fusion transcript joins the 5' exons of ESRRA, encoding a ligand-independent member of the nuclear-hormone receptor superfamily, to the 3' exons of C11orf20, a conserved but uncharacterized gene located immediately upstream of ESRRA in the reference genome. To estimate the prevalence of the fusion, we tested 67 cases of serous ovarian cancer by RT-PCR and sequencing and confirmed its presence in 10 of these. Targeted resequencing of the corresponding genomic region from two fusion-positive tumor samples identified a nearly clonal chromosomal rearrangement positioning ESRRA upstream of C11orf20 in one tumor, and evidence of local copy number variation in the ESRRA locus in the second tumor. We hypothesize that the recurrent novel fusion transcript may play a role in pathogenesis of a substantial fraction of serous ovarian cancers and could provide a molecular marker for detection of the cancer. Gene fusions involving adjacent or nearby genes can readily escape detection but may play important roles in the development and progression of cancer.
Subjects
Biomarkers, Tumor/genetics; Chromosomes, Human, Pair 11/genetics; Cystadenocarcinoma, Serous/genetics; Neoplasms, Glandular and Epithelial/genetics; Oncogene Proteins, Fusion/genetics; Ovarian Neoplasms/genetics; Receptors, Estrogen/genetics; Alternative Splicing; Amino Acid Sequence; Canada; Carcinoma, Ovarian Epithelial; Case-Control Studies; Chromosome Aberrations; Chromosomes, Human, Pair 11/chemistry; Cystadenocarcinoma, Serous/epidemiology; Cystadenocarcinoma, Serous/pathology; DNA Copy Number Variations; Exons; Female; Humans; Molecular Sequence Data; Neoplasm Staging; Neoplasms, Glandular and Epithelial/epidemiology; Neoplasms, Glandular and Epithelial/pathology; Ovarian Neoplasms/epidemiology; Ovarian Neoplasms/pathology; Prevalence; RNA, Messenger; Sequence Analysis, DNA; Sequence Analysis, RNA; United States; ERRalpha Estrogen-Related Receptor
ABSTRACT
SPLASH is an unsupervised, reference-free, and unifying algorithm that discovers regulated sequence variation through statistical analysis of k-mer composition, subsuming many application-specific methods. Here, we introduce SPLASH2, a fast, scalable implementation of SPLASH based on an efficient k-mer counting approach. SPLASH2 enables rapid analysis of massive datasets from a wide range of sequencing technologies and biological contexts, delivering unparalleled scale and speed. The SPLASH2 algorithm unveils new biology (without tuning) in single-cell RNA-sequencing data from human muscle cells, as well as bulk RNA-seq from the entire Cancer Cell Line Encyclopedia (CCLE), including substantial unannotated alternative splicing in cancer transcriptomes. The same untuned SPLASH2 algorithm recovers the BCR-ABL gene fusion and detects circRNA sensitively and specifically, underscoring SPLASH2's unmatched precision and scalability across diverse RNA-seq detection tasks.
ABSTRACT
We introduce SPLASH2, a fast, scalable implementation of SPLASH based on an efficient k-mer counting approach for regulated sequence variation detection in massive datasets from a wide range of sequencing technologies and biological contexts. We demonstrate biological discovery by SPLASH2 in single-cell RNA sequencing (RNA-seq) data and in bulk RNA-seq data from the Cancer Cell Line Encyclopedia, including unannotated alternative splicing in cancer transcriptomes and sensitive detection of circular RNA.