ABSTRACT
Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. However, while a global change in transcription is recognized as a defining feature of cancer, the contribution of alternative promoters remains largely unexplored. Here, we infer active promoters using RNA-seq data from 18,468 cancer and normal samples, demonstrating that alternative promoters are a major contributor to context-specific regulation of transcription. We find that promoters are deregulated across tissues, cancer types, and patients, affecting known cancer genes and novel candidates. For genes with independently regulated promoters, we demonstrate that promoter activity provides a more accurate predictor of patient survival than gene expression. Our study suggests that a dynamic landscape of active promoters shapes the cancer transcriptome, opening new diagnostic avenues and opportunities to further explore the interplay of regulatory mechanisms with transcriptional aberrations in cancer.
Subjects
Computational Biology/methods , Gene Expression Regulation, Neoplastic/genetics , Neoplasms/genetics , Promoter Regions, Genetic/genetics , Transcriptome/genetics , Databases, Genetic , Humans , RNA-Seq/methods
ABSTRACT
Intraspecific genetic incompatibilities prevent the assembly of specific alleles into single genotypes and influence genome- and species-wide patterns of sequence variation. A common incompatibility in plants is hybrid necrosis, characterized by autoimmune responses due to epistatic interactions between natural genetic variants. By systematically testing thousands of F1 hybrids of Arabidopsis thaliana strains, we identified a small number of incompatibility hot spots in the genome, often in regions densely populated by nucleotide-binding domain and leucine-rich repeat (NLR) immune receptor genes. In several cases, these immune receptor loci interact with each other, suggestive of conflict within the immune system. A particularly dangerous locus is a highly variable cluster of NLR genes, DM2, which causes multiple independent incompatibilities with genes that encode a range of biochemical functions, including NLRs. Our findings suggest that deleterious interactions of immune receptors limit the combinations of favorable disease resistance alleles accessible to plant genomes.
Subjects
Arabidopsis/genetics , Arabidopsis/immunology , Epistasis, Genetic , Amino Acid Sequence , Arabidopsis/classification , Crosses, Genetic , Genome, Plant , Hybridization, Genetic , Molecular Sequence Data , Phylogeny , Plant Physiological Phenomena , Sequence Alignment
ABSTRACT
Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ∼1% of eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ∼34% of the ∼170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in chromatin immunoprecipitation sequencing (ChIP-seq) peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.
Subjects
Arabidopsis/genetics , Nucleotide Motifs , Sequence Analysis, DNA , Transcription Factors/metabolism , Arabidopsis/metabolism , Chromatin Immunoprecipitation , Humans , Polymorphism, Single Nucleotide , Promoter Regions, Genetic , Protein Binding , Quantitative Trait Loci
ABSTRACT
Interferon-γ (IFN-γ) primes macrophages for enhanced microbial killing and inflammatory activation by Toll-like receptors (TLRs), but little is known about the regulation of cell metabolism or mRNA translation during this priming. We found that IFN-γ regulated the metabolism and mRNA translation of human macrophages by targeting the kinases mTORC1 and MNK, both of which converge on the selective regulator of translation initiation eIF4E. Physiological downregulation of mTORC1 by IFN-γ was associated with autophagy and translational suppression of repressors of inflammation such as HES1. Genome-wide ribosome profiling in TLR2-stimulated macrophages showed that IFN-γ selectively modulated the macrophage translatome to promote inflammation, further reprogram metabolic pathways and modulate protein synthesis. These results show that IFN-γ-mediated metabolic reprogramming and translational regulation are key components of classical inflammatory macrophage activation.
Subjects
Interferon-gamma/immunology , Macrophage Activation/immunology , Macrophages/immunology , Protein Biosynthesis/immunology , RNA, Messenger/immunology , Base Sequence , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/immunology , Basic Helix-Loop-Helix Transcription Factors/metabolism , Blotting, Western , Cells, Cultured , Eukaryotic Initiation Factor-4E/genetics , Eukaryotic Initiation Factor-4E/immunology , Eukaryotic Initiation Factor-4E/metabolism , Gene Expression Profiling , Homeodomain Proteins/genetics , Homeodomain Proteins/immunology , Homeodomain Proteins/metabolism , Humans , Interferon-gamma/pharmacology , Intracellular Signaling Peptides and Proteins/genetics , Intracellular Signaling Peptides and Proteins/immunology , Intracellular Signaling Peptides and Proteins/metabolism , Macrophage Activation/drug effects , Macrophage Activation/genetics , Macrophages/drug effects , Macrophages/metabolism , Mechanistic Target of Rapamycin Complex 1 , MicroRNAs/genetics , Microscopy, Fluorescence , Multiprotein Complexes/genetics , Multiprotein Complexes/immunology , Multiprotein Complexes/metabolism , Protein Biosynthesis/drug effects , Protein Biosynthesis/genetics , Protein Serine-Threonine Kinases/genetics , Protein Serine-Threonine Kinases/immunology , Protein Serine-Threonine Kinases/metabolism , RNA Interference , RNA, Messenger/genetics , Reverse Transcriptase Polymerase Chain Reaction , Signal Transduction/drug effects , Signal Transduction/genetics , Signal Transduction/immunology , TOR Serine-Threonine Kinases/genetics , TOR Serine-Threonine Kinases/immunology , TOR Serine-Threonine Kinases/metabolism , Toll-Like Receptor 2/genetics , Toll-Like Receptor 2/immunology , Toll-Like Receptor 2/metabolism , Transcription Factor HES-1
ABSTRACT
Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text]. For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.
Subjects
Algorithms , Computational Biology , Computational Biology/methods , Sequence Alignment , Sequence Analysis, DNA/methods
ABSTRACT
Understanding and predicting molecular responses in single cells upon chemical, genetic or mechanical perturbations is a core question in biology. Obtaining single-cell measurements typically requires the cells to be destroyed. This makes learning heterogeneous perturbation responses challenging as we only observe unpaired distributions of perturbed or non-perturbed cells. Here we leverage the theory of optimal transport and the recent advent of input convex neural architectures to present CellOT, a framework for learning the response of individual cells to a given perturbation by mapping these unpaired distributions. CellOT outperforms current methods at predicting single-cell drug responses, as profiled by scRNA-seq and a multiplexed protein-imaging technology. Further, we illustrate that CellOT generalizes well on unseen settings by (1) predicting the scRNA-seq responses of holdout patients with lupus exposed to interferon-β and patients with glioblastoma to panobinostat; (2) inferring lipopolysaccharide responses across different species; and (3) modeling the hematopoietic developmental trajectories of different subpopulations.
Subjects
Gene Expression Profiling , Single-Cell Analysis , Humans , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods
ABSTRACT
Transcript alterations often result from somatic changes in cancer genomes [1]. Various forms of RNA alterations have been described in cancer, including overexpression [2], altered splicing [3] and gene fusions [4]; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) [5]. Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed 'bridged' fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.
Subjects
Gene Expression Regulation, Neoplastic , Neoplasms/genetics , RNA/genetics , DNA Copy Number Variations , DNA, Neoplasm , Genome, Human , Genomics , Humans , Transcriptome
ABSTRACT
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
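The core idea of attaching a quantitative attribute to each node-label relation can be illustrated with a toy, uncompressed sketch. This is not the compressed representation described in the abstract: the function names (`build_counting_index`, `query_abundance`) and the plain dictionary layout are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer (graph node) of a sequence, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_counting_index(samples, k):
    """Map each k-mer to {sample label: count}.

    `samples` maps a label to a list of reads. A plain annotated graph
    would only record which labels contain a k-mer; here every
    node-label relation also carries a count attribute.
    """
    index = defaultdict(dict)
    for label, reads in samples.items():
        counts = Counter(km for read in reads for km in kmers(read, k))
        for km, count in counts.items():
            index[km][label] = count
    return dict(index)


def query_abundance(index, seq, k):
    """Return, per label, the counts along the k-mers of `seq` (0 if absent)."""
    labels = sorted({label for attrs in index.values() for label in attrs})
    path = list(kmers(seq, k))
    return {
        label: [index.get(km, {}).get(label, 0) for km in path]
        for label in labels
    }


# Hypothetical toy input: two samples with a handful of short reads.
samples = {"s1": ["ACGTAC", "CGTACG"], "s2": ["ACGTTT"]}
index = build_counting_index(samples, k=3)
profile = query_abundance(index, "ACGTA", k=3)
print(profile)
```

Querying a sequence returns one abundance profile per sample, which is the kind of quantitative lookup that a binary (presence/absence) annotation cannot answer.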
ABSTRACT
MOTIVATION: The Oxford Nanopore Technologies (ONT) ReadUntil API enables selective sequencing, which aims to favor interesting over uninteresting reads, e.g. to deplete or enrich certain genomic regions. The performance gain depends on the selective sequencing decision-making algorithm (SSDA), which decides whether to reject a read, stop receiving a read, or wait for more data. Since real runs are time-consuming and costly, simulating the ONT sequencer with support for the ReadUntil API is highly beneficial for comparing and optimizing new SSDAs. Existing software like MinKNOW and UNCALLED only return raw signal data, are memory-intensive, require huge and often unavailable multi-fast5 files (≥100 GB) and are not clearly documented. RESULTS: We present the ONT device simulator SimReadUntil, which takes a set of full reads as input, distributes them to channels and plays them back in real time, including mux scans, channel gaps and blockages, and allows reads to be rejected or data reception from them to be stopped. Our modified ReadUntil API provides the basecalled reads rather than the raw signal, reducing computational load and focusing on the SSDA rather than on basecalling. Tuning the parameters of tools like ReadFish and ReadBouncer becomes easier because a GPU for basecalling is no longer required. We offer various methods to extract simulation parameters from a sequencing summary file and adapt ReadFish to replicate one of their enrichment experiments. SimReadUntil's gRPC interface allows standardized interaction with a wide range of programming languages. AVAILABILITY AND IMPLEMENTATION: Code and fully worked examples are available on GitHub (https://github.com/ratschlab/sim_read_until).
Subjects
Algorithms , Benchmarking , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing/methods
ABSTRACT
MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Subjects
Algorithms , Sequence Alignment , Sequence Alignment/methods , Software , Computational Biology/methods , Sequence Analysis, DNA/methods , Databases, Genetic
ABSTRACT
MOTIVATION: Acute kidney injury (AKI) is a syndrome that affects a large fraction of all critically ill patients, and early diagnosis, which is needed for adequate treatment, is as imperative as it is challenging. Consequently, machine learning approaches have been developed to predict AKI ahead of time. However, the prevalence of AKI is often underestimated in state-of-the-art approaches, as they rely on an AKI event annotation solely based on creatinine, ignoring urine output. We construct and evaluate early warning systems for AKI in a multi-disciplinary ICU setting, using the complete KDIGO definition of AKI. We propose several variants of gradient-boosted decision tree (GBDT)-based models, including a novel time-stacking based approach. A state-of-the-art LSTM-based model previously proposed for AKI prediction, which has not yet been specifically evaluated in ICU settings, is used as a comparison. RESULTS: We find that optimal performance is achieved by using GBDT with the time-based stacking technique (AUPRC = 65.7%, compared with the LSTM-based model's AUPRC = 62.6%), which is motivated by the high relevance of time since ICU admission for this task. Both models show mildly reduced performance in the limited training data setting, perform fairly across different subcohorts, and exhibit no issues in gender transfer. Following the official KDIGO definition substantially increases the number of annotated AKI events. In our study, GBDTs outperform LSTM models for AKI prediction. Generally, we find that both model types are robust in a variety of challenging settings arising for ICU data. AVAILABILITY AND IMPLEMENTATION: The code to reproduce the findings of our manuscript can be found at: https://github.com/ratschlab/AKI-EWS.
Subjects
Acute Kidney Injury , Intensive Care Units , Humans , Machine Learning , Male , Female , Decision Trees , Aged , Middle Aged
ABSTRACT
MOTIVATION: Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts via the integration of the information each modality contributes. To perform this integration, however, the development of novel analytical strategies is needed. Multimodal profiling strategies often come at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Thus, factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology; however, they typically do not yield representations that are directly interpretable, whereas many research questions center on the analysis of pathways associated with specific observations. RESULTS: We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. PathFA combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyper-parameter free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell types as well as tumor heterogeneity. Our results show that PathFA captures known biology, making it well suited for analyzing multimodal sample cohorts. AVAILABILITY AND IMPLEMENTATION: The tool is implemented in Python and available at https://github.com/ratschlab/path-fa.
Subjects
Bayes Theorem , Humans , Proteomics/methods , Factor Analysis, Statistical , Gene Expression Profiling/methods , Melanoma/metabolism , Algorithms , Computational Biology/methods
ABSTRACT
The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease-of-use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.
Subjects
Deep Learning , Metagenome , Metagenome/genetics , Genomics/methods , Sequence Analysis, DNA/methods , Metagenomics , Software
ABSTRACT
The activation of memory T cells is a very rapid and concerted cellular response that requires coordination between cellular processes in different compartments and on different time scales. In this study, we use ribosome profiling and deep RNA sequencing to define the acute mRNA translation changes in CD8 memory T cells following initial activation events. We find that initial translation enables subsequent events of human and mouse T cell activation and expansion. Briefly, early events in the activation of Ag-experienced CD8 T cells are insensitive to transcriptional blockade with actinomycin D, and instead depend on the translation of pre-existing mRNAs and are blocked by cycloheximide. Ribosome profiling identifies ∼92 mRNAs that are recruited into ribosomes following CD8 T cell stimulation. These mRNAs typically have structured GC and pyrimidine-rich 5' untranslated regions and they encode key regulators of T cell activation and proliferation such as Notch1, Ifngr1, Il2rb, and serine metabolism enzymes Psat1 and Shmt2 (serine hydroxymethyltransferase 2), as well as translation factors eEF1a1 (eukaryotic elongation factor α1) and eEF2 (eukaryotic elongation factor 2). The increased production of receptors of IL-2 and IFN-γ precedes the activation of gene expression and augments cellular signals and T cell activation. Taken together, we identify an early RNA translation program that acts in a feed-forward manner to enable the rapid and dramatic process of CD8 memory T cell expansion and activation.
Subjects
Glycine Hydroxymethyltransferase , Interleukin-2 , 5' Untranslated Regions , Animals , CD8-Positive T-Lymphocytes , Cycloheximide/metabolism , Dactinomycin/metabolism , Glycine Hydroxymethyltransferase/genetics , Glycine Hydroxymethyltransferase/metabolism , Humans , Immunologic Memory , Interleukin-2/metabolism , Lymphocyte Activation , Memory T Cells , Mice , Peptide Elongation Factor 2/genetics , Peptide Elongation Factor 2/metabolism , Peptide Elongation Factors/genetics , Pyrimidines/metabolism , RNA, Messenger/genetics , Serine/genetics
ABSTRACT
Precision oncology relies on the accurate identification of somatic mutations in cancer patients. While the sequencing of the tumoral tissue is frequently part of routine clinical care, the healthy counterparts are rarely sequenced. We previously published PipeIT, a somatic variant calling workflow specific for Ion Torrent sequencing data enclosed in a Singularity container. PipeIT combines user-friendly execution, reproducibility and reliable mutation identification, but relies on matched germline sequencing data to exclude germline variants. Expanding on the original PipeIT, here we describe PipeIT2 to address the clinical need to define somatic mutations in the absence of germline control. We show that PipeIT2 achieves >95% recall for variants with variant allele fraction >10%, reliably detects driver and actionable mutations and filters out most of the germline mutations and sequencing artifacts. With its performance, reproducibility, and ease of execution, PipeIT2 is a valuable addition to molecular diagnostics laboratories.
Subjects
Neoplasms , Humans , Neoplasms/diagnosis , Neoplasms/genetics , Pathology, Molecular , Workflow , Reproducibility of Results , Precision Medicine , Mutation , High-Throughput Nucleotide Sequencing
ABSTRACT
MOTIVATION: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing. RESULTS: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of ≈2000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03×, achieving an Adjusted Rand Index (ARI) score of ≈0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of ≈0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDO's performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7.
Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants. AVAILABILITY AND IMPLEMENTATION: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
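For readers unfamiliar with the Adjusted Rand Index used above to score clusterings, the following is a minimal sketch of the standard pair-counting formula. It is not taken from the SECEDO code base; the function name and the toy label vectors are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two clusterings of the same items.

    1.0 means identical partitions (up to renaming of clusters), values
    near 0 mean agreement no better than chance, and negative values
    mean agreement worse than chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Pairs of items that are co-clustered in both, in truth, in prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = in_true * in_pred / total_pairs  # chance-level agreement
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both put everything in one cluster
    return (both - expected) / (maximum - expected)


# Hypothetical cell-to-subclone assignments for four cells.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

Because the score is corrected for chance, an ARI of about 0 indicates a clustering that carries essentially no information about the true subclones, while a value around 0.6 reflects substantial but imperfect agreement.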
Subjects
High-Throughput Nucleotide Sequencing , Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Bayes Theorem , Sequence Analysis, DNA , Genome , Base Sequence , Neoplasms/genetics , Polymorphism, Single Nucleotide
ABSTRACT
MOTIVATION: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. RESULTS: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. AVAILABILITY AND IMPLEMENTATION: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.
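The idea of exploiting similar annotations on adjacent vertices can be sketched on a toy graph. The code below is a simplified illustration under stated assumptions, not the MetaGraph implementation: each node has one designated successor, a chosen set of anchor nodes stores its label set in full, and every other node stores only the symmetric difference to its successor's label set. The names `rowdiff_encode`, `rowdiff_decode`, `successor`, and `anchors` are hypothetical.

```python
def rowdiff_encode(rows, successor, anchors):
    """Replace each non-anchor row by its difference to the successor's row.

    rows:      {node: set of labels}
    successor: {node: designated next node} for every non-anchor node
    anchors:   nodes whose rows are stored unchanged
    """
    encoded = {}
    for node, labels in rows.items():
        if node in anchors:
            encoded[node] = set(labels)
        else:
            encoded[node] = set(labels) ^ set(rows[successor[node]])
    return encoded


def rowdiff_decode(node, encoded, successor, anchors):
    """Rebuild a row by XOR-ing stored diffs along the path to an anchor.

    Assumes every successor path reaches an anchor; otherwise the
    walk would not terminate.
    """
    labels = set()
    while True:
        labels ^= encoded[node]
        if node in anchors:
            return labels
        node = successor[node]


# Hypothetical path a -> b -> c -> d with d as the only anchor.
rows = {
    "a": {"s1", "s2"},
    "b": {"s1", "s2"},
    "c": {"s1", "s2", "s3"},
    "d": {"s1", "s2", "s3"},
}
successor = {"a": "b", "b": "c", "c": "d"}
anchors = {"d"}

encoded = rowdiff_encode(rows, successor, anchors)
stored_entries = sum(len(diff) for diff in encoded.values())
original_entries = sum(len(labels) for labels in rows.values())
print(stored_entries, original_entries)
```

In this toy case the encoded annotation holds far fewer entries than the original rows because neighbouring nodes mostly share labels; the sparser matrix can then be handed to a generic compressed binary matrix scheme, as the abstract describes.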
Subjects
Algorithms , Biomedical Research , Software
ABSTRACT
Tumors contain multiple subpopulations of genetically distinct cancer cells. Reconstructing their evolutionary history can improve our understanding of how cancers develop and respond to treatment. Subclonal reconstruction methods cluster mutations into groups that co-occur within the same subpopulations, estimate the frequency of cells belonging to each subpopulation, and infer the ancestral relationships among the subpopulations by constructing a clone tree. However, multiple clone trees are often consistent with the data, and current methods neither capture this uncertainty efficiently nor scale to clone trees with a large number of subclonal populations. Here, we formalize the notion of a partially defined clone tree (partial clone tree for short) that defines a subset of the pairwise ancestral relationships in a clone tree, thereby implicitly representing the set of all clone trees that have these defined pairwise relationships. Also, we introduce a special partial clone tree, the Maximally-Constrained Ancestral Reconstruction (MAR), which summarizes all clone trees fitting the input data equally well. Finally, we extend commonly used clone tree validity conditions to apply to partial clone trees and describe SubMARine, a polynomial-time algorithm producing the subMAR, which approximates the MAR and guarantees that its defined relationships are a subset of those present in the MAR. We also extend SubMARine to work with subclonal copy number aberrations and define equivalence constraints for this purpose. Further, we extend SubMARine to permit noise in the estimates of the subclonal frequencies while retaining its validity conditions and guarantees. In contrast to other clone tree reconstruction methods, SubMARine runs in time and space that scale polynomially in the number of subclones.
We show through extensive noise-free simulations, a large lung cancer dataset and a prostate cancer dataset that the subMAR equals the MAR in all cases where only a single clone tree exists and that it is a perfect match to the MAR in most of the other cases. Notably, SubMARine runs in less than 70 seconds on a single thread with less than 1 GB of memory on all datasets presented in this paper, including ones with 50 nodes in a clone tree. On the real-world data, SubMARine almost perfectly recovers the previously reported trees and identifies minor errors made in the expert-driven reconstructions of those trees. The freely available open-source code implementing SubMARine can be downloaded at https://github.com/morrislab/submarine.
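The notion of a partial clone tree as a set of defined pairwise ancestral relationships can be made concrete with a small sketch. This is not SubMARine itself; it only assumes that a partial clone tree can be written as a matrix whose entries are 1 (ancestor), 0 (not an ancestor) or -1 (left undefined), and that a full clone tree is given as a parent list with subclone 0 as the root. The encoding and the names `ancestry_matrix`, `is_consistent` and `UNDEFINED` are assumptions made for illustration.

```python
from typing import List, Optional

UNDEFINED = -1  # relationship left open in the partial clone tree (assumed encoding)


def ancestry_matrix(parent: List[Optional[int]]) -> List[List[int]]:
    """Return z where z[i][j] == 1 if subclone i is a proper ancestor of j."""
    n = len(parent)
    z = [[0] * n for _ in range(n)]
    for j in range(n):
        a = parent[j]
        while a is not None:  # walk from j up to the root
            z[a][j] = 1
            a = parent[a]
    return z


def is_consistent(parent: List[Optional[int]],
                  partial: List[List[int]]) -> bool:
    """Check that a full clone tree agrees with every defined relationship."""
    z = ancestry_matrix(parent)
    n = len(parent)
    return all(partial[i][j] in (UNDEFINED, z[i][j])
               for i in range(n) for j in range(n))


U = UNDEFINED
# Four subclones; only the relationship between 1 and 3 is left undefined.
partial = [
    [0, 1, 1, 1],  # root 0 is an ancestor of every other subclone
    [0, 0, 1, U],  # 1 is an ancestor of 2; whether it is one of 3 is open
    [0, 0, 0, 0],  # 2 has no descendants
    [0, 0, 0, 0],  # 3 has no descendants
]

tree_a = [None, 0, 1, 0]  # 3 is a child of the root
tree_b = [None, 0, 1, 1]  # 3 is a child of 1
tree_c = [None, 0, 1, 2]  # 3 is a child of 2, contradicting partial[2][3] == 0

results = [is_consistent(t, partial) for t in (tree_a, tree_b, tree_c)]
```

Here the single undefined entry means the partial clone tree stands for two full clone trees (`tree_a` and `tree_b`), while `tree_c` is excluded by a defined relationship.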
Subjects
Algorithms , Computational Biology/methods , Mutation/genetics , Neoplasms , Computer Simulation , Evolution, Molecular , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/classification , Neoplasms/genetics , Whole Genome Sequencing
ABSTRACT
Intra-tumor hypoxia is a common feature in many solid cancers. Although transcriptional targets of hypoxia-inducible factors (HIFs) have been well characterized, the alternative splicing or processing of pre-mRNA transcripts that occurs during hypoxia and subsequent HIF stabilization is much less well understood. Here, we identify many HIF-dependent alternative splicing events after whole transcriptome sequencing in pancreatic cancer cells exposed to hypoxia with and without downregulation of the aryl hydrocarbon receptor nuclear translocator (ARNT), a protein required for HIFs to form a transcriptionally active dimer. We correlate the discovered hypoxia-driven events with available sequencing data from pan-cancer TCGA patient cohorts to select a narrow set of putative biologically relevant splice events for experimental validation. We validate a small set of candidate HIF-dependent alternative splicing events in multiple human gastrointestinal cancer cell lines as well as patient-derived human pancreatic cancer organoids. Lastly, we report the discovery of a HIF-dependent mechanism to produce a hypoxia-dependent, long and coding isoform of the UDP-N-acetylglucosamine transporter SLC35A3.