Search | Nursing VHL Search Portal

1.

A Pan-cancer Transcriptome Analysis Reveals Pervasive Regulation through Alternative Promoters.

Demircioglu, Deniz; Cukuroglu, Engin; Kindermans, Martin; Nandi, Tannistha; Calabrese, Claudia; Fonseca, Nuno A; Kahles, André; Lehmann, Kjong-Van; Stegle, Oliver; Brazma, Alvis; Brooks, Angela N; Rätsch, Gunnar; Tan, Patrick; Göke, Jonathan.

Cell ; 178(6): 1465-1477.e17, 2019 09 05.

Article in English | MEDLINE | ID: mdl-31491388

ABSTRACT

Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. However, while a global change in transcription is recognized as a defining feature of cancer, the contribution of alternative promoters still remains largely unexplored. Here, we infer active promoters using RNA-seq data from 18,468 cancer and normal samples, demonstrating that alternative promoters are a major contributor to context-specific regulation of transcription. We find that promoters are deregulated across tissues, cancer types, and patients, affecting known cancer genes and novel candidates. For genes with independently regulated promoters, we demonstrate that promoter activity provides a more accurate predictor of patient survival than gene expression. Our study suggests that a dynamic landscape of active promoters shapes the cancer transcriptome, opening new diagnostic avenues and opportunities to further explore the interplay of regulatory mechanisms with transcriptional aberrations in cancer.

Subject(s)

Computational Biology/methods , Gene Expression Regulation, Neoplastic/genetics , Neoplasms/genetics , Promoter Regions, Genetic/genetics , Transcriptome/genetics , Databases, Genetic , Humans , RNA-Seq/methods

2.

Biosynthetic potential of the global ocean microbiome.

Paoli, Lucas; Ruscheweyh, Hans-Joachim; Forneris, Clarissa C; Hubrich, Florian; Kautsar, Satria; Bhushan, Agneya; Lotti, Alessandro; Clayssen, Quentin; Salazar, Guillem; Milanese, Alessio; Carlström, Charlotte I; Papadopoulou, Chrysa; Gehrig, Daniel; Karasikov, Mikhail; Mustafa, Harun; Larralde, Martin; Carroll, Laura M; Sánchez, Pablo; Zayed, Ahmed A; Cronin, Dylan R; Acinas, Silvia G; Bork, Peer; Bowler, Chris; Delmont, Tom O; Gasol, Josep M; Gossert, Alvar D; Kahles, André; Sullivan, Matthew B; Wincker, Patrick; Zeller, Georg; Robinson, Serina L; Piel, Jörn; Sunagawa, Shinichi.

Nature ; 607(7917): 111-118, 2022 07.

Article in English | MEDLINE | ID: mdl-35732736

ABSTRACT

Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups [1], this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds [2,3]. However, studying this diversity to identify genomic pathways for the synthesis of such compounds [4] and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters ('Candidatus Eudoremicrobiaceae') that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.

Subject(s)

Biosynthetic Pathways , Microbiota , Oceans and Seas , Bacteria/classification , Bacteria/genetics , Biosynthetic Pathways/genetics , Genomics , Microbiota/genetics , Multigene Family/genetics , Phylogeny

3.

Aligning distant sequences to graphs using long seed sketches.

Joudaki, Amir; Meterez, Alexandru; Mustafa, Harun; Groot Koerkamp, Ragnar; Kahles, André; Rätsch, Gunnar.

Genome Res ; 33(7): 1208-1217, 2023 07.

Article in English | MEDLINE | ID: mdl-37072187

ABSTRACT

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text]. For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.

Subject(s)

Algorithms , Computational Biology , Computational Biology/methods , Sequence Alignment , Sequence Analysis, DNA/methods

4.

Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.

Rheinbay, Esther; Nielsen, Morten Muhlig; Abascal, Federico; Wala, Jeremiah A; Shapira, Ofer; Tiao, Grace; Hornshøj, Henrik; Hess, Julian M; Juul, Randi Istrup; Lin, Ziao; Feuerbach, Lars; Sabarinathan, Radhakrishnan; Madsen, Tobias; Kim, Jaegil; Mularoni, Loris; Shuai, Shimin; Lanzós, Andrés; Herrmann, Carl; Maruvka, Yosef E; Shen, Ciyue; Amin, Samirkumar B; Bandopadhayay, Pratiti; Bertl, Johanna; Boroevich, Keith A; Busanovich, John; Carlevaro-Fita, Joana; Chakravarty, Dimple; Chan, Calvin Wing Yiu; Craft, David; Dhingra, Priyanka; Diamanti, Klev; Fonseca, Nuno A; Gonzalez-Perez, Abel; Guo, Qianyun; Hamilton, Mark P; Haradhvala, Nicholas J; Hong, Chen; Isaev, Keren; Johnson, Todd A; Juul, Malene; Kahles, Andre; Kahraman, Abdullah; Kim, Youngwook; Komorowski, Jan; Kumar, Kiran; Kumar, Sushant; Lee, Donghoon; Lehmann, Kjong-Van; Li, Yilong; Liu, Eric Minwei.

Nature ; 578(7793): 102-111, 2020 02.

Article in English | MEDLINE | ID: mdl-32025015

ABSTRACT

The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.

Subject(s)

Genome, Human/genetics , Mutation/genetics , Neoplasms/genetics , DNA Breaks , Databases, Genetic , Gene Expression Regulation, Neoplastic , Genome-Wide Association Study , Humans , INDEL Mutation

5.

Genomic basis for RNA alterations in cancer.

Calabrese, Claudia; Davidson, Natalie R; Demircioglu, Deniz; Fonseca, Nuno A; He, Yao; Kahles, André; Lehmann, Kjong-Van; Liu, Fenglin; Shiraishi, Yuichi; Soulette, Cameron M; Urban, Lara; Greger, Liliana; Li, Siliang; Liu, Dongbing; Perry, Marc D; Xiang, Qian; Zhang, Fan; Zhang, Junjun; Bailey, Peter; Erkek, Serap; Hoadley, Katherine A; Hou, Yong; Huska, Matthew R; Kilpinen, Helena; Korbel, Jan O; Marin, Maximillian G; Markowski, Julia; Nandi, Tannistha; Pan-Hammarström, Qiang; Pedamallu, Chandra Sekhar; Siebert, Reiner; Stark, Stefan G; Su, Hong; Tan, Patrick; Waszak, Sebastian M; Yung, Christina; Zhu, Shida; Awadalla, Philip; Creighton, Chad J; Meyerson, Matthew; Ouellette, B F Francis; Wu, Kui; Yang, Huanming; Brazma, Alvis; Brooks, Angela N; Göke, Jonathan; Rätsch, Gunnar; Schwarz, Roland F; Stegle, Oliver; Zhang, Zemin.

Nature ; 578(7793): 129-136, 2020 02.

Article in English | MEDLINE | ID: mdl-32025019

ABSTRACT

Transcript alterations often result from somatic changes in cancer genomes [1]. Various forms of RNA alterations have been described in cancer, including overexpression [2], altered splicing [3] and gene fusions [4]; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) [5]. Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed 'bridged' fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.

Subject(s)

Gene Expression Regulation, Neoplastic , Neoplasms/genetics , RNA/genetics , DNA Copy Number Variations , DNA, Neoplasm , Genome, Human , Genomics , Humans , Transcriptome

6.

Lossless indexing with counting de Bruijn graphs.

Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André.

Genome Res ; 2022 May 24.

Article in English | MEDLINE | ID: mdl-35609994

ABSTRACT

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
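
Illustrative sketch (not from the paper): the abstract describes attaching an attribute such as a k-mer count to each node-label relation. The toy index below shows only that data model, as an uncompressed dictionary. The function names (kmers, build_counting_index, abundance) and the sample data are hypothetical, and none of the compression that the paper is actually about is attempted here.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every substring of length k of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_counting_index(samples, k):
    """Map each k-mer (node) to a Counter of {sample label: count}.

    samples: dict mapping a label to a list of read strings.
    This is an uncompressed toy stand-in for a node-label-count relation.
    """
    index = defaultdict(Counter)
    for label, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer][label] += 1
    return index


def abundance(index, kmer, label):
    """Return how often kmer was seen in the sample named label (0 if never)."""
    return index.get(kmer, {}).get(label, 0)


if __name__ == "__main__":
    toy = {"sampleA": ["ACGTACG"], "sampleB": ["ACGT"]}
    idx = build_counting_index(toy, k=3)
    print(abundance(idx, "ACG", "sampleA"), abundance(idx, "ACG", "sampleB"))
```

A plain dictionary like this grows with the number of distinct (k-mer, label) pairs, which is the cost that compressed representations aim to reduce.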

7.

SimReadUntil for benchmarking selective sequencing algorithms on ONT devices.

Mordig, Maximilian; Rätsch, Gunnar; Kahles, André.

Bioinformatics ; 40(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38603597

ABSTRACT

MOTIVATION: The Oxford Nanopore Technologies (ONT) ReadUntil API enables selective sequencing, which aims to selectively favor interesting over uninteresting reads, e.g. to deplete or enrich certain genomic regions. The performance gain depends on the selective sequencing decision-making algorithm (SSDA) which decides whether to reject a read, stop receiving a read, or wait for more data. Since real runs are time-consuming and costly, simulating the ONT sequencer with support for the ReadUntil API is highly beneficial for comparing and optimizing new SSDAs. Existing software like MinKNOW and UNCALLED only return raw signal data, are memory-intensive, require huge and often unavailable multi-fast5 files (≥100 GB) and are not clearly documented. RESULTS: We present the ONT device simulator SimReadUntil that takes a set of full reads as input, distributes them to channels and plays them back in real time including mux scans, channel gaps and blockages, and allows rejecting reads as well as stopping data reception from them. Our modified ReadUntil API provides the basecalled reads rather than the raw signal, reducing computational load and focusing on the SSDA rather than on basecalling. Tuning the parameters of tools like ReadFish and ReadBouncer becomes easier because a GPU for basecalling is no longer required. We offer various methods to extract simulation parameters from a sequencing summary file and adapt ReadFish to replicate one of their enrichment experiments. SimReadUntil's gRPC interface allows standardized interaction with a wide range of programming languages. AVAILABILITY AND IMPLEMENTATION: Code and fully worked examples are available on GitHub (https://github.com/ratschlab/sim_read_until).

Subject(s)

Algorithms , Benchmarking , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing/methods

8.

Label-guided seed-chain-extend alignment on annotated De Bruijn graphs.

Mustafa, Harun; Karasikov, Mikhail; Mansouri Ghiasi, Nika; Rätsch, Gunnar; Kahles, André.

Bioinformatics ; 40(Supplement_1): i337-i346, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940164

ABSTRACT

MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.

Subject(s)

Algorithms , Sequence Alignment , Sequence Alignment/methods , Software , Computational Biology/methods , Sequence Analysis, DNA/methods , Databases, Genetic

9.

Probabilistic pathway-based multimodal factor analysis.

Immer, Alexander; Stark, Stefan G; Jacob, Francis; Bonilla, Ximena; Thomas, Tinu; Kahles, André; Goetze, Sandra; Milani, Emanuela S; Wollscheid, Bernd; Rätsch, Gunnar; Lehmann, Kjong-Van.

Bioinformatics ; 40(Supplement_1): i189-i198, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940152

ABSTRACT

MOTIVATION: Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts via the integration of the information each modality contributes. To perform this integration, however, the development of novel analytical strategies is needed. Multimodal profiling strategies often come at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Thus, factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology; however, they typically do not yield representations that are directly interpretable, whereas many research questions center around the analysis of pathways associated with specific observations. RESULTS: We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. PathFA combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyper-parameter free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell-types as well as tumor heterogeneity. Our results show that we capture known biology, making it well suited for analyzing multimodal sample cohorts. AVAILABILITY AND IMPLEMENTATION: The tool is implemented in Python and available at https://github.com/ratschlab/path-fa.

Subject(s)

Bayes Theorem , Humans , Proteomics/methods , Factor Analysis, Statistical , Gene Expression Profiling/methods , Melanoma/metabolism , Algorithms , Computational Biology/methods

10.

Author Correction: Genomic basis for RNA alterations in cancer.

Calabrese, Claudia; Davidson, Natalie R; Demircioglu, Deniz; Fonseca, Nuno A; He, Yao; Kahles, André; Lehmann, Kjong-Van; Liu, Fenglin; Shiraishi, Yuichi; Soulette, Cameron M; Urban, Lara; Greger, Liliana; Li, Siliang; Liu, Dongbing; Perry, Marc D; Xiang, Qian; Zhang, Fan; Zhang, Junjun; Bailey, Peter; Erkek, Serap; Hoadley, Katherine A; Hou, Yong; Huska, Matthew R; Kilpinen, Helena; Korbel, Jan O; Marin, Maximillian G; Markowski, Julia; Nandi, Tannistha; Pan-Hammarström, Qiang; Pedamallu, Chandra Sekhar; Siebert, Reiner; Stark, Stefan G; Su, Hong; Tan, Patrick; Waszak, Sebastian M; Yung, Christina; Zhu, Shida; Awadalla, Philip; Creighton, Chad J; Meyerson, Matthew; Ouellette, B F Francis; Wu, Kui; Yang, Huanming; Brazma, Alvis; Brooks, Angela N; Göke, Jonathan; Rätsch, Gunnar; Schwarz, Roland F; Stegle, Oliver; Zhang, Zemin.

Nature ; 614(7948): E37, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36697831

11.

Author Correction: Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.

Rheinbay, Esther; Nielsen, Morten Muhlig; Abascal, Federico; Wala, Jeremiah A; Shapira, Ofer; Tiao, Grace; Hornshøj, Henrik; Hess, Julian M; Juul, Randi Istrup; Lin, Ziao; Feuerbach, Lars; Sabarinathan, Radhakrishnan; Madsen, Tobias; Kim, Jaegil; Mularoni, Loris; Shuai, Shimin; Lanzós, Andrés; Herrmann, Carl; Maruvka, Yosef E; Shen, Ciyue; Amin, Samirkumar B; Bandopadhayay, Pratiti; Bertl, Johanna; Boroevich, Keith A; Busanovich, John; Carlevaro-Fita, Joana; Chakravarty, Dimple; Chan, Calvin Wing Yiu; Craft, David; Dhingra, Priyanka; Diamanti, Klev; Fonseca, Nuno A; Gonzalez-Perez, Abel; Guo, Qianyun; Hamilton, Mark P; Haradhvala, Nicholas J; Hong, Chen; Isaev, Keren; Johnson, Todd A; Juul, Malene; Kahles, Andre; Kahraman, Abdullah; Kim, Youngwook; Komorowski, Jan; Kumar, Kiran; Kumar, Sushant; Lee, Donghoon; Lehmann, Kjong-Van; Li, Yilong; Liu, Eric Minwei.

Nature ; 614(7948): E40, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36697832

12.

SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing.

Rozhonová, Hana; Danciu, Daniel; Stark, Stefan; Rätsch, Gunnar; Kahles, André; Lehmann, Kjong-Van.

Bioinformatics ; 38(18): 4293-4300, 2022 09 15.

Article in English | MEDLINE | ID: mdl-35900151

ABSTRACT

MOTIVATION: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing. RESULTS: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of ≈2000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03×, achieving an Adjusted Rand Index (ARI) score of ≈0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of ≈0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDO's performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7. Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants. AVAILABILITY AND IMPLEMENTATION: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

High-Throughput Nucleotide Sequencing , Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Bayes Theorem , Sequence Analysis, DNA , Genome , Base Sequence , Neoplasms/genetics , Polymorphism, Single Nucleotide

13.

Topology-based sparsification of graph annotations.

Danciu, Daniel; Karasikov, Mikhail; Mustafa, Harun; Kahles, André; Rätsch, Gunnar.

Bioinformatics ; 37(Suppl_1): i169-i176, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252940

ABSTRACT

MOTIVATION: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. RESULTS: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. AVAILABILITY AND IMPLEMENTATION: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.

Subject(s)

Algorithms , Biomedical Research , Software
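
Illustrative sketch (not from the paper): the abstract above says RowDiff exploits the similarity between annotations of adjacent vertices. One way to picture this, under assumptions that the abstract does not spell out, is to store for each node only the symmetric difference between its label set and that of one chosen successor, while a few "anchor" nodes keep their full label set. The names sparsify, reconstruct, succ and anchors are hypothetical, and the sketch assumes every successor chain ends at an anchor or at a node without a successor.

```python
def sparsify(rows, succ, anchors):
    """Replace each row by its difference to the successor's row.

    rows:    dict node -> set of labels (the original annotation rows)
    succ:    dict node -> chosen successor node (assumption: one per node)
    anchors: set of nodes that keep their full row
    """
    diff = {}
    for node, row in rows.items():
        if node in anchors or node not in succ:
            diff[node] = set(row)                 # stored in full
        else:
            diff[node] = row ^ rows[succ[node]]   # symmetric difference
    return diff


def reconstruct(node, diff, succ, anchors):
    """Recover the original row by XOR-ing diffs along the successor chain."""
    row = set()
    while True:
        row ^= diff[node]
        if node in anchors or node not in succ:
            return row
        node = succ[node]


if __name__ == "__main__":
    rows = {0: {"a", "b"}, 1: {"a", "b"}, 2: {"a", "b", "c"}, 3: {"a", "c"}}
    succ = {0: 1, 1: 2, 2: 3}
    diff = sparsify(rows, succ, anchors={3})
    stored = sum(len(d) for d in diff.values())
    original = sum(len(r) for r in rows.values())
    print(stored, "entries stored instead of", original)
```

When neighbouring rows are similar, most stored differences are small or empty, which is why such a transformed matrix can be handed to a generic compressed binary matrix scheme.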

14.

Dynamic compression schemes for graph coloring.

Mustafa, Harun; Schilken, Ingo; Karasikov, Mikhail; Eickhoff, Carsten; Rätsch, Gunnar; Kahles, André.

Bioinformatics ; 35(3): 407-414, 2019 02 01.

Article in English | MEDLINE | ID: mdl-30020403

ABSTRACT

Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology , Data Compression , Software , Algorithms , Color , Genomics , High-Throughput Nucleotide Sequencing
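
Illustrative sketch (not from the paper): the abstract above mentions a lossy coloring based on a set of Bloom filters. The toy version below assumes one Bloom filter per color (sample), each holding that sample's k-mers. The filter size, the SHA-256-based hashing and all names (BloomFilter, build_coloring, colors_of) are assumptions made for illustration. A query can return extra colors (false positives) but never misses a color that was inserted.

```python
import hashlib


class BloomFilter:
    """Bit array with several hash functions; false positives are possible."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_coloring(samples, k, num_bits=1024, num_hashes=3):
    """Build one Bloom filter per color from that sample's k-mers."""
    coloring = {}
    for color, reads in samples.items():
        bloom = BloomFilter(num_bits, num_hashes)
        for read in reads:
            for i in range(len(read) - k + 1):
                bloom.add(read[i:i + k])
        coloring[color] = bloom
    return coloring


def colors_of(coloring, kmer):
    """Return the colors whose filter reports the k-mer (may over-report)."""
    return {color for color, bloom in coloring.items() if kmer in bloom}


if __name__ == "__main__":
    coloring = build_coloring({"s1": ["ACGTAC"], "s2": ["TTTTGG"]}, k=4)
    print(colors_of(coloring, "ACGT"))
```

Because each color lives in its own filter, colors can be built and stored independently, which fits the abstract's remark about independently stored modules.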

15.

Alternative Splicing Substantially Diversifies the Transcriptome during Early Photomorphogenesis and Correlates with the Energy Availability in Arabidopsis.

Hartmann, Lisa; Drewe-Boß, Philipp; Wießner, Theresa; Wagner, Gabriele; Geue, Sascha; Lee, Hsin-Chieh; Obermüller, Dominik M; Kahles, André; Behr, Jonas; Sinz, Fabian H; Rätsch, Gunnar; Wachter, Andreas.

Plant Cell ; 28(11): 2715-2734, 2016 11.

Article in English | MEDLINE | ID: mdl-27803310

ABSTRACT

Plants use light as a source of energy and information to detect diurnal rhythms and seasonal changes. Sensing changing light conditions is critical to adjust plant metabolism and to initiate developmental transitions. Here, we analyzed transcriptome-wide alterations in gene expression and alternative splicing (AS) of etiolated seedlings undergoing photomorphogenesis upon exposure to blue, red, or white light. Our analysis revealed massive transcriptome reprogramming as reflected by differential expression of ~20% of all genes and changes in several hundred AS events. For more than 60% of all regulated AS events, light promoted the production of a presumably protein-coding variant at the expense of an mRNA with nonsense-mediated decay-triggering features. Accordingly, AS of the putative splicing factor REDUCED RED-LIGHT RESPONSES IN CRY1CRY2 BACKGROUND1, previously identified as a red light signaling component, was shifted to the functional variant under light. Downstream analyses of candidate AS events pointed at a role of photoreceptor signaling only in monochromatic but not in white light. Furthermore, we demonstrated similar AS changes upon light exposure and exogenous sugar supply, with a critical involvement of kinase signaling. We propose that AS is an integration point of signaling pathways that sense and transmit information regarding the energy availability in plants.

Subject(s)

Alternative Splicing/physiology , Arabidopsis Proteins/metabolism , Arabidopsis/genetics , Transcriptome/genetics , Alternative Splicing/genetics , Arabidopsis/physiology , Arabidopsis Proteins/genetics , Gene Expression Regulation, Plant/genetics , Gene Expression Regulation, Plant/physiology , Signal Transduction/genetics , Signal Transduction/physiology

16.

MMR: a tool for read multi-mapper resolution.

Kahles, André; Behr, Jonas; Rätsch, Gunnar.

Bioinformatics ; 32(5): 770-2, 2016 03 01.

Article in English | MEDLINE | ID: mdl-26519503

ABSTRACT

MOTIVATION: Mapping high-throughput sequencing data to a reference genome is an essential step for most analysis pipelines aiming at the computational analysis of genome and transcriptome sequencing data. Breaking ties between equally well mapping locations poses a severe problem not only during the alignment phase but also has significant impact on the results of downstream analyses. We present the multi-mapper resolution (MMR) tool that infers optimal mapping locations from the coverage density of other mapped reads. RESULTS: Filtering alignments with MMR can significantly improve the performance of downstream analyses like transcript quantitation and differential testing. We illustrate that the accuracy (Spearman correlation) of transcript quantification increases by 15% when using reads of length 51. In addition, MMR decreases the alignment file sizes by more than 50%, and this leads to a reduced running time of the quantification tool. Our efficient implementation of the MMR algorithm is easily applicable as a post-processing step to existing alignment files in BAM format. Its complexity scales linearly with the number of alignments and requires no further inputs. AVAILABILITY AND IMPLEMENTATION: Open source code and documentation are available for download at http://github.com/ratschlab/mmr Comprehensive testing results and further information can be found at http://bioweb.me/mmr. CONTACT: andre.kahles@ratschlab.org or gunnar.ratsch@ratschlab.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , Algorithms , Genome , High-Throughput Nucleotide Sequencing , Sequence Alignment

17.

SplAdder: identification, quantification and testing of alternative splicing events from RNA-Seq data.

Kahles, André; Ong, Cheng Soon; Zhong, Yi; Rätsch, Gunnar.

Bioinformatics ; 32(12): 1840-7, 2016 06 15.

Article in English | MEDLINE | ID: mdl-26873928

ABSTRACT

MOTIVATION: Understanding the occurrence and regulation of alternative splicing (AS) is a key task towards explaining the regulatory processes that shape the complex transcriptomes of higher eukaryotes. With the advent of high-throughput sequencing of RNA (RNA-Seq), the diversity of AS transcripts could be measured at an unprecedented depth. Although the catalog of known AS events has grown ever since, novel transcripts are commonly observed when working with less well annotated organisms, in the context of disease, or within large populations. Whereas an identification of complete transcripts is technically challenging and computationally expensive, focusing on single splicing events as a proxy for transcriptome characteristics is fruitful and sufficient for a wide range of analyses. RESULTS: We present SplAdder, an alternative splicing toolbox, that takes RNA-Seq alignments and an annotation file as input to (i) augment the annotation based on RNA-Seq evidence, (ii) identify alternative splicing events present in the augmented annotation graph, (iii) quantify and confirm these events based on the RNA-Seq data and (iv) test for significant quantitative differences between samples. Thereby, our main focus lies on performance, accuracy and usability. AVAILABILITY: Source code and documentation are available for download at http://github.com/ratschlab/spladder Example data, introductory information and a small tutorial are accessible via http://bioweb.me/spladder CONTACTS: andre.kahles@ratschlab.org or gunnar.ratsch@ratschlab.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Alternative Splicing , Gene Expression Profiling , RNA , Sequence Analysis, RNA , Transcriptome

18.

Multiple reference genomes and transcriptomes for Arabidopsis thaliana.

Gan, Xiangchao; Stegle, Oliver; Behr, Jonas; Steffen, Joshua G; Drewe, Philipp; Hildebrand, Katie L; Lyngsoe, Rune; Schultheiss, Sebastian J; Osborne, Edward J; Sreedharan, Vipin T; Kahles, André; Bohnert, Regina; Jean, Géraldine; Derwent, Paul; Kersey, Paul; Belfield, Eric J; Harberd, Nicholas P; Kemen, Eric; Toomajian, Christopher; Kover, Paula X; Clark, Richard M; Rätsch, Gunnar; Mott, Richard.

Nature ; 477(7365): 419-23, 2011 Aug 28.

Article in English | MEDLINE | ID: mdl-21874022

ABSTRACT

Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.

Subject(s)

Arabidopsis/genetics , Gene Expression Profiling , Gene Expression Regulation, Plant/genetics , Genome, Plant/genetics , Transcription, Genetic/genetics , Arabidopsis/classification , Arabidopsis Proteins/genetics , Base Sequence , Genes, Plant/genetics , Genomics , Haplotypes/genetics , INDEL Mutation/genetics , Molecular Sequence Annotation , Phylogeny , Polymorphism, Single Nucleotide/genetics , Proteome/genetics , Seedlings/genetics , Sequence Analysis, DNA

19.

Systematic evaluation of spliced alignment programs for RNA-seq data.

Engström, Pär G; Steijger, Tamara; Sipos, Botond; Grant, Gregory R; Kahles, André; Rätsch, Gunnar; Goldman, Nick; Hubbard, Tim J; Harrow, Jennifer; Guigó, Roderic; Bertone, Paul.

Nat Methods ; 10(12): 1185-91, 2013 Dec.

Article in English | MEDLINE | ID: mdl-24185836

ABSTRACT

High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.

Subject(s)

RNA Splicing , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Animals , Chromosome Mapping/methods , Computational Biology/methods , Exons , False Positive Reactions , High-Throughput Nucleotide Sequencing/methods , Humans , K562 Cells , Mice , RNA, Messenger/metabolism , Reproducibility of Results , Software

20.

Nonsense-mediated decay of alternative precursor mRNA splicing variants is a major determinant of the Arabidopsis steady state transcriptome.

Drechsel, Gabriele; Kahles, André; Kesarwani, Anil K; Stauffer, Eva; Behr, Jonas; Drewe, Philipp; Rätsch, Gunnar; Wachter, Andreas.

Plant Cell ; 25(10): 3726-42, 2013 Oct.

Article in English | MEDLINE | ID: mdl-24163313

ABSTRACT

The nonsense-mediated decay (NMD) surveillance pathway can recognize erroneous transcripts and physiological mRNAs, such as precursor mRNA alternative splicing (AS) variants. Currently, information on the global extent of coupled AS and NMD remains scarce and even absent for any plant species. To address this, we conducted transcriptome-wide splicing studies using Arabidopsis thaliana mutants in the NMD factor homologs UP FRAMESHIFT1 (UPF1) and UPF3 as well as wild-type samples treated with the translation inhibitor cycloheximide. Our analyses revealed that at least 17.4% of all multi-exon, protein-coding genes produce splicing variants that are targeted by NMD. Moreover, we provide evidence that UPF1 and UPF3 act in a translation-independent mRNA decay pathway. Importantly, 92.3% of the NMD-responsive mRNAs exhibit classical NMD-eliciting features, supporting their authenticity as direct targets. Genes generating NMD-sensitive AS variants function in diverse biological processes, including signaling and protein modification, for which NaCl stress-modulated AS-NMD was found. Besides mRNAs, numerous noncoding RNAs and transcripts derived from intergenic regions were shown to be NMD responsive. In summary, we provide evidence for a major function of AS-coupled NMD in shaping the Arabidopsis transcriptome, having fundamental implications in gene regulation and quality control of transcript processing.

Subject(s)

Alternative Splicing , Arabidopsis/genetics , Nonsense Mediated mRNA Decay , Transcriptome , Arabidopsis Proteins/genetics , Gene Expression Regulation, Plant , Genotype , Mutation , RNA Helicases/genetics , RNA, Plant/genetics , Sequence Analysis, RNA
