Search | VHL Regional Portal

1.

Label-guided seed-chain-extend alignment on annotated De Bruijn graphs.

Mustafa, Harun; Karasikov, Mikhail; Mansouri Ghiasi, Nika; Rätsch, Gunnar; Kahles, André.

Bioinformatics ; 40(Supplement_1): i337-i346, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940164

ABSTRACT

MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.
AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
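
The fragmentation problem described in the abstract above can be illustrated with a toy annotated De Bruijn graph. The following is a minimal sketch, not the MLA implementation: the sample names, reads, the value of k, and all function names are hypothetical, and plain Python dictionaries stand in for the compressed index a real tool would use. It shows a query that no single sample label covers end to end, although the union of labels does.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_annotated_dbg(samples, k):
    """Map each k-mer (graph node) to the set of sample labels containing it."""
    annotation = defaultdict(set)
    for label, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                annotation[kmer].add(label)
    return annotation


def labels_covering_query(query, annotation, k):
    """Return the labels whose subgraph contains every k-mer of the query."""
    label_sets = [annotation.get(kmer, set()) for kmer in kmers(query, k)]
    return set.intersection(*label_sets) if label_sets else set()


def covered_by_label_union(query, annotation, k):
    """Return True if every query k-mer is present under at least one label."""
    return all(annotation.get(kmer) for kmer in kmers(query, k))


# Two hypothetical low-depth samples, each holding only part of the sequence.
samples = {"A": ["ACGTAC"], "B": ["TACGGA"]}
k = 3
annotation = build_annotated_dbg(samples, k)

query = "ACGTACGGA"
single = labels_covering_query(query, annotation, k)     # no single label suffices
combined = covered_by_label_union(query, annotation, k)  # the union of labels does
print(single, combined)
```

Combining labels restores coverage here, but an unrestricted union would also accept combinations of unrelated samples, which is the concern the abstract raises about label-free aligners.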

Subject(s)

Algorithms , Sequence Alignment , Sequence Alignment/methods , Software , Computational Biology/methods , Sequence Analysis, DNA/methods , Databases, Genetic

2.

Probabilistic pathway-based multimodal factor analysis.

Immer, Alexander; Stark, Stefan G; Jacob, Francis; Bonilla, Ximena; Thomas, Tinu; Kahles, André; Goetze, Sandra; Milani, Emanuela S; Wollscheid, Bernd; Rätsch, Gunnar; Lehmann, Kjong-Van.

Bioinformatics ; 40(Supplement_1): i189-i198, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940152

ABSTRACT

MOTIVATION: Multimodal profiling strategies promise to produce more informative insights into biomedical cohorts via the integration of the information each modality contributes. To perform this integration, however, the development of novel analytical strategies is needed. Multimodal profiling strategies often come at the expense of lower sample numbers, which can challenge methods to uncover shared signals across a cohort. Thus, factor analysis approaches are commonly used for the analysis of high-dimensional data in molecular biology; however, they typically do not yield representations that are directly interpretable, whereas many research questions center around the analysis of pathways associated with specific observations. RESULTS: We develop PathFA, a novel approach for multimodal factor analysis over the space of pathways. PathFA produces integrative and interpretable views across multimodal profiling technologies, which allow for the derivation of concrete hypotheses. PathFA combines a pathway-learning approach with integrative multimodal capability under a Bayesian procedure that is efficient, hyper-parameter free, and able to automatically infer observation noise from the data. We demonstrate strong performance on small sample sizes within our simulation framework and on matched proteomics and transcriptomics profiles from real tumor samples taken from the Swiss Tumor Profiler consortium. On a subcohort of melanoma patients, PathFA recovers pathway activity that has been independently associated with poor outcome. We further demonstrate the ability of this approach to identify pathways associated with the presence of specific cell-types as well as tumor heterogeneity. Our results show that we capture known biology, making it well suited for analyzing multimodal sample cohorts. AVAILABILITY AND IMPLEMENTATION: The tool is implemented in Python and available at https://github.com/ratschlab/path-fa.

Subject(s)

Bayes Theorem , Humans , Proteomics/methods , Factor Analysis, Statistical , Gene Expression Profiling/methods , Melanoma/metabolism , Algorithms , Computational Biology/methods

3.

SimReadUntil for benchmarking selective sequencing algorithms on ONT devices.

Mordig, Maximilian; Rätsch, Gunnar; Kahles, André.

Bioinformatics ; 40(5), 2024 May 02.

Article in English | MEDLINE | ID: mdl-38603597

ABSTRACT

MOTIVATION: The Oxford Nanopore Technologies (ONT) ReadUntil API enables selective sequencing, which aims to selectively favor interesting over uninteresting reads, e.g. to deplete or enrich certain genomic regions. The performance gain depends on the selective sequencing decision-making algorithm (SSDA) which decides whether to reject a read, stop receiving a read, or wait for more data. Since real runs are time-consuming and costly, simulating the ONT sequencer with support for the ReadUntil API is highly beneficial for comparing and optimizing new SSDAs. Existing software like MinKNOW and UNCALLED only return raw signal data, are memory-intensive, require huge and often unavailable multi-fast5 files (≥100 GB) and are not clearly documented. RESULTS: We present the ONT device simulator SimReadUntil that takes a set of full reads as input, distributes them to channels and plays them back in real time including mux scans, channel gaps and blockages, and lets users reject reads as well as stop receiving data from them. Our modified ReadUntil API provides the basecalled reads rather than the raw signal, reducing computational load and focusing on the SSDA rather than on basecalling. Tuning the parameters of tools like ReadFish and ReadBouncer becomes easier because a GPU for basecalling is no longer required. We offer various methods to extract simulation parameters from a sequencing summary file and adapt ReadFish to replicate one of their enrichment experiments. SimReadUntil's gRPC interface allows standardized interaction with a wide range of programming languages. AVAILABILITY AND IMPLEMENTATION: Code and fully worked examples are available on GitHub (https://github.com/ratschlab/sim_read_until).
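
The three SSDA outcomes named in the abstract above (reject a read, stop receiving a read, or wait for more data) can be sketched as a small decision function over a basecalled read prefix. This is an illustrative toy under stated assumptions: it does not use the SimReadUntil or ReadUntil API, and the function name `decide`, the k-mer matching rule, and the thresholds `min_len` and `hit_fraction` are all hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def decide(prefix, target_kmers, k=4, min_len=8, hit_fraction=0.5):
    """Toy enrichment rule for one in-progress read (hypothetical thresholds).

    Returns one of:
      "wait"           - prefix too short to judge, ask for more data
      "stop_receiving" - looks on-target, keep sequencing without further chunks
      "reject"         - looks off-target, eject the read from the pore
    """
    if len(prefix) < max(min_len, k):
        return "wait"
    read_kmers = kmer_set(prefix, k)
    hits = len(read_kmers & target_kmers) / len(read_kmers)
    return "stop_receiving" if hits >= hit_fraction else "reject"


# Hypothetical target region to enrich for.
target = "ACGTACGTTGCA"
targets = kmer_set(target, 4)

print(decide("ACG", targets))       # too short to decide
print(decide("ACGTACGT", targets))  # every 4-mer matches the target
print(decide("GGGGGGGG", targets))  # no 4-mer matches the target
```

A practical SSDA would typically map the prefix against a reference rather than compare exact k-mers; the point here is only the three-way decision that a simulator has to support.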

Subject(s)

Algorithms , Benchmarking , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing/methods

4.

Mutant SF3B1 promotes malignancy in PDAC.

Simmler, Patrik; Ioannidi, Eleonora I; Mengis, Tamara; Marquart, Kim Fabiano; Asawa, Simran; Van-Lehmann, Kjong; Kahles, Andre; Thomas, Tinu; Schwerdel, Cornelia; Aceto, Nicola; Rätsch, Gunnar; Stoffel, Markus; Schwank, Gerald.

Elife ; 12, 2023 10 12.

Article in English | MEDLINE | ID: mdl-37823551

ABSTRACT

The splicing factor SF3B1 is recurrently mutated in various tumors, including pancreatic ductal adenocarcinoma (PDAC). The impact of the hotspot mutation SF3B1-K700E on PDAC pathogenesis, however, remains elusive. Here, we demonstrate that Sf3b1-K700E alone is insufficient to induce malignant transformation of the murine pancreas, but that it increases aggressiveness of PDAC if it co-occurs with mutated KRAS and p53. We further show that Sf3b1-K700E already plays a role during early stages of pancreatic tumor progression and reduces the expression of TGF-β1-responsive epithelial-mesenchymal transition (EMT) genes. Moreover, we found that SF3B1-K700E confers resistance to TGF-β1-induced cell death in pancreatic organoids and cell lines, partly mediated through aberrant splicing of Map3k7. Overall, our findings demonstrate that SF3B1-K700E acts as an oncogenic driver in PDAC, and suggest that it promotes the progression of early stage tumors by impeding the cellular response to tumor suppressive effects of TGF-β.

Subject(s)

Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Animals , Humans , Mice , Carcinoma, Pancreatic Ductal/pathology , Cell Line, Tumor , Mutation , Pancreatic Ducts/metabolism , Pancreatic Neoplasms/pathology , Phosphoproteins/metabolism , RNA Splicing Factors/metabolism , Transcription Factors/metabolism , Transforming Growth Factor beta1/metabolism

5.

Aligning distant sequences to graphs using long seed sketches.

Joudaki, Amir; Meterez, Alexandru; Mustafa, Harun; Groot Koerkamp, Ragnar; Kahles, André; Rätsch, Gunnar.

Genome Res ; 33(7): 1208-1217, 2023 07.

Article in English | MEDLINE | ID: mdl-37072187

ABSTRACT

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text]. For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.

Subject(s)

Algorithms , Computational Biology , Computational Biology/methods , Sequence Alignment , Sequence Analysis, DNA/methods

6.

Author Correction: Genomic basis for RNA alterations in cancer.

Calabrese, Claudia; Davidson, Natalie R; Demircioglu, Deniz; Fonseca, Nuno A; He, Yao; Kahles, André; Lehmann, Kjong-Van; Liu, Fenglin; Shiraishi, Yuichi; Soulette, Cameron M; Urban, Lara; Greger, Liliana; Li, Siliang; Liu, Dongbing; Perry, Marc D; Xiang, Qian; Zhang, Fan; Zhang, Junjun; Bailey, Peter; Erkek, Serap; Hoadley, Katherine A; Hou, Yong; Huska, Matthew R; Kilpinen, Helena; Korbel, Jan O; Marin, Maximillian G; Markowski, Julia; Nandi, Tannistha; Pan-Hammarström, Qiang; Pedamallu, Chandra Sekhar; Siebert, Reiner; Stark, Stefan G; Su, Hong; Tan, Patrick; Waszak, Sebastian M; Yung, Christina; Zhu, Shida; Awadalla, Philip; Creighton, Chad J; Meyerson, Matthew; Ouellette, B F Francis; Wu, Kui; Yang, Huanming; Brazma, Alvis; Brooks, Angela N; Göke, Jonathan; Rätsch, Gunnar; Schwarz, Roland F; Stegle, Oliver; Zhang, Zemin.

Nature ; 614(7948): E37, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36697831

7.

Author Correction: Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.

Rheinbay, Esther; Nielsen, Morten Muhlig; Abascal, Federico; Wala, Jeremiah A; Shapira, Ofer; Tiao, Grace; Hornshøj, Henrik; Hess, Julian M; Juul, Randi Istrup; Lin, Ziao; Feuerbach, Lars; Sabarinathan, Radhakrishnan; Madsen, Tobias; Kim, Jaegil; Mularoni, Loris; Shuai, Shimin; Lanzós, Andrés; Herrmann, Carl; Maruvka, Yosef E; Shen, Ciyue; Amin, Samirkumar B; Bandopadhayay, Pratiti; Bertl, Johanna; Boroevich, Keith A; Busanovich, John; Carlevaro-Fita, Joana; Chakravarty, Dimple; Chan, Calvin Wing Yiu; Craft, David; Dhingra, Priyanka; Diamanti, Klev; Fonseca, Nuno A; Gonzalez-Perez, Abel; Guo, Qianyun; Hamilton, Mark P; Haradhvala, Nicholas J; Hong, Chen; Isaev, Keren; Johnson, Todd A; Juul, Malene; Kahles, Andre; Kahraman, Abdullah; Kim, Youngwook; Komorowski, Jan; Kumar, Kiran; Kumar, Sushant; Lee, Donghoon; Lehmann, Kjong-Van; Li, Yilong; Liu, Eric Minwei.

Nature ; 614(7948): E40, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36697832

8.

A history of the MetaSUB consortium: Tracking urban microbes around the globe.

Ryon, Krista A; Tierney, Braden T; Frolova, Alina; Kahles, Andre; Desnues, Christelle; Ouzounis, Christos; Gibas, Cynthis; Bezdan, Daniela; Deng, Youping; He, Ding; Dias-Neto, Emmanuel; Elhaik, Eran; Afshin, Evan; Grills, George; Iraola, Gregorio; Suzuki, Haruo; Werner, Johannes; Udekwu, Klas; Schriml, Lynn; Bhattacharyya, Malay; Oliveira, Manuela; Zambrano, Maria Mercedes; Hazrin-Chong, Nur Hazlin; Osuolale, Olayinka; Labaj, Pawel P; Tiasse, Prisca; Rapuri, Sampath; Borras, Silvia; Pozdniakova, Sofya; Shi, Tieliu; Sezerman, Ugur; Rodo, Xavier; Sezer, Zehra Hazal; Mason, Christopher E.

iScience ; 25(11): 104993, 2022 Nov 18.

Article in English | MEDLINE | ID: mdl-36299999

ABSTRACT

The MetaSUB Consortium, founded in 2015, is a global consortium with an interdisciplinary team of clinicians, scientists, bioinformaticians, engineers, and designers, with members from more than 100 countries across the globe. This network has continually collected samples from urban and rural sites including subways and transit systems, sewage systems, hospitals, and other environmental sampling. These collections have been ongoing since 2015 and have continued when possible, even throughout the COVID-19 pandemic. The consortium has optimized their workflow for the collection, isolation, and sequencing of DNA and RNA collected from these various sites and processing them for metagenomics analysis, including the identification of SARS-CoV-2 and its variants. Here, the Consortium describes its foundations, and its ongoing work to expand on this network and to focus its scope on the mapping, annotation, and prediction of emerging pathogens, mapping microbial evolution and antibiotic resistance, and the discovery of novel organisms and biosynthetic gene clusters.

9.

RNA Instant Quality Check: Alignment-Free RNA-Degradation Detection.

Lehmann, Kjong-van; Kahles, Andre; Murr, Magdalena; Rätsch, Gunnar.

J Comput Biol ; 29(8): 857-866, 2022 08.

Article in English | MEDLINE | ID: mdl-35776515

ABSTRACT

With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity (RIN) score, or on postalignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5'/3' bias based on all genes in an alignment-free manner. That enables degradation assessment right after data generation and not during the analysis procedure allowing for early intervention in the sample handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores enabling the accurate identification of degraded samples.

Subject(s)

RNA Stability , RNA , Genomics , RNA/chemistry , RNA/genetics , Sequence Analysis, RNA/methods

10.

SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing.

Rozhonová, Hana; Danciu, Daniel; Stark, Stefan; Rätsch, Gunnar; Kahles, André; Lehmann, Kjong-Van.

Bioinformatics ; 38(18): 4293-4300, 2022 09 15.

Article in English | MEDLINE | ID: mdl-35900151

ABSTRACT

MOTIVATION: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing. RESULTS: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of ≈2000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03×, achieving an Adjusted Rand Index (ARI) score of ≈0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of ≈0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDO's performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7.
Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants. AVAILABILITY AND IMPLEMENTATION: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
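
The Adjusted Rand Index (ARI) quoted in the abstract above compares two clusterings of the same cells and corrects the pair-counting agreement for chance, so that 1 means identical partitions and values near 0 mean agreement no better than random. As a reading aid, here is a minimal sketch of the standard ARI formula using only the Python standard library; it is not taken from the SECEDO code base, and the function name and example labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair-counting agreement between two clusterings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency table cells and its row/column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Same partition with swapped cluster names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every true cluster split evenly across predicted clusters: below chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Because the index depends only on which cells are grouped together, renaming clusters does not change the score.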

Subject(s)

High-Throughput Nucleotide Sequencing , Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Bayes Theorem , Sequence Analysis, DNA , Genome , Base Sequence , Neoplasms/genetics , Polymorphism, Single Nucleotide

11.

Biosynthetic potential of the global ocean microbiome.

Paoli, Lucas; Ruscheweyh, Hans-Joachim; Forneris, Clarissa C; Hubrich, Florian; Kautsar, Satria; Bhushan, Agneya; Lotti, Alessandro; Clayssen, Quentin; Salazar, Guillem; Milanese, Alessio; Carlström, Charlotte I; Papadopoulou, Chrysa; Gehrig, Daniel; Karasikov, Mikhail; Mustafa, Harun; Larralde, Martin; Carroll, Laura M; Sánchez, Pablo; Zayed, Ahmed A; Cronin, Dylan R; Acinas, Silvia G; Bork, Peer; Bowler, Chris; Delmont, Tom O; Gasol, Josep M; Gossert, Alvar D; Kahles, André; Sullivan, Matthew B; Wincker, Patrick; Zeller, Georg; Robinson, Serina L; Piel, Jörn; Sunagawa, Shinichi.

Nature ; 607(7917): 111-118, 2022 07.

Article in English | MEDLINE | ID: mdl-35732736

ABSTRACT

Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups [1], this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds [2,3]. However, studying this diversity to identify genomic pathways for the synthesis of such compounds [4] and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters ('Candidatus Eudoremicrobiaceae') that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.

Subject(s)

Biosynthetic Pathways , Microbiota , Oceans and Seas , Bacteria/classification , Bacteria/genetics , Biosynthetic Pathways/genetics , Genomics , Microbiota/genetics , Multigene Family/genetics , Phylogeny

12.

Identification, Quantification, and Testing of Alternative Splicing Events from RNA-Seq Data Using SplAdder.

Markolin, Philipp; Rätsch, Gunnar; Kahles, André.

Methods Mol Biol ; 2493: 167-193, 2022.

Article in English | MEDLINE | ID: mdl-35751815

ABSTRACT

Alternative splicing (AS) is a regulatory process during mRNA maturation that shapes higher eukaryotes' complex transcriptomes. High-throughput sequencing of RNA (RNA-Seq) allows for measurements of AS transcripts at an unprecedented depth and diversity. The ever-expanding catalog of known AS events provides biological insights into gene regulation, population genetics, or in the context of disease. Here, we present an overview on the usage of SplAdder, a graph-based alternative splicing toolbox, which can integrate an arbitrarily large number of RNA-Seq alignments and a given annotation file to augment the shared annotation based on RNA-Seq evidence. The shared augmented annotation graph is then used to identify, quantify, and confirm alternative splicing events based on the RNA-Seq data. Splice graphs for individual alignments can also be tested for significant quantitative differences between other samples or groups of samples.

Subject(s)

Alternative Splicing , RNA , High-Throughput Nucleotide Sequencing , RNA/genetics , RNA-Seq , Sequence Analysis, RNA

13.

Lossless indexing with counting de Bruijn graphs.

Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André.

Genome Res ; 2022 May 24.

Article in English | MEDLINE | ID: mdl-35609994

ABSTRACT

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
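
The idea of attaching attributes to each node-label relation can be sketched with a toy index. This is only an illustration under stated assumptions: nested Python dictionaries replace the compressed data structures of the actual work, and the sample names, reads, and function names are hypothetical. Each (k-mer, label) pair stores the positions at which the k-mer occurs, the count is the number of positions, and the positions are enough to rebuild a read exactly.

```python
from collections import defaultdict


def build_counting_index(samples, k):
    """index[kmer][label] -> list of (read_id, offset) occurrences."""
    index = defaultdict(lambda: defaultdict(list))
    for label, reads in samples.items():
        for read_id, read in enumerate(reads):
            for offset in range(len(read) - k + 1):
                index[read[offset:offset + k]][label].append((read_id, offset))
    return index


def kmer_count(index, kmer, label):
    """Abundance of a k-mer in one sample (0 if absent)."""
    return len(index.get(kmer, {}).get(label, []))


def reconstruct_read(index, label, read_id):
    """Rebuild a read from positional annotations (reads shorter than k are lost)."""
    placed = sorted(
        (offset, kmer)
        for kmer, per_label in index.items()
        for rid, offset in per_label.get(label, [])
        if rid == read_id
    )
    if not placed:
        return ""
    # First k-mer in full, then the last character of each following k-mer.
    return placed[0][1] + "".join(kmer[-1] for _, kmer in placed[1:])


samples = {"s1": ["ACGACG"], "s2": ["ACGT"]}  # hypothetical samples
index = build_counting_index(samples, k=3)

print(kmer_count(index, "ACG", "s1"))     # "ACG" occurs twice in s1
print(reconstruct_read(index, "s1", 0))   # positions recover the original read
```

Storing only the counts would support abundance queries, while storing positions additionally makes the index lossless in the sense the abstract describes.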

14.

Topology-based sparsification of graph annotations.

Danciu, Daniel; Karasikov, Mikhail; Mustafa, Harun; Kahles, André; Rätsch, Gunnar.

Bioinformatics ; 37(Suppl_1): i169-i176, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252940

ABSTRACT

MOTIVATION: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. RESULTS: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. AVAILABILITY AND IMPLEMENTATION: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.
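
The sparsification idea in the abstract above, exploiting the similarity between annotations of adjacent vertices, can be sketched as follows. This is a simplified illustration rather than the RowDiff implementation: it assumes that each node stores only the symmetric difference between its own label set and that of one chosen successor, that designated anchor nodes (and nodes without a successor) keep their full row, and that every cycle contains an anchor. The node names, labels, and function names are hypothetical.

```python
def row_diff_encode(rows, successor, anchors):
    """Replace each annotation row by its difference to the successor's row.

    rows:      node -> set of labels (the original annotation matrix rows)
    successor: node -> chosen successor node, or missing/None at path ends
    anchors:   nodes that keep their full row
    """
    encoded = {}
    for node, labels in rows.items():
        nxt = successor.get(node)
        if node in anchors or nxt is None:
            encoded[node] = set(labels)          # stored in full
        else:
            encoded[node] = labels ^ rows[nxt]   # usually small or empty
    return encoded


def row_diff_decode(node, encoded, successor, anchors):
    """Recover a node's labels by XOR-ing diffs along the path to a full row."""
    labels = set()
    while True:
        labels ^= encoded[node]
        nxt = successor.get(node)
        if node in anchors or nxt is None:
            return labels
        node = nxt


# A hypothetical path n1 -> n2 -> n3 -> n4 with slowly changing labels.
rows = {
    "n1": {"A", "B"},
    "n2": {"A", "B"},
    "n3": {"A", "B", "C"},
    "n4": {"A", "B", "C"},
}
successor = {"n1": "n2", "n2": "n3", "n3": "n4"}
anchors = {"n4"}

encoded = row_diff_encode(rows, successor, anchors)
stored_before = sum(len(r) for r in rows.values())
stored_after = sum(len(r) for r in encoded.values())
print(stored_before, stored_after)
```

The encoded rows are much sparser than the originals when neighbouring nodes share most labels, which is why the result combines well with generic compressed binary matrix schemes. Decoding cost grows with the distance to the nearest full row, so anchor placement trades space against query time.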

Subject(s)

Algorithms , Biomedical Research , Software

15.

Building an international consortium for tracking coronavirus health status.

Segal, Eran; Zhang, Feng; Lin, Xihong; King, Gary; Shalem, Ophir; Shilo, Smadar; Allen, William E; Alquaddoomi, Faisal; Altae-Tran, Han; Anders, Simon; Balicer, Ran; Bauman, Tal; Bonilla, Ximena; Booman, Gisel; Chan, Andrew T; Cohen, Ori; Coletti, Silvano; Davidson, Natalie; Dor, Yuval; Drew, David A; Elemento, Olivier; Evans, Georgina; Ewels, Phil; Gale, Joshua; Gavrieli, Amir; Geiger, Benjamin; Grad, Yonatan H; Greene, Casey S; Hajirasouliha, Iman; Jerala, Roman; Kahles, Andre; Kallioniemi, Olli; Keshet, Ayya; Kocarev, Ljupco; Landua, Gregory; Meir, Tomer; Muller, Aline; Nguyen, Long H; Oresic, Matej; Ovchinnikova, Svetlana; Peterson, Hedi; Prodanova, Jana; Rajagopal, Jay; Rätsch, Gunnar; Rossman, Hagai; Rung, Johan; Sboner, Andrea; Sigaras, Alexandros; Spector, Tim; Steinherz, Ron.

Nat Med ; 26(8): 1161-1165, 2020 08.

Article in English | MEDLINE | ID: mdl-32488218

Subject(s)

Betacoronavirus/pathogenicity , Coronavirus Infections/epidemiology , Pandemics/statistics & numerical data , Pneumonia, Viral/epidemiology , Surveys and Questionnaires/statistics & numerical data , COVID-19 , Coronavirus Infections/prevention & control , Coronavirus Infections/virology , Health Status , Humans , Pandemics/prevention & control , Pneumonia, Viral/prevention & control , Pneumonia, Viral/virology , SARS-CoV-2

16.

Publisher Correction: Building an international consortium for tracking coronavirus health status.

Segal, Eran; Zhang, Feng; Lin, Xihong; King, Gary; Shalem, Ophir; Shilo, Smadar; Allen, William E; Alquaddoomi, Faisal; Altae-Tran, Han; Anders, Simon; Balicer, Ran; Bauman, Tal; Bonilla, Ximena; Booman, Gisel; Chan, Andrew T; Cohen, Ori; Coletti, Silvano; Davidson, Natalie; Dor, Yuval; Drew, David A; Elemento, Olivier; Evans, Georgina; Ewels, Phil; Gale, Joshua; Gavrieli, Amir; Geiger, Benjamin; Grad, Yonatan H; Greene, Casey S; Hajirasouliha, Iman; Jerala, Roman; Kahles, Andre; Kallioniemi, Olli; Keshet, Ayya; Kocarev, Ljupco; Landua, Gregory; Meir, Tomer; Muller, Aline; Nguyen, Long H; Oresic, Matej; Ovchinnikova, Svetlana; Peterson, Hedi; Prodanova, Jana; Rajagopal, Jay; Rätsch, Gunnar; Rossman, Hagai; Rung, Johan; Sboner, Andrea; Sigaras, Alexandros; Spector, Tim; Steinherz, Ron.

Nat Med ; 26(8): 1309, 2020 08.

Article in English | MEDLINE | ID: mdl-32591764

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

17.

Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.

Rheinbay, Esther; Nielsen, Morten Muhlig; Abascal, Federico; Wala, Jeremiah A; Shapira, Ofer; Tiao, Grace; Hornshøj, Henrik; Hess, Julian M; Juul, Randi Istrup; Lin, Ziao; Feuerbach, Lars; Sabarinathan, Radhakrishnan; Madsen, Tobias; Kim, Jaegil; Mularoni, Loris; Shuai, Shimin; Lanzós, Andrés; Herrmann, Carl; Maruvka, Yosef E; Shen, Ciyue; Amin, Samirkumar B; Bandopadhayay, Pratiti; Bertl, Johanna; Boroevich, Keith A; Busanovich, John; Carlevaro-Fita, Joana; Chakravarty, Dimple; Chan, Calvin Wing Yiu; Craft, David; Dhingra, Priyanka; Diamanti, Klev; Fonseca, Nuno A; Gonzalez-Perez, Abel; Guo, Qianyun; Hamilton, Mark P; Haradhvala, Nicholas J; Hong, Chen; Isaev, Keren; Johnson, Todd A; Juul, Malene; Kahles, Andre; Kahraman, Abdullah; Kim, Youngwook; Komorowski, Jan; Kumar, Kiran; Kumar, Sushant; Lee, Donghoon; Lehmann, Kjong-Van; Li, Yilong; Liu, Eric Minwei.

Nature ; 578(7793): 102-111, 2020 02.

Article in English | MEDLINE | ID: mdl-32025015

ABSTRACT

The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.

Subject(s)

Genome, Human/genetics , Mutation/genetics , Neoplasms/genetics , DNA Breaks , Databases, Genetic , Gene Expression Regulation, Neoplastic , Genome-Wide Association Study , Humans , INDEL Mutation

18.

Genomic basis for RNA alterations in cancer.

Calabrese, Claudia; Davidson, Natalie R; Demircioglu, Deniz; Fonseca, Nuno A; He, Yao; Kahles, André; Lehmann, Kjong-Van; Liu, Fenglin; Shiraishi, Yuichi; Soulette, Cameron M; Urban, Lara; Greger, Liliana; Li, Siliang; Liu, Dongbing; Perry, Marc D; Xiang, Qian; Zhang, Fan; Zhang, Junjun; Bailey, Peter; Erkek, Serap; Hoadley, Katherine A; Hou, Yong; Huska, Matthew R; Kilpinen, Helena; Korbel, Jan O; Marin, Maximillian G; Markowski, Julia; Nandi, Tannistha; Pan-Hammarström, Qiang; Pedamallu, Chandra Sekhar; Siebert, Reiner; Stark, Stefan G; Su, Hong; Tan, Patrick; Waszak, Sebastian M; Yung, Christina; Zhu, Shida; Awadalla, Philip; Creighton, Chad J; Meyerson, Matthew; Ouellette, B F Francis; Wu, Kui; Yang, Huanming; Brazma, Alvis; Brooks, Angela N; Göke, Jonathan; Rätsch, Gunnar; Schwarz, Roland F; Stegle, Oliver; Zhang, Zemin.

Nature ; 578(7793): 129-136, 2020 02.

Article in English | MEDLINE | ID: mdl-32025019

ABSTRACT

Transcript alterations often result from somatic changes in cancer genomes [1]. Various forms of RNA alterations have been described in cancer, including overexpression [2], altered splicing [3] and gene fusions [4]; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) [5]. Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed 'bridged' fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.

Subject(s)

Gene Expression Regulation, Neoplastic , Neoplasms/genetics , RNA/genetics , DNA Copy Number Variations , DNA, Neoplasm , Genome, Human , Genomics , Humans , Transcriptome

19.

Sparse Binary Relation Representations for Genome Graph Annotation.

Karasikov, Mikhail; Mustafa, Harun; Joudaki, Amir; Javadzadeh-No, Sara; Rätsch, Gunnar; Kahles, André.

J Comput Biol ; 27(4): 626-639, 2020 04.

Article in English | MEDLINE | ID: mdl-31891531

ABSTRACT

High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.

Subject(s)

Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Computational Biology , Data Compression , Humans , Molecular Sequence Annotation/methods

20.

NGS-Based S. aureus Typing and Outbreak Analysis in Clinical Microbiology Laboratories: Lessons Learned From a Swiss-Wide Proficiency Test.

Dylus, David; Pillonel, Trestan; Opota, Onya; Wüthrich, Daniel; Seth-Smith, Helena M B; Egli, Adrian; Leo, Stefano; Lazarevic, Vladimir; Schrenzel, Jacques; Laurent, Sacha; Bertelli, Claire; Blanc, Dominique S; Neuenschwander, Stefan; Ramette, Alban; Falquet, Laurent; Imkamp, Frank; Keller, Peter M; Kahles, Andre; Oberhaensli, Simone; Barbié, Valérie; Dessimoz, Christophe; Greub, Gilbert; Lebrand, Aitana.

Front Microbiol ; 11: 591093, 2020.

Article in English | MEDLINE | ID: mdl-33424794

ABSTRACT

Whole genome sequencing (WGS) enables high-resolution typing of bacteria up to the single nucleotide polymorphism (SNP) level. WGS is used in clinical microbiology laboratories for infection control, molecular surveillance and outbreak analyses. Given the large palette of WGS reagents and bioinformatics tools, the Swiss clinical bacteriology community decided to conduct a ring trial (RT) to foster harmonization of NGS-based bacterial typing. The RT aimed at assessing methicillin-susceptible Staphylococcus aureus strain relatedness from WGS and epidemiological data. The RT was designed to disentangle the variability arising from differences in sample preparation, SNP calling and phylogenetic methods. Nine laboratories participated. The resulting phylogenetic tree and cluster identification were highly reproducible across the laboratories. Cluster interpretation was, however, more laboratory dependent, suggesting that an increased sharing of expertise across laboratories would contribute to further harmonization of practices. More detailed bioinformatic analyses revealed that while similar clusters were found across laboratories, these were actually based on different sets of SNPs, differentially retained after sample preparation and SNP calling procedures. Despite this, the observed number of SNP differences between pairs of strains, an important criterion to determine strain relatedness given epidemiological information, was similar across pipelines for closely related strains when restricting SNP calls to a common core genome defined by the S. aureus cgMLST schema. The lessons learned from this pilot study will serve the implementation of larger-scale RTs, as a means of providing regular external quality assessments for laboratories performing WGS analyses in a clinical setting.
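
The pairwise SNP-difference criterion mentioned in the abstract can be illustrated with a small sketch. It counts differing positions between every pair of aligned strain sequences, optionally restricted to a set of core positions (standing in for a core genome such as one defined by a cgMLST schema). The function name `pairwise_snp_distances`, the toy sequences, and the rule of skipping ambiguous bases are assumptions made for illustration; they do not reproduce any pipeline used in the ring trial.

```python
from itertools import combinations

VALID_BASES = frozenset("ACGT")


def pairwise_snp_distances(alignment, core_positions=None):
    """Count SNP differences between every pair of aligned sequences.

    alignment: dict mapping strain name -> aligned sequence (equal lengths).
    core_positions: optional iterable of 0-based positions to restrict the
        comparison to; if None, all alignment columns are compared.
    Positions where either strain has a non-ACGT symbol are skipped.
    """
    names = sorted(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same aligned length")

    length = lengths.pop() if lengths else 0
    positions = range(length) if core_positions is None else sorted(core_positions)

    distances = {}
    for a, b in combinations(names, 2):
        seq_a, seq_b = alignment[a], alignment[b]
        distances[(a, b)] = sum(
            1
            for i in positions
            if seq_a[i] in VALID_BASES
            and seq_b[i] in VALID_BASES
            and seq_a[i] != seq_b[i]
        )
    return distances


if __name__ == "__main__":
    # Toy alignment of three hypothetical strains.
    toy = {"s1": "ACGTAC", "s2": "ACGTAT", "s3": "TCGNAT"}
    print(pairwise_snp_distances(toy))
    print(pairwise_snp_distances(toy, core_positions={1, 2, 3, 4, 5}))
```

Restricting the comparison to a shared set of positions is what makes distances from different pipelines comparable: each pipeline may retain a different set of SNPs, but the counts are taken over the same columns.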
