Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 37
Filter
Add more filters










Publication year range
1.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38603597

ABSTRACT

MOTIVATION: The Oxford Nanopore Technologies (ONT) ReadUntil API enables selective sequencing, which aims to selectively favor interesting over uninteresting reads, e.g. to deplete or enrich certain genomic regions. The performance gain depends on the selective sequencing decision-making algorithm (SSDA) which decides whether to reject a read, stop receiving a read, or wait for more data. Since real runs are time-consuming and costly, simulating the ONT sequencer with support for the ReadUntil API is highly beneficial for comparing and optimizing new SSDAs. Existing software like MinKNOW and UNCALLED only return raw signal data, are memory-intensive, require huge and often unavailable multi-fast5 files (≥100GB) and are not clearly documented. RESULTS: We present the ONT device simulator SimReadUntil that takes a set of full reads as input, distributes them to channels and plays them back in real time including mux scans, channel gaps and blockages, and allows to reject reads as well as stop receiving data from them. Our modified ReadUntil API provides the basecalled reads rather than the raw signal, reducing computational load and focusing on the SSDA rather than on basecalling. Tuning the parameters of tools like ReadFish and ReadBouncer becomes easier because a GPU for basecalling is no longer required. We offer various methods to extract simulation parameters from a sequencing summary file and adapt ReadFish to replicate one of their enrichment experiments. SimReadUntil's gRPC interface allows standardized interaction with a wide range of programming languages. AVAILABILITY AND IMPLEMENTATION: Code and fully worked examples are available on GitHub (https://github.com/ratschlab/sim_read_until).


Subject(s)
Algorithms , Benchmarking , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing/methods
2.
Elife ; 122023 10 12.
Article in English | MEDLINE | ID: mdl-37823551

ABSTRACT

The splicing factor SF3B1 is recurrently mutated in various tumors, including pancreatic ductal adenocarcinoma (PDAC). The impact of the hotspot mutation SF3B1K700E on the PDAC pathogenesis, however, remains elusive. Here, we demonstrate that Sf3b1K700E alone is insufficient to induce malignant transformation of the murine pancreas, but that it increases aggressiveness of PDAC if it co-occurs with mutated KRAS and p53. We further show that Sf3b1K700E already plays a role during early stages of pancreatic tumor progression and reduces the expression of TGF-ß1-responsive epithelial-mesenchymal transition (EMT) genes. Moreover, we found that SF3B1K700E confers resistance to TGF-ß1-induced cell death in pancreatic organoids and cell lines, partly mediated through aberrant splicing of Map3k7. Overall, our findings demonstrate that SF3B1K700E acts as an oncogenic driver in PDAC, and suggest that it promotes the progression of early stage tumors by impeding the cellular response to tumor suppressive effects of TGF-ß.


Subject(s)
Carcinoma, Pancreatic Ductal , Pancreatic Neoplasms , Animals , Humans , Mice , Carcinoma, Pancreatic Ductal/pathology , Cell Line, Tumor , Mutation , Pancreatic Ducts/metabolism , Pancreatic Neoplasms/pathology , Phosphoproteins/metabolism , RNA Splicing Factors/metabolism , Transcription Factors/metabolism , Transforming Growth Factor beta1/metabolism , Pancreatic Neoplasms
3.
Genome Res ; 33(7): 1208-1217, 2023 07.
Article in English | MEDLINE | ID: mdl-37072187

ABSTRACT

Sequence-to-graph alignment is crucial for applications such as variant genotyping, read error correction, and genome assembly. We propose a novel seeding approach that relies on long inexact matches rather than short exact matches, and show that it yields a better time-accuracy trade-off in settings with up to a [Formula: see text] mutation rate. We use sketches of a subset of graph nodes, which are more robust to indels, and store them in a k-nearest neighbor index to avoid the curse of dimensionality. Our approach contrasts with existing methods and highlights the important role that sketching into vector space can play in bioinformatics applications. We show that our method scales to graphs with 1 billion nodes and has quasi-logarithmic query time for queries with an edit distance of [Formula: see text] For such queries, longer sketch-based seeds yield a [Formula: see text] increase in recall compared with exact seeds. Our approach can be incorporated into other aligners, providing a novel direction for sequence-to-graph alignment.


Subject(s)
Algorithms , Computational Biology , Computational Biology/methods , Sequence Alignment , Sequence Analysis, DNA/methods
6.
iScience ; 25(11): 104993, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36299999

ABSTRACT

The MetaSUB Consortium, founded in 2015, is a global consortium with an interdisciplinary team of clinicians, scientists, bioinformaticians, engineers, and designers, with members from more than 100 countries across the globe. This network has continually collected samples from urban and rural sites including subways and transit systems, sewage systems, hospitals, and other environmental sampling. These collections have been ongoing since 2015 and have continued when possible, even throughout the COVID-19 pandemic. The consortium has optimized their workflow for the collection, isolation, and sequencing of DNA and RNA collected from these various sites and processing them for metagenomics analysis, including the identification of SARS-CoV-2 and its variants. Here, the Consortium describes its foundations, and its ongoing work to expand on this network and to focus its scope on the mapping, annotation, and prediction of emerging pathogens, mapping microbial evolution and antibiotic resistance, and the discovery of novel organisms and biosynthetic gene clusters.

7.
J Comput Biol ; 29(8): 857-866, 2022 08.
Article in English | MEDLINE | ID: mdl-35776515

ABSTRACT

With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity (RIN) score, or on postalignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5'/3' bias based on all genes in an alignment-free manner. That enables degradation assessment right after data generation and not during the analysis procedure allowing for early intervention in the sample handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores enabling the accurate identification of degraded samples.


Subject(s)
RNA Stability , RNA , Genomics , RNA/chemistry , RNA/genetics , Sequence Analysis, RNA/methods
8.
Bioinformatics ; 38(18): 4293-4300, 2022 09 15.
Article in English | MEDLINE | ID: mdl-35900151

ABSTRACT

MOTIVATION: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing. RESULTS: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. 'to separate'), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of ≈2000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03×, achieving an Adjusted Rand Index (ARI) score of ≈0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of ≈0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDOs performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7. Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants. AVAILABILITY AND IMPLEMENTATION: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing , Neoplasms , Humans , High-Throughput Nucleotide Sequencing/methods , Bayes Theorem , Sequence Analysis, DNA , Genome , Base Sequence , Neoplasms/genetics , Polymorphism, Single Nucleotide
9.
Nature ; 607(7917): 111-118, 2022 07.
Article in English | MEDLINE | ID: mdl-35732736

ABSTRACT

Natural microbial communities are phylogenetically and metabolically diverse. In addition to underexplored organismal groups1, this diversity encompasses a rich discovery potential for ecologically and biotechnologically relevant enzymes and biochemical compounds2,3. However, studying this diversity to identify genomic pathways for the synthesis of such compounds4 and assigning them to their respective hosts remains challenging. The biosynthetic potential of microorganisms in the open ocean remains largely uncharted owing to limitations in the analysis of genome-resolved data at the global scale. Here we investigated the diversity and novelty of biosynthetic gene clusters in the ocean by integrating around 10,000 microbial genomes from cultivated and single cells with more than 25,000 newly reconstructed draft genomes from more than 1,000 seawater samples. These efforts revealed approximately 40,000 putative mostly new biosynthetic gene clusters, several of which were found in previously unsuspected phylogenetic groups. Among these groups, we identified a lineage rich in biosynthetic gene clusters ('Candidatus Eudoremicrobiaceae') that belongs to an uncultivated bacterial phylum and includes some of the most biosynthetically diverse microorganisms in this environment. From these, we characterized the phospeptin and pythonamide pathways, revealing cases of unusual bioactive compound structure and enzymology, respectively. Together, this research demonstrates how microbiomics-driven strategies can enable the investigation of previously undescribed enzymes and natural products in underexplored microbial groups and environments.


Subject(s)
Biosynthetic Pathways , Microbiota , Oceans and Seas , Bacteria/classification , Bacteria/genetics , Biosynthetic Pathways/genetics , Genomics , Microbiota/genetics , Multigene Family/genetics , Phylogeny
10.
Methods Mol Biol ; 2493: 167-193, 2022.
Article in English | MEDLINE | ID: mdl-35751815

ABSTRACT

Alternative splicing (AS) is a regulatory process during mRNA maturation that shapes higher eukaryotes' complex transcriptomes. High-throughput sequencing of RNA (RNA-Seq) allows for measurements of AS transcripts at an unprecedented depth and diversity. The ever-expanding catalog of known AS events provides biological insights into gene regulation, population genetics, or in the context of disease. Here, we present an overview on the usage of SplAdder, a graph-based alternative splicing toolbox, which can integrate an arbitrarily large number of RNA-Seq alignments and a given annotation file to augment the shared annotation based on RNA-Seq evidence. The shared augmented annotation graph is then used to identify, quantify, and confirm alternative splicing events based on the RNA-Seq data. Splice graphs for individual alignments can also be tested for significant quantitative differences between other samples or groups of samples.


Subject(s)
Alternative Splicing , RNA , High-Throughput Nucleotide Sequencing , RNA/genetics , RNA-Seq , Sequence Analysis, RNA
11.
Genome Res ; 2022 May 24.
Article in English | MEDLINE | ID: mdl-35609994

ABSTRACT

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and is faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.

12.
Bioinformatics ; 37(Suppl_1): i169-i176, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252940

ABSTRACT

MOTIVATION: Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. RESULTS: In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. AVAILABILITY AND IMPLEMENTATION: RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.


Subject(s)
Algorithms , Biomedical Research , Software
15.
Nature ; 578(7793): 102-111, 2020 02.
Article in English | MEDLINE | ID: mdl-32025015

ABSTRACT

The discovery of drivers of cancer has traditionally focused on protein-coding genes1-4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers6,7, raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.


Subject(s)
Genome, Human/genetics , Mutation/genetics , Neoplasms/genetics , DNA Breaks , Databases, Genetic , Gene Expression Regulation, Neoplastic , Genome-Wide Association Study , Humans , INDEL Mutation
16.
Nature ; 578(7793): 129-136, 2020 02.
Article in English | MEDLINE | ID: mdl-32025019

ABSTRACT

Transcript alterations often result from somatic changes in cancer genomes1. Various forms of RNA alterations have been described in cancer, including overexpression2, altered splicing3 and gene fusions4; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA)5. Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed 'bridged' fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.


Subject(s)
Gene Expression Regulation, Neoplastic , Neoplasms/genetics , RNA/genetics , DNA Copy Number Variations , DNA, Neoplasm , Genome, Human , Genomics , Humans , Transcriptome
17.
J Comput Biol ; 27(4): 626-639, 2020 04.
Article in English | MEDLINE | ID: mdl-31891531

ABSTRACT

High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.


Subject(s)
Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Computational Biology , Data Compression , Humans , Molecular Sequence Annotation/methods
18.
Front Microbiol ; 11: 591093, 2020.
Article in English | MEDLINE | ID: mdl-33424794

ABSTRACT

Whole genome sequencing (WGS) enables high resolution typing of bacteria up to the single nucleotide polymorphism (SNP) level. WGS is used in clinical microbiology laboratories for infection control, molecular surveillance and outbreak analyses. Given the large palette of WGS reagents and bioinformatics tools, the Swiss clinical bacteriology community decided to conduct a ring trial (RT) to foster harmonization of NGS-based bacterial typing. The RT aimed at assessing methicillin-susceptible Staphylococcus aureus strain relatedness from WGS and epidemiological data. The RT was designed to disentangle the variability arising from differences in sample preparation, SNP calling and phylogenetic methods. Nine laboratories participated. The resulting phylogenetic tree and cluster identification were highly reproducible across the laboratories. Cluster interpretation was, however, more laboratory dependent, suggesting that an increased sharing of expertise across laboratories would contribute to further harmonization of practices. More detailed bioinformatic analyses unveiled that while similar clusters were found across laboratories, these were actually based on different sets of SNPs, differentially retained after sample preparation and SNP calling procedures. Despite this, the observed number of SNP differences between pairs of strains, an important criterion to determine strain relatedness given epidemiological information, was similar across pipelines for closely related strains when restricting SNP calls to a common core genome defined by S. aureus cgMLST schema. The lessons learned from this pilot study will serve the implementation of larger-scale RT, as a mean to have regular external quality assessments for laboratories performing WGS analyses in a clinical setting.

19.
Cell ; 178(6): 1465-1477.e17, 2019 09 05.
Article in English | MEDLINE | ID: mdl-31491388

ABSTRACT

Most human protein-coding genes are regulated by multiple, distinct promoters, suggesting that the choice of promoter is as important as its level of transcriptional activity. However, while a global change in transcription is recognized as a defining feature of cancer, the contribution of alternative promoters still remains largely unexplored. Here, we infer active promoters using RNA-seq data from 18,468 cancer and normal samples, demonstrating that alternative promoters are a major contributor to context-specific regulation of transcription. We find that promoters are deregulated across tissues, cancer types, and patients, affecting known cancer genes and novel candidates. For genes with independently regulated promoters, we demonstrate that promoter activity provides a more accurate predictor of patient survival than gene expression. Our study suggests that a dynamic landscape of active promoters shapes the cancer transcriptome, opening new diagnostic avenues and opportunities to further explore the interplay of regulatory mechanisms with transcriptional aberrations in cancer.


Subject(s)
Computational Biology/methods , Gene Expression Regulation, Neoplastic/genetics , Neoplasms/genetics , Promoter Regions, Genetic/genetics , Transcriptome/genetics , Databases, Genetic , Humans , RNA-Seq/methods
20.
Bioinformatics ; 35(3): 407-414, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30020403

ABSTRACT

Motivation: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Data Compression , Software , Algorithms , Color , Genomics , High-Throughput Nucleotide Sequencing
SELECTION OF CITATIONS
SEARCH DETAIL
...