Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 28
Filtrar
1.
Bioinformatics ; 39(9)2023 09 02.
Artigo em Inglês | MEDLINE | ID: mdl-37688560

RESUMO

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.


Assuntos
Bancos de Espécimes Biológicos , Haplótipos , Sequenciamento Completo do Genoma , Reino Unido
2.
Bioinformatics ; 38(Suppl 1): i177-i184, 2022 06 24.
Artigo em Inglês | MEDLINE | ID: mdl-35758776

RESUMO

MOTIVATION: Bait enrichment is a protocol that is becoming increasingly ubiquitous as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') are designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait-enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. RESULTS: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprised of 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 17% of the data in 72 h. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnalanko/syotti. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Software , DNA , Humanos , Metagenômica/métodos , Análise de Sequência de DNA/métodos
3.
Bioinformatics ; 34(24): 4189-4195, 2018 12 15.
Artigo em Inglês | MEDLINE | ID: mdl-29939217

RESUMO

Motivation: The de Bruijn graph is fundamental to the analysis of next generation sequencing data and so, as datasets of DNA reads grow rapidly, it becomes more important to represent de Bruijn graphs compactly while still supporting fast assembly. Previous implementations of compact de Bruijn graphs have not supported node or edge deletion, however, which is important for pruning spurious elements from the graph. Results: Belazzougui et al. (2016b) recently proposed a compact and fully dynamic representation, which supports exact membership queries and insertions and deletions of both nodes and edges. In this paper, we give a practical implementation of their data structure, supporting exact membership queries and fully dynamic edge operations, as well as limited support for dynamic node operations. We demonstrate experimentally that its performance is comparable to that of state-of-the-art implementations based on Bloom filters. Availability and implementation: Our source-code is publicly available at https://github.com/csirac/dynamicDBG under an open-source license.


Assuntos
Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Biologia Computacional , Deleção de Sequência
4.
Bioinformatics ; 33(20): 3181-3187, 2017 Oct 15.
Artigo em Inglês | MEDLINE | ID: mdl-28200001

RESUMO

MOTIVATION: In 2012, Iqbal et al. introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph, which is aimed at 'detecting and genotyping simple and complex genetic variants in an individual or population'. Because they are intended to be applied to massive population level data, it is essential that the graphs be represented efficiently. Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations. RESULTS: Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible. AVAILABILITY AND IMPLEMENTATION: https://github.com/cosmo-team/cosmo/tree/VARI. CONTACT: martin.muggli@colostate.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Técnicas de Genotipagem/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Bactérias/genética , Eucariotos/genética
5.
Theor Comput Sci ; 698: 67-78, 2017 Oct 25.
Artigo em Inglês | MEDLINE | ID: mdl-29276331

RESUMO

The famous Burrows-Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more. We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly. We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.

6.
Front Bioinform ; 4: 1391086, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-39011297

RESUMO

We generalize a problem of finding maximum-scoring segment sets, previously studied by Csurös (IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1, 139-150), from sequences to graphs. Namely, given a vertex-weighted graph G and a non-negative startup penalty c, we can find a set of vertex-disjoint paths in G with maximum total score when each path's score is its vertices' total weight minus c. We call this new problem maximum-scoring path sets (MSPS). We present an algorithm that has a linear-time complexity for graphs with a constant treewidth. Generalization from sequences to graphs allows the algorithm to be used on pangenome graphs representing several related genomes and can be seen as a common abstraction for several biological problems on pangenomes, including searching for CpG islands, ChIP-seq data analysis, analysis of region enrichment for functional elements, or simple chaining problems.

7.
bioRxiv ; 2024 Feb 15.
Artigo em Inglês | MEDLINE | ID: mdl-37961660

RESUMO

Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While the size of the Movi's index is larger than the r-index, it scales at a smaller rate for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification - such as pseudo-matching lengths and backward search - at least ten times faster than the fastest available methods, and in some cases more than 30-fold faster. Movi achieves this speed by leveraging the move structure's strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. We achieve still further speed improvements by using memory prefetching to attain a degree of latency hiding that would be difficult with other index structures like the r-index. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.

8.
Algorithms Mol Biol ; 19(1): 15, 2024 Apr 10.
Artigo em Inglês | MEDLINE | ID: mdl-38600518

RESUMO

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing-which takes parameters that let us tune the average length of the phrases-instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for PFP - FM is available at https://github.com/AaronHong1024/afm .

9.
bioRxiv ; 2024 Jun 02.
Artigo em Inglês | MEDLINE | ID: mdl-38854079

RESUMO

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.

10.
bioRxiv ; 2023 Jan 20.
Artigo em Inglês | MEDLINE | ID: mdl-36712109

RESUMO

Prefix-free parsing is useful for a wide variety of purposes including building the BWT, constructing the suffix array, and supporting compressed suffix tree operations. This linear-time algorithm uses a rolling hash to break an input string into substrings, where the resulting set of unique substrings has the property that none of the substrings' suffixes (of more than a certain length) is a proper prefix of any of the other substrings' suffixes. Hence, the name prefix-free parsing. This set of unique substrings is referred to as the dictionary . The parse is the ordered list of dictionary strings that defines the input string. Prior empirical results demonstrated the size of the parse is more burdensome than the size of the dictionary for large, repetitive inputs. Hence, the question arises as to how the size of the parse can scale satisfactorily with the input. Here, we describe our algorithm, recursive prefix-free parsing , which accomplishes this by computing the prefix-free parse of the parse produced by prefix-free parsing an input string. Although conceptually simple, building the BWT from the parse-of-the-parse and the dictionaries is significantly more challenging. We solve and implement this problem. Our experimental results show that recursive prefix-free parsing is extremely effective in reducing the memory needed to build the run-length encoded BWT of the input. Our implementation is open source and available at https://github.com/marco-oliva/r-pfbwt .

11.
Genome Biol ; 24(1): 122, 2023 05 18.
Artigo em Inglês | MEDLINE | ID: mdl-37202771

RESUMO

Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.


Assuntos
Algoritmos , Genômica , Metagenômica , Bases de Dados Factuais , Análise de Sequência de DNA
12.
Res Sq ; 2023 Oct 30.
Artigo em Inglês | MEDLINE | ID: mdl-37961504

RESUMO

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing-which takes parameters that let us tune the average length of the phrases-instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory. The source code for PFP-FM is available at https://github.com/marco-oliva/afm.

13.
Proc Data Compress Conf ; 2023: 150-159, 2023 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-38832320

RESUMO

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree - notably, the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.

14.
Proc Data Compress Conf ; 2023: 268-277, 2023 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-38818281

RESUMO

MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these queries which significantly speeds up MONI in practice while only slightly increasing its size.

15.
J Comput Biol ; 29(2): 169-187, 2022 02.
Artigo em Inglês | MEDLINE | ID: mdl-35041495

RESUMO

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot support efficiently popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding-but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation called MONI can rapidly find MEMs between reads and large-sequence collections of highly repetitive sequences. Compared with other read aligners-PuffAligner, Bowtie2, BWA-MEM, and CHIC- MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.


Assuntos
Algoritmos , Genômica/estatística & dados numéricos , Alinhamento de Sequência/estatística & dados numéricos , Software , Biologia Computacional , Bases de Dados Genéticas/estatística & dados numéricos , Genoma Bacteriano , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Humanos , Salmonella/genética , Análise de Sequência de DNA/estatística & dados numéricos , Análise de Ondaletas
16.
J Comput Biol ; 29(2): 188-194, 2022 02.
Artigo em Inglês | MEDLINE | ID: mdl-35041518

RESUMO

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.


Assuntos
Algoritmos , Alinhamento de Sequência/estatística & dados numéricos , Software , Biologia Computacional , Bases de Dados Genéticas/estatística & dados numéricos , Genoma Humano , Genômica/estatística & dados numéricos , Humanos , Análise de Sequência de DNA/estatística & dados numéricos
17.
Proc Data Compress Conf ; 2022: 93-102, 2022 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-38812828

RESUMO

Generating pangenomic datasets is becoming increasingly common but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enormously beneficial to the community. In this paper, we present a method, which we refer to as RePFP-CST, for building CSTs in a manner that is scalable. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and to prune from the prefix-free parse (PFP) phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required for the construction of the CST, and the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.

18.
Comput Struct Biotechnol J ; 19: 4067-4078, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-34377371

RESUMO

MOTIVATION: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. RESULTS: With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al.,2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a), relies on dynamic bit vectors which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.

19.
iScience ; 24(6): 102696, 2021 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-34195571

RESUMO

Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject "nontarget" DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing using efficient pan-genome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has similar accuracy as minimap2 when both are run against an index containing many strains per species. However SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 16 to 4 times smaller than those of minimap2, respectively. This could enable accurate targeted sequencing even when the targeted strains have not necessarily been sequenced or assembled previously.

20.
Proc Worksh Algorithm Eng Exp ; 2021: 60-72, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-35355938

RESUMO

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA