Pesquisa | Portal Regional da BVS

1.

Hong, Aaron; Oliva, Marco; Köppl, Dominik; Bannai, Hideo; Boucher, Christina; Gagie, Travis.

Algorithms Mol Biol ; 19(1): 15, 2024 Apr 10.

Artigo em Inglês | MEDLINE | ID: mdl-38600518

RESUMO

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing-which takes parameters that let us tune the average length of the phrases-instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for PFP - FM is available at https://github.com/AaronHong1024/afm .

2.

Movi: a fast and cache-efficient full-text pangenome index.

Zakeri, Mohsen; Brown, Nathaniel K; Ahmed, Omar Y; Gagie, Travis; Langmead, Ben.

bioRxiv ; 2024 Feb 15.

Artigo em Inglês | MEDLINE | ID: mdl-37961660

RESUMO

Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While the size of the Movi's index is larger than the r-index, it scales at a smaller rate for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification - such as pseudo-matching lengths and backward search - at least ten times faster than the fastest available methods, and in some cases more than 30-fold faster. Movi achieves this speed by leveraging the move structure's strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. We achieve still further speed improvements by using memory prefetching to attain a degree of latency hiding that would be difficult with other index structures like the r-index. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.

3.

PFP-FM: An Accelerated FM-index.

Hong, Aaron; Oliva, Marco; Köppl, Dominik; Bannai, Hideo; Boucher, Christina; Gagie, Travis.

Res Sq ; 2023 Oct 30.

Artigo em Inglês | MEDLINE | ID: mdl-37961504

RESUMO

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing-which takes parameters that let us tune the average length of the phrases-instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory. The source code for PFP-FM is available at https://github.com/marco-oliva/afm.

4.

µ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data.

Cozzi, Davide; Rossi, Massimiliano; Rubinacci, Simone; Gagie, Travis; Köppl, Dominik; Boucher, Christina; Bonizzoni, Paola.

Bioinformatics ; 39(9)2023 09 02.

Artigo em Inglês | MEDLINE | ID: mdl-37688560

RESUMO

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.

Assuntos

Bancos de Espécimes Biológicos , Haplótipos , Sequenciamento Completo do Genoma , Reino Unido

5.

SPUMONI 2: improved classification using a pangenome index of minimizer digests.

Ahmed, Omar Y; Rossi, Massimiliano; Gagie, Travis; Boucher, Christina; Langmead, Ben.

Genome Biol ; 24(1): 122, 2023 05 18.

Artigo em Inglês | MEDLINE | ID: mdl-37202771

RESUMO

Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.

Assuntos

Algoritmos , Genômica , Metagenômica , Bases de Dados Factuais , Análise de Sequência de DNA

6.

Recursive Prefix-Free Parsing for Building Big BWTs.

Oliva, Marco; Gagie, Travis; Boucher, Christina.

bioRxiv ; 2023 Jan 20.

Artigo em Inglês | MEDLINE | ID: mdl-36712109

RESUMO

Prefix-free parsing is useful for a wide variety of purposes including building the BWT, constructing the suffix array, and supporting compressed suffix tree operations. This linear-time algorithm uses a rolling hash to break an input string into substrings, where the resulting set of unique substrings has the property that none of the substrings' suffixes (of more than a certain length) is a proper prefix of any of the other substrings' suffixes. Hence, the name prefix-free parsing. This set of unique substrings is referred to as the dictionary . The parse is the ordered list of dictionary strings that defines the input string. Prior empirical results demonstrated the size of the parse is more burdensome than the size of the dictionary for large, repetitive inputs. Hence, the question arises as to how the size of the parse can scale satisfactorily with the input. Here, we describe our algorithm, recursive prefix-free parsing , which accomplishes this by computing the prefix-free parse of the parse produced by prefix-free parsing an input string. Although conceptually simple, building the BWT from the parse-of-the-parse and the dictionaries is significantly more challenging. We solve and implement this problem. Our experimental results show that recursive prefix-free parsing is extremely effective in reducing the memory needed to build the run-length encoded BWT of the input. Our implementation is open source and available at https://github.com/marco-oliva/r-pfbwt .

7.

Syotti: scalable bait design for DNA enrichment.

Alanko, Jarno N; Slizovskiy, Ilya B; Lokshtanov, Daniel; Gagie, Travis; Noyes, Noelle R; Boucher, Christina.

Bioinformatics ; 38(Suppl 1): i177-i184, 2022 06 24.

Artigo em Inglês | MEDLINE | ID: mdl-35758776

RESUMO

MOTIVATION: Bait enrichment is a protocol that is becoming increasingly ubiquitous as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') are designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait-enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. RESULTS: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprised of 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 17% of the data in 72 h. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnalanko/syotti. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Algoritmos , Software , DNA , Humanos , Metagenômica/métodos , Análise de Sequência de DNA/métodos

8.

MONI: A Pangenomic Index for Finding Maximal Exact Matches.

Rossi, Massimiliano; Oliva, Marco; Langmead, Ben; Gagie, Travis; Boucher, Christina.

J Comput Biol ; 29(2): 169-187, 2022 02.

Artigo em Inglês | MEDLINE | ID: mdl-35041495

RESUMO

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot support efficiently popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding-but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation called MONI can rapidly find MEMs between reads and large-sequence collections of highly repetitive sequences. Compared with other read aligners-PuffAligner, Bowtie2, BWA-MEM, and CHIC- MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.

Assuntos

Algoritmos , Genômica/estatística & dados numéricos , Alinhamento de Sequência/estatística & dados numéricos , Software , Biologia Computacional , Bases de Dados Genéticas/estatística & dados numéricos , Genoma Bacteriano , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Humanos , Salmonella/genética , Análise de Sequência de DNA/estatística & dados numéricos , Análise de Ondaletas

9.

Finding Maximal Exact Matches Using the r-Index.

Rossi, Massimiliano; Oliva, Marco; Bonizzoni, Paola; Langmead, Ben; Gagie, Travis; Boucher, Christina.

J Comput Biol ; 29(2): 188-194, 2022 02.

Artigo em Inglês | MEDLINE | ID: mdl-35041518

RESUMO

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.

Assuntos

Algoritmos , Alinhamento de Sequência/estatística & dados numéricos , Software , Biologia Computacional , Bases de Dados Genéticas/estatística & dados numéricos , Genoma Humano , Genômica/estatística & dados numéricos , Humanos , Análise de Sequência de DNA/estatística & dados numéricos

10.

PHONI: Streamed Matching Statistics with Multi-Genome References.

Boucher, Christina; Gagie, Travis; Tomohiro, I; Köppl, Dominik; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo; Pacheco, Alejandro; Rossi, Massimiliano.

Proc Data Compress Conf ; 2021: 193-202, 2021 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-34778549

RESUMO

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. Our code is available at https://github.com/koeppl/phoni.

11.

Buffering updates enables efficient dynamic de Bruijn graphs.

Alanko, Jarno; Alipanahi, Bahar; Settle, Jonathen; Boucher, Christina; Gagie, Travis.

Comput Struct Biotechnol J ; 19: 4067-4078, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34377371

RESUMO

MOTIVATION: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. RESULTS: With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al.,2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a), relies on dynamic bit vectors which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.

12.

Pan-genomic matching statistics for targeted nanopore sequencing.

Ahmed, Omar; Rossi, Massimiliano; Kovaka, Sam; Schatz, Michael C; Gagie, Travis; Boucher, Christina; Langmead, Ben.

iScience ; 24(6): 102696, 2021 Jun 25.

Artigo em Inglês | MEDLINE | ID: mdl-34195571

RESUMO

Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject "nontarget" DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing using efficient pan-genome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has similar accuracy as minimap2 when both are run against an index containing many strains per species. However SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 16 to 4 times smaller than those of minimap2, respectively. This could enable accurate targeted sequencing even when the targeted strains have not necessarily been sequenced or assembled previously.

13.

PFP Compressed Suffix Trees.

Boucher, Christina; Cvacho, Ondrej; Gagie, Travis; Holub, Jan; Manzini, Giovanni; Navarro, Gonzalo; Rossi, Massimiliano.

Proc Worksh Algorithm Eng Exp ; 2021: 60-72, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-35355938

RESUMO

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its suffix array (SA), of its inverse (ISA), of its BWT, and of its longest common prefix array (LCP); and computing minima over ranges and next/previous smaller value queries over the LCP. Our experimental results show that the PFP suffix tree can be efficiently constructed for very large repetitive datasets and that its operations perform competitively with other compressed suffix trees that can only handle much smaller datasets.

14.

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.

Kuhnle, Alan; Mun, Taher; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni.

J Comput Biol ; 27(4): 500-513, 2020 04.

Artigo em Inglês | MEDLINE | ID: mdl-32181684

RESUMO

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that-when used with the rank data structure-allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.

Assuntos

Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Alinhamento de Sequência/métodos , Software , Algoritmos , Genoma Humano/genética , Humanos , Análise de Sequência de DNA/métodos

15.

Matching Reads to Many Genomes with the r-Index.

Mun, Taher; Kuhnle, Alan; Boucher, Christina; Gagie, Travis; Langmead, Ben; Manzini, Giovanni.

J Comput Biol ; 27(4): 514-518, 2020 04.

Artigo em Inglês | MEDLINE | ID: mdl-32181686

RESUMO

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This article shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on an FASTA file to build an r-index for that file; and how to query that index with ri-align.

Assuntos

Genoma/genética , Genômica , Análise de Sequência de DNA/métodos , Bases de Dados Genéticas , Humanos , Alinhamento de Sequência/métodos , Software

16.

Prefix-free parsing for building big BWTs.

Boucher, Christina; Gagie, Travis; Kuhnle, Alan; Langmead, Ben; Manzini, Giovanni; Mun, Taher.

Algorithms Mol Biol ; 14: 13, 2019.

Artigo em Inglês | MEDLINE | ID: mdl-31149025

RESUMO

High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive-a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.

17.

Practical dynamic de Bruijn graphs.

Crawford, Victoria G; Kuhnle, Alan; Boucher, Christina; Chikhi, Rayan; Gagie, Travis.

Bioinformatics ; 34(24): 4189-4195, 2018 12 15.

Artigo em Inglês | MEDLINE | ID: mdl-29939217

RESUMO

Motivation: The de Bruijn graph is fundamental to the analysis of next generation sequencing data and so, as datasets of DNA reads grow rapidly, it becomes more important to represent de Bruijn graphs compactly while still supporting fast assembly. Previous implementations of compact de Bruijn graphs have not supported node or edge deletion, however, which is important for pruning spurious elements from the graph. Results: Belazzougui et al. (2016b) recently proposed a compact and fully dynamic representation, which supports exact membership queries and insertions and deletions of both nodes and edges. In this paper, we give a practical implementation of their data structure, supporting exact membership queries and fully dynamic edge operations, as well as limited support for dynamic node operations. We demonstrate experimentally that its performance is comparable to that of state-of-the-art implementations based on Bloom filters. Availability and implementation: Our source-code is publicly available at https://github.com/csirac/dynamicDBG under an open-source license.

Assuntos

Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Biologia Computacional , Deleção de Sequência

18.

Relative Suffix Trees.

Farruggia, Andrea; Gagie, Travis; Navarro, Gonzalo; Puglisi, Simon J; Sirén, Jouni.

Comput J ; 61(5): 773-788, 2018 May.

Artigo em Inglês | MEDLINE | ID: mdl-29795706

RESUMO

Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.

19.

Wheeler graphs: A framework for BWT-based data structures.

Gagie, Travis; Manzini, Giovanni; Sirén, Jouni.

Theor Comput Sci ; 698: 67-78, 2017 Oct 25.

Artigo em Inglês | MEDLINE | ID: mdl-29276331

RESUMO

The famous Burrows-Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more. We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly. We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.

20.

Document retrieval on repetitive string collections.

Gagie, Travis; Hartikainen, Aleksi; Karhu, Kalle; Kärkkäinen, Juha; Navarro, Gonzalo; Puglisi, Simon J; Sirén, Jouni.

Inf Retr Boston ; 20(3): 253-291, 2017.

Artigo em Inglês | MEDLINE | ID: mdl-28596702

RESUMO

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA