Search | BVS Regional Portal

Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries.

Mehringer, Svenja; Seiler, Enrico; Droop, Felix; Darvish, Mitra; Rahn, René; Vingron, Martin; Reinert, Knut.

Genome Biol ; 24(1): 131, 2023 05 31.

Article in English | MEDLINE | ID: mdl-37259161

ABSTRACT

We present a novel data structure for searching sequences in large databases: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it could serve as the underlying engine for many applications. We show that the HIBF is superior in build time, index size, and search time while achieving a comparable or better accuracy compared to other state-of-the-art tools. The HIBF builds an index up to 211 times faster, using up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129.

Subjects

Algorithms, Software

Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments.

Darvish, Mitra; Seiler, Enrico; Mehringer, Svenja; Rahn, René; Reinert, Knut.

Bioinformatics ; 38(17): 4100-4108, 2022 09 02.

Article in English | MEDLINE | ID: mdl-35801930

ABSTRACT

MOTIVATION: The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. RESULTS: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query. AVAILABILITY AND IMPLEMENTATION: https://github.com/seqan/needle. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms, Software, DNA Sequence Analysis
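The Needle abstract above describes storing k-mers in several filters according to their multiplicity and then using those filters to quantify a query. The following is only a minimal sketch of that idea, not the Needle implementation: plain Python sets stand in for the interleaved Bloom filters, and the names `build_levels`, `estimate_level`, the thresholds, and the `min_fraction` cutoff are all hypothetical choices made for illustration.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_levels(reads: list[str], k: int, thresholds: list[int]) -> dict[int, set[str]]:
    """For each threshold t, keep the k-mers that occur at least t times.

    Assumption: a set per level stands in for one Bloom filter per level.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {t: {km for km, c in counts.items() if c >= t} for t in thresholds}


def estimate_level(levels: dict[int, set[str]], transcript: str, k: int,
                   min_fraction: float = 0.5) -> int:
    """Return the highest threshold at which enough query k-mers are present.

    Returns 0 if no level reaches min_fraction (a hypothetical cutoff).
    """
    query = kmers(transcript, k)
    best = 0
    for t in sorted(levels):
        hits = sum(km in levels[t] for km in query)
        if query and hits / len(query) >= min_fraction:
            best = t
    return best


# Toy experiment: one sequence seen five times, another seen once.
reads = ["ACGTAC"] * 5 + ["TTTTGG"]
levels = build_levels(reads, k=4, thresholds=[1, 4])
print(estimate_level(levels, "ACGTAC", k=4))  # highly expressed sequence
print(estimate_level(levels, "TTTTGG", k=4))  # lowly expressed sequence
```

With real Bloom filters each membership test could return a false positive, so an estimate of this kind would be an upper-biased approximation rather than an exact count.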

Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences.

Seiler, Enrico; Mehringer, Svenja; Darvish, Mitra; Turc, Etienne; Reinert, Knut.

iScience ; 24(7): 102782, 2021 Jul 23.

Article in English | MEDLINE | ID: mdl-34337360

ABSTRACT

We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
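The Raptor abstract above mentions winnowing minimizers as a way to pick a set of representative k-mers. As a rough illustration of that general technique (not Raptor's own code), the sketch below keeps the smallest k-mer from every window of `w` consecutive k-mers. Lexicographic order is an assumption made here for readability; practical tools typically order k-mers by a hash value, and the function name `minimizers` and its parameters are hypothetical.

```python
def minimizers(seq: str, k: int, w: int) -> set[str]:
    """Return the minimizers of seq.

    A minimizer is the smallest k-mer (here: lexicographically, an
    assumption) among each window of w consecutive k-mers.
    """
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmer_list:
        return set()
    if len(kmer_list) < w:
        # Sequence shorter than one full window: treat it as a single window.
        return {min(kmer_list)}
    return {min(kmer_list[i:i + w]) for i in range(len(kmer_list) - w + 1)}


seq = "ACGTACGT"
all_kmers = {seq[i:i + 3] for i in range(len(seq) - 2)}
selected = minimizers(seq, k=3, w=3)
print(len(all_kmers), len(selected))  # fewer k-mers need to be stored
```

Because adjacent windows often share the same smallest k-mer, the selected set is usually much smaller than the set of all k-mers, which is what makes minimizers attractive for reducing index size.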

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences.

Piro, Vitor C; Dadi, Temesgen H; Seiler, Enrico; Reinert, Knut; Renard, Bernhard Y.

Bioinformatics ; 36(Suppl_1): i12-i20, 2020 07 01.

Article in English | MEDLINE | ID: mdl-32657362

ABSTRACT

MOTIVATION: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. RESULTS: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. AVAILABILITY AND IMPLEMENTATION: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms, Metagenomics, Archaea, DNA Sequence Analysis, Software

Where did you come from, where did you go: Refining metagenomic analysis tools for horizontal gene transfer characterisation.

Seiler, Enrico; Trappe, Kathrin; Renard, Bernhard Y.

PLoS Comput Biol ; 15(7): e1007208, 2019 07.

Article in English | MEDLINE | ID: mdl-31335917

ABSTRACT

Horizontal gene transfer (HGT) has changed the way we regard evolution. Instead of waiting for the next generation to establish new traits, especially bacteria are able to take a shortcut via HGT that enables them to pass on genes from one individual to another, even across species boundaries. The tool Daisy offers the first HGT detection approach based on read mapping that provides complementary evidence compared to existing methods. However, Daisy relies on the acceptor and donor organism involved in the HGT being known. We introduce DaisyGPS, a mapping-based pipeline that is able to identify acceptor and donor reference candidates of an HGT event based on sequencing reads. Acceptor and donor identification is akin to species identification in metagenomic samples based on sequencing reads, a problem addressed by metagenomic profiling tools. However, acceptor and donor references have certain properties such that these methods cannot be directly applied. DaisyGPS uses MicrobeGPS, a metagenomic profiling tool tailored towards estimating the genomic distance between organisms in the sample and the reference database. We enhance the underlying scoring system of MicrobeGPS to account for the sequence patterns in terms of mapping coverage of an acceptor and donor involved in an HGT event, and report a ranked list of reference candidates. These candidates can then be further evaluated by tools like Daisy to establish HGT regions. We successfully validated our approach on both simulated and real data, and show its benefits in an investigation of an outbreak involving Methicillin-resistant Staphylococcus aureus data.

Subjects

Molecular Evolution, Horizontal Gene Transfer, Metagenome, Metagenomics/methods, Genetic Models, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Disease Outbreaks/statistics & numerical data, Genetic Variation, Bacterial Genome, Helicobacter pylori/genetics, Humans, Metagenomics/statistics & numerical data, Methicillin-Resistant Staphylococcus aureus/genetics, Mutation, Staphylococcal Infections/epidemiology, Staphylococcal Infections/microbiology

DREAM-Yara: an exact read mapper for very large databases with short update time.

Dadi, Temesgen Hailemariam; Siragusa, Enrico; Piro, Vitor C; Andrusch, Andreas; Seiler, Enrico; Renard, Bernhard Y; Reinert, Knut.

Bioinformatics ; 34(17): i766-i772, 2018 09 01.

Article in English | MEDLINE | ID: mdl-30423080

ABSTRACT

MOTIVATION: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. >10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of the NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive to compute a single full-text index on standard machines. Even on large memory machines, computing such an index takes about 1 day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times. RESULTS: To solve the index construction and update problem we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of an approximate search distributor via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM framework. AVAILABILITY AND IMPLEMENTATION: https://gitlab.com/pirovc/dream_yara/.

Subjects

Factual Databases, Software, Humans, Time Factors
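The DREAM-Yara abstract above describes combining several Bloom filters into an interleaved Bloom filter so that reads can be excluded from database parts where they cannot match. The sketch below is one possible way to illustrate that idea under stated assumptions, not the authors' implementation: each bit position stores one Python integer whose bit `b` belongs to bin `b`, the hash functions are salted BLAKE2b digests, and the class name, `candidate_bins`, and the `min_hits` threshold are hypothetical.

```python
import hashlib


class InterleavedBloomFilter:
    """Several equally sized Bloom filters stored interleaved (a sketch)."""

    def __init__(self, bins: int, bits_per_bin: int, hashes: int = 2):
        self.bins = bins
        self.size = bits_per_bin
        self.hashes = hashes
        # rows[p] holds bit position p of every bin; bit b belongs to bin b.
        self.rows = [0] * bits_per_bin

    def _positions(self, item: str):
        # One salted hash per hash function (choice of BLAKE2b is an assumption).
        for seed in range(self.hashes):
            digest = hashlib.blake2b(item.encode(),
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.size

    def insert(self, item: str, bin_index: int) -> None:
        for p in self._positions(item):
            self.rows[p] |= 1 << bin_index

    def query(self, item: str) -> list[int]:
        """Return the bins that may contain item (false positives possible)."""
        mask = (1 << self.bins) - 1
        for p in self._positions(item):
            mask &= self.rows[p]  # one AND answers the question for all bins
        return [b for b in range(self.bins) if (mask >> b) & 1]


def candidate_bins(ibf: InterleavedBloomFilter, read: str, k: int,
                   min_hits: int) -> list[int]:
    """Bins in which at least min_hits k-mers of the read were found."""
    hits = [0] * ibf.bins
    for i in range(len(read) - k + 1):
        for b in ibf.query(read[i:i + k]):
            hits[b] += 1
    return [b for b, h in enumerate(hits) if h >= min_hits]


ibf = InterleavedBloomFilter(bins=4, bits_per_bin=1024)
reference = "ACGTTGCAAGCTTAGC"  # toy reference assigned to bin 1
for i in range(len(reference) - 3):
    ibf.insert(reference[i:i + 4], bin_index=1)
print(candidate_bins(ibf, "GTTGCAAG", k=4, min_hits=5))
```

A read would then only be passed to the indices of the reported bins; bins that are not reported cannot contain enough of its k-mers, since a Bloom filter never produces false negatives.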
