Results 1 - 11 of 11
1.
Genome Res ; 33(7): 1154-1161, 2023 07.
Article in English | MEDLINE | ID: mdl-37558282

ABSTRACT

Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
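The window-based selection described in this abstract can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function name `minimizers`, the default lexicographic order, and the leftmost tie-breaking rule are assumptions made for the example. Any other k-mer order, such as a decycling-set-based one, would be supplied through the hypothetical `order` argument.

```python
from typing import Callable, List, Tuple


def minimizers(
    seq: str,
    k: int,
    L: int,
    order: Callable[[str], object] = lambda kmer: kmer,  # assumed default: lexicographic
) -> List[Tuple[int, str]]:
    """Return the distinct (position, k-mer) pairs chosen as the minimum
    k-mer of each L-long window of seq under the given order."""
    if not 0 < k <= L <= len(seq):
        raise ValueError("require 0 < k <= L <= len(seq)")
    w = L - k + 1  # number of k-mers in one L-long window
    selected: List[Tuple[int, str]] = []
    for start in range(len(seq) - L + 1):
        # Smallest k-mer in the window; ties go to the leftmost position
        # (an assumption of this sketch).
        best = min(
            range(start, start + w),
            key=lambda i: (order(seq[i:i + k]), i),
        )
        # Consecutive windows often share a minimizer; record it once.
        if not selected or selected[-1][0] != best:
            selected.append((best, seq[best:best + k]))
    return selected
```

For example, `minimizers("ACGTACGT", 3, 5)` selects the k-mers at positions 0, 1, and 4. The fraction of selected positions is the quantity that a better order aims to reduce.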


Subjects
Algorithms; Software; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods
2.
PLoS Comput Biol ; 18(10): e1010638, 2022 10.
Article in English | MEDLINE | ID: mdl-36306319

ABSTRACT

MOTIVATION: Sequencing long reads presents novel challenges to mapping. One such challenge is low sequence similarity between the reads and the reference, due to high sequencing error and mutation rates. This occurs, e.g., in a cancer tumor, or due to differences between strains of viruses or bacteria. A key idea in mapping algorithms is to sketch sequences with their minimizers. Recently, syncmers were introduced as an alternative sketching method that is more robust to mutations and sequencing errors. RESULTS: We introduce parameterized syncmer schemes (PSS), a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining PSS with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of PSS in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long-read data from a variety of genomes, the PSS-based algorithms, with scheme parameters selected on the basis of our theoretical analysis, reduced unmapped reads by 20-60% at high compression while usually using less memory. The advantage was more pronounced at low sequence identity. At sequence identity of 75% and medium compression, PSS-minimap had only 37% as many unmapped reads, and 8% fewer of the reads that did map were incorrectly mapped. Even at lower compression and error rates, PSS-based mapping mapped more reads than the original minimizer-based mappers as well as mappers using the original syncmer schemes. We conclude that using PSS can improve mapping of long reads in a wide range of settings.
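The syncmer idea mentioned in this abstract can be illustrated with a short Python sketch. It assumes a scheme in which a k-mer is selected when its smallest s-mer starts at one of a chosen set of offsets; this is only an illustration under that assumption, not the code integrated into minimap2 or Winnowmap2. The function name `syncmers`, the lexicographic s-mer order, and the leftmost tie-breaking are all hypothetical choices.

```python
from typing import Iterable, List, Tuple


def syncmers(seq: str, k: int, s: int, offsets: Iterable[int]) -> List[Tuple[int, str]]:
    """Select every k-mer whose smallest s-mer starts at an allowed offset.

    offsets is the parameter set of the scheme (assumed form):
    {0, k - s} corresponds to closed syncmers, a single offset to open syncmers.
    """
    if not 0 < s <= k <= len(seq):
        raise ValueError("require 0 < s <= k <= len(seq)")
    allowed = set(offsets)
    selected: List[Tuple[int, str]] = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Offset of the smallest s-mer inside this k-mer (leftmost on ties).
        min_offset = min(range(k - s + 1), key=lambda j: (kmer[j:j + s], j))
        if min_offset in allowed:
            selected.append((i, kmer))
    return selected
```

Because the decision depends only on the k-mer itself and not on its neighbors in a window, a mutation outside a selected k-mer cannot change whether it is selected, which is the property that makes this kind of sketching attractive at low sequence identity.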


Subjects
Data Compression; Software; Sequence Analysis, DNA/methods; High-Throughput Nucleotide Sequencing/methods; Data Compression/methods; Algorithms
3.
PLoS Comput Biol ; 16(4): e1007781, 2020 04.
Article in English | MEDLINE | ID: mdl-32243433

ABSTRACT

Many bacteria contain plasmids, but distinguishing contigs that originate on the plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers trained on the most current set of known plasmid sequences for different sequence lengths. We tested PlasClass sequence classification on held-out data and simulations, as well as publicly available bacterial isolates and plasmidome samples and plasmids assembled from metagenomic samples. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, allowing it to achieve higher F1 scores in classifying sequences from a wide range of datasets. PlasClass also uses significantly less time and memory. PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available under the MIT license from: https://github.com/Shamir-Lab/PlasClass.


Subjects
DNA; Plasmids; Sequence Analysis, DNA/methods; Software; Computational Biology/methods; DNA/classification; DNA/genetics; DNA, Bacterial/classification; DNA, Bacterial/genetics; Genome, Bacterial/genetics; Plasmids/classification; Plasmids/genetics
4.
Bioinformatics ; 33(14): i110-i117, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881970

ABSTRACT

MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. RESULTS: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. AVAILABILITY AND IMPLEMENTATION: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git. CONTACT: gmarcais@cs.cmu.edu or carlk@cs.cmu.edu.


Subjects
Genome, Human; Genomics/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Humans
5.
PLoS Comput Biol ; 13(10): e1005777, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28968408

ABSTRACT

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
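The universal hitting set definition given in this abstract can be checked directly for very small parameters. The sketch below is a brute-force verifier written for illustration only; it is not DOCKS and does not construct a UHS. The function name `is_universal_hitting_set` and the `alphabet` parameter are assumptions, and the enumeration over all L-long sequences is exponential in L, so it is only practical for toy values.

```python
from itertools import product
from typing import Iterable


def is_universal_hitting_set(
    kmers: Iterable[str], k: int, L: int, alphabet: str = "ACGT"
) -> bool:
    """Return True if every L-long sequence over alphabet contains
    at least one k-mer from kmers (the UHS property)."""
    if not 0 < k <= L:
        raise ValueError("require 0 < k <= L")
    kmer_set = set(kmers)
    # Enumerate all |alphabet|**L sequences; feasible only for tiny L.
    for letters in product(alphabet, repeat=L):
        seq = "".join(letters)
        if not any(seq[i:i + k] in kmer_set for i in range(L - k + 1)):
            return False  # seq is an L-long sequence that the set misses
    return True
```

On a two-letter alphabet with k = 2 and L = 3, the set {AA, CC, AC} passes this check, while {AA, CC} fails because the sequence ACA contains neither k-mer.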


Subjects
Algorithms; Computational Biology/methods; Genome, Bacterial; Genome, Human; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Animals; Caenorhabditis elegans/genetics; Computer Heuristics; Humans
6.
Nat Commun ; 15(1): 3147, 2024 Apr 11.
Article in English | MEDLINE | ID: mdl-38605009

ABSTRACT

Plasmids are pivotal in driving bacterial evolution through horizontal gene transfer. Here, we investigated 3467 human gut microbiome samples across continents and disease states, analyzing 11,086 plasmids. Our analyses reveal that plasmid dispersal is predominantly stochastic, indicating neutral processes as the primary driver of their wide distribution. We find that only 20-25% of plasmid DNA is being selected in various disease states, constraining its distribution across hosts. Selective pressures shape specific plasmid segments with distinct ecological functions, influenced by plasmid mobilization lifestyle, antibiotic usage, and inflammatory gut diseases. Notably, these elements are more commonly shared within groups of individuals with similar health conditions, such as Inflammatory Bowel Disease (IBD), regardless of geographic location across continents. These segments contain essential genes such as iron transport mechanisms, a distinctive gut signature of IBD that impacts the severity of inflammation. Our findings shed light on mechanisms driving plasmid dispersal and selection in the human gut, highlighting their role as carriers of vital gene pools impacting bacterial hosts and ecosystem dynamics.


Subjects
Ecosystem; Inflammatory Bowel Diseases; Humans; Plasmids/genetics; Bacteria/genetics; Anti-Bacterial Agents; Gene Transfer, Horizontal; Inflammatory Bowel Diseases/genetics
7.
J Comput Biol ; 29(8): 825-838, 2022 08.
Article in English | MEDLINE | ID: mdl-35527644

ABSTRACT

The rapid, continuous growth of deep sequencing experiments requires development and improvement of many bioinformatic applications for analysis of large sequencing data sets, including k-mer counting and assembly. Several applications reduce memory usage by binning sequences. Binning is done by using minimizer schemes, which rely on a specific order of the minimizers. It has been demonstrated that the choice of the order has a major impact on the performance of the applications. Here we introduce a method for tailoring the order to the data set. Our method repeatedly samples the data set and modifies the order so as to flatten the k-mer load distribution across minimizers. We integrated our method into Gerbil, a state-of-the-art memory-efficient k-mer counter, and were able to reduce its memory footprint by 30%-50% for large k, with only a minor increase in runtime. Our tests also showed that the orders generated by our method gave superior results when transferred across data sets from the same species, with little or no order change. This enables memory reduction with essentially no increase in runtime.


Subjects
Algorithms; Software; Computational Biology/methods; Sequence Analysis, DNA/methods
8.
Microbiome ; 9(1): 144, 2021 06 25.
Article in English | MEDLINE | ID: mdl-34172093

ABSTRACT

BACKGROUND: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is insufficient computational tools enabling the analysis of plasmids in metagenomic samples. RESULTS: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets. CONCLUSIONS: SCAPP is an easy-to-use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated such as a human gut plasmidome. SCAPP is open-source software available from: https://github.com/Shamir-Lab/SCAPP.


Subjects
Metagenome; Metagenomics; Algorithms; Humans; Plasmids/genetics; Sequence Analysis, DNA; Software
9.
J Comput Biol ; 24(6): 547-557, 2017 Jun.
Article in English | MEDLINE | ID: mdl-27828710

ABSTRACT

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3-1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
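The overlap idea in this abstract can be illustrated with a small Python sketch. It pairs a plain Bloom filter with a query that accepts a k-mer only if some overlapping neighbor is also reported present. The abstract does not spell out the variants, so this neighbor rule is an assumption chosen for illustration and not the authors' kBF implementation. The class and function names, the SHA-256 based hashing, and the DNA alphabet are likewise hypothetical choices.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned for speed)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def contains_with_neighbor(bf: BloomFilter, kmer: str, alphabet: str = "ACGT") -> bool:
    """Report kmer as present only if the filter also reports at least one
    k-mer overlapping it by k-1 characters (assumed neighbor rule)."""
    if kmer not in bf:
        return False
    left = (c + kmer[:-1] for c in alphabet)   # possible predecessors
    right = (kmer[1:] + c for c in alphabet)   # possible successors
    return any(n in bf for n in left) or any(n in bf for n in right)
```

A false positive now requires a second, independent false hit on a neighbor, which is why the overlap check can lower the FPR without extra memory. The price is the additional lookups per query, and a k-mer that has no stored neighbor would be rejected under this rule.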


Subjects
Algorithms; Computational Biology/methods; Sequence Analysis, DNA/methods; Computer Simulation; Humans; Probability; Software
11.
Annu Rev Public Health ; 27: 103-24, 2006.
Article in English | MEDLINE | ID: mdl-16533111

ABSTRACT

In this review, we provide an introduction to the topics of environmental justice and environmental inequality. We provide an overview of the dimensions of unequal exposures to environmental pollution (environmental inequality), followed by a discussion of the theoretical literature that seeks to explain the origins of this phenomenon. We also consider the impact of the environmental justice movement in the United States and the role that federal and state governments have developed to address environmental inequalities. We conclude that more research is needed that links environmental inequalities with public health outcomes.


Subjects
Environmental Health/ethics; Environmental Pollution/ethics; Prejudice; Social Justice; Environmental Health/economics; Environmental Pollution/economics; Government; Health Status; Humans; Social Class; Socioeconomic Factors; United States