1.
Genome Res ; 33(7): 1061-1068, 2023 07.
Article in English | MEDLINE | ID: mdl-37344105

ABSTRACT

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail, to effectively warn users of FracMinHash by indicating the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely compared with traditional MinHash, and the point estimates and confidence intervals perform significantly better in estimating mutation distances.


Subject(s)
Biological Evolution , Mutation Rate , Confidence Intervals , Metagenome , Metagenomics/methods
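
The FracMinHash idea in the abstract above can be illustrated with a minimal Python sketch. It is not the sourmash implementation. The helper names (`kmer_set`, `hash64`, `frac_sketch`, `containment`, `mutation_rate`), the choice of BLAKE2b as the hash, and the 64-bit hash space are all assumptions made for illustration. The bias-correction factor 1 - (1 - s)^|A| and the point estimate p = 1 - C^(1/k) are not stated in the abstract; they are included here as assumed forms and should be checked against the paper before use.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the hash space (assumption: 64-bit hashes)


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (stand-in for the hash a real tool uses)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_sketch(kmers, scale):
    """Keep every k-mer hash below scale * MAX_HASH, so the sketch grows with the set."""
    threshold = int(MAX_HASH * scale)
    return {h for h in map(hash64, kmers) if h < threshold}


def containment(sketch_a, sketch_b, n_a=None, scale=None):
    """Estimate |A & B| / |A| from two sketches.

    If n_a (number of distinct k-mers in A) and scale are given, divide by the
    assumed bias factor 1 - (1 - scale) ** n_a.
    """
    if not sketch_a:
        return 0.0
    estimate = len(sketch_a & sketch_b) / len(sketch_a)
    if n_a is not None and scale is not None:
        estimate /= 1.0 - (1.0 - scale) ** n_a
    return estimate


def mutation_rate(containment_value, k):
    """Point estimate of the per-base mutation rate under a simple mutation model
    (assumed form: containment is roughly (1 - p) ** k)."""
    return 1.0 - containment_value ** (1.0 / k)
```

With `scale=1.0` every hash is kept, so the estimate reduces to the exact containment; smaller values of `scale` trade accuracy for sketch size.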
2.
Nat Methods ; 19(4): 429-440, 2022 04.
Article in English | MEDLINE | ID: mdl-35396482

ABSTRACT

Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.


Subject(s)
Metagenome , Metagenomics , Archaea/genetics , Metagenomics/methods , Reproducibility of Results , Sequence Analysis, DNA , Software
3.
Bioinformatics ; 40(2)2024 02 01.
Article in English | MEDLINE | ID: mdl-38268451

ABSTRACT

MOTIVATION: In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. Existing tools generally return point estimates, with no associated confidence or uncertainty associated with it. This has led to practitioners experiencing difficulty when interpreting the results from these tools, particularly for low-abundance organisms as these often reside in the "noisy tail" of incorrect predictions. Furthermore, few tools account for the fact that reference databases are often incomplete and rarely, if ever, contain exact replicas of genomes present in an environmentally derived metagenome. RESULTS: We present solutions for these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. This approach introduces a statistical framework that accounts for sequence divergence between the reference and sample genomes, in terms of ANI, as well as incomplete sequencing depth, thus providing a hypothesis test for determining the presence or absence of a reference genome in a sample. After introducing our approach, we quantify its statistical power and how this changes with varying parameters. Subsequently, we perform extensive experiments using both simulated and real data to confirm the accuracy and scalability of this approach. AVAILABILITY AND IMPLEMENTATION: The source code implementing this approach is available via Conda and at https://github.com/KoslickiLab/YACHT. We also provide the code for reproducing experiments at https://github.com/KoslickiLab/YACHT-reproducibles.


Subject(s)
Metagenome , Microbiota , Microbiota/genetics , Algorithms , Software , Sequence Analysis, DNA/methods , Metagenomics/methods
4.
Bioinformatics ; 40(Suppl 2): ii165-ii173, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230701

ABSTRACT

MOTIVATION: Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe leveraging FracMinHash (implemented in sourmash, a publicly available software), a k-mer-sketching algorithm, to obtain functional profiles of metagenome samples. RESULTS: We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample. We named our pipeline fmh-funprofiler. We report that the functional profiles obtained using this pipeline demonstrate comparable completeness and better purity compared to the profiles obtained using other alignment-based methods when applied to simulated metagenomic data. We also report that fmh-funprofiler is 39-99× faster in wall-clock time, and consumes up to 40-55× less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets. AVAILABILITY AND IMPLEMENTATION: This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: https://github.com/KoslickiLab/fmh-funprofiler. All scripts of the analyses we present in this manuscript can be found on GitHub.


Subject(s)
Algorithms , Metagenome , Metagenomics , Software , Metagenomics/methods , Metagenome/genetics , Humans , Microbiota/genetics , Databases, Genetic
5.
Bioinformatics ; 39(39 Suppl 1): i57-i65, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387190

ABSTRACT

MOTIVATION: Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be improved by finding the average, a.k.a. the barycenter, among the samples with respect to the UniFrac distance. However, it is possible that such a UniFrac-average includes negative entries, making it no longer a valid representation of a metagenomic community. RESULTS: To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L2UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which one can easily compute the average, producing biologically meaningful environment-specific "representative samples." We demonstrate the usefulness of such representative samples as well as the extended usage of L2UniFrac in efficient clustering of metagenomic samples, and provide mathematical characterizations and proofs to the desired properties of L2UniFrac. AVAILABILITY AND IMPLEMENTATION: A prototype implementation is provided at https://github.com/KoslickiLab/L2-UniFrac.git. All figures, data, and analysis can be reproduced at https://github.com/KoslickiLab/L2-UniFrac-Paper.


Subject(s)
Metagenome , Metagenomics , Phylogeny , Cluster Analysis
6.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36752514

ABSTRACT

MOTIVATION: With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been hindered by the lack of an expressive analysis workflow language for translational reasoning and by the lack of a reasoning engine (supporting that language) that federates semantically integrated knowledge-bases. RESULTS: We introduce ARAX, a new reasoning system for translational biomedicine that provides a web browser user interface and an application programming interface (API). ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user's query and facilitate exploration of results. For ARAX, we developed new approaches to query planning, knowledge-gathering, reasoning and result ranking and dynamically integrate knowledge providers for answering biomedical questions. To illustrate ARAX's application and utility in specific disease contexts, we present several use-case examples. AVAILABILITY AND IMPLEMENTATION: The source code and technical documentation for building the ARAX server-side software and its built-in knowledge database are freely available online (https://github.com/RTXteam/RTX). We provide a hosted ARAX service with a web browser interface at arax.rtx.ai and a web API endpoint at arax.rtx.ai/api/arax/v1.3/ui/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Knowledge Bases , Software , Databases, Factual , Language , Web Browser
7.
Bioinformatics ; 38(Suppl 1): i28-i35, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758788

ABSTRACT

MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , Computational Biology/methods , Metagenomics , Sequence Analysis, DNA/methods
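
The truncation idea in the abstract above can be illustrated in a few lines of Python. This is a deliberately simplified sketch, not the CMash implementation: it uses a plain Python set rather than a k-mer ternary search tree, it omits the bottom-m sketch, and the helper names are hypothetical. It shows where a truncation bias can come from: k-mers near the end of a sequence are not a prefix of any kmax-mer.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmers_kmax, k):
    """Derive k-mers from stored kmax-mers by keeping only the first k characters."""
    return {kmer[:k] for kmer in kmers_kmax}


def containment(a, b):
    """Exact containment |A & B| / |A| of two k-mer sets."""
    return len(a & b) / len(a) if a else 0.0


# Build the large-k set once, then reuse it for every smaller k.
seq = "ACGTTGCAAC"
stored = kmer_set(seq, 5)
approx_3mers = truncate(stored, 3)
# k-mers that truncation cannot recover (they start in the last kmax - k positions)
missing = kmer_set(seq, 3) - approx_3mers
```

For this toy sequence, `missing` holds the two 3-mers that begin in the final two positions. On long sequences this tail is a small fraction of all k-mers.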
8.
Bioinformatics ; 38(Suppl 1): i169-i176, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758786

ABSTRACT

MOTIVATION: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. RESULTS: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. AVAILABILITY AND IMPLEMENTATION: Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software
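
For readers unfamiliar with the estimator analyzed in the abstract above, here is a minimal Python sketch of a minimizer sketch and the Jaccard estimate computed from it. The hash function (BLAKE2b), the window convention (w consecutive k-mers per window), and all function names are assumptions for illustration; tools such as mashmap may differ in these details.

```python
import hashlib


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (assumed hash, for illustration only)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k, w):
    """For each window of w consecutive k-mers, keep the k-mer with the smallest hash."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        sketch.add(min(window, key=hash64))
    return sketch


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minimizer_jaccard(seq_a, seq_b, k, w):
    """Jaccard similarity of the two minimizer sketches, used as an estimator."""
    return jaccard(minimizer_sketch(seq_a, k, w), minimizer_sketch(seq_b, k, w))
```

With `w=1` every k-mer is its own minimizer, so the estimate equals the true k-mer Jaccard similarity. For larger `w`, which k-mers are selected depends on their neighbors in each window, which is consistent with the abstract's point that the bias depends on how shared k-mers are laid out along the sequences.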
9.
BMC Bioinformatics ; 23(1): 400, 2022 Sep 29.
Article in English | MEDLINE | ID: mdl-36175836

ABSTRACT

BACKGROUND: Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB) in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions. Within that project and the broader field, there is a need for a framework that can efficiently and reproducibly build an integrated, standards-compliant, and comprehensive biomedical knowledge graph that can be downloaded in standard serialized form or queried via a public application programming interface (API). RESULTS: To create a knowledge provider system within the Translator project, we have developed RTX-KG2, an open-source software system for building (and hosting a web API for querying) a biomedical knowledge graph that uses an Extract-Transform-Load approach to integrate 70 knowledge sources (including the aforementioned core six sources) into a knowledge graph with provenance information including (where available) citations. The semantic layer and schema for RTX-KG2 follow the standard Biolink model to maximize interoperability. RTX-KG2 is currently being used by multiple Translator reasoning agents, both in its downloadable form and via its SmartAPI-registered interface. Serializations of RTX-KG2 are available for download in both the pre-canonicalized form and in canonicalized form (in which synonyms are merged). The current canonicalized version (KG2.7.3) of RTX-KG2 contains 6.4M nodes and 39.3M edges with a hierarchy of 77 relationship types from Biolink. CONCLUSION: RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema. RTX-KG2 is publicly available for querying via its API at arax.rtx.ai/api/rtxkg2/v1.2/openapi.json. The code to build RTX-KG2 is publicly available at github:RTXteam/RTX-KG2.


Subject(s)
Knowledge , Pattern Recognition, Automated , Semantics , Software , Translational Science, Biomedical
10.
BMC Biol ; 18(1): 37, 2020 04 07.
Article in English | MEDLINE | ID: mdl-32264902

ABSTRACT

Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.


Subject(s)
Bacteria/genetics , Databases, Genetic/standards , Fungi/genetics , Metagenomics/standards , Metagenomics/instrumentation
11.
BMC Biol ; 18(1): 92, 2020 07 28.
Article in English | MEDLINE | ID: mdl-32723395

ABSTRACT

An amendment to this paper has been published and can be accessed via the original article.

12.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subject(s)
Metagenomics , Software , Algorithms , Benchmarking , Sequence Analysis, DNA
13.
BMC Genomics ; 20(Suppl 5): 423, 2019 Jun 06.
Article in English | MEDLINE | ID: mdl-31167634

ABSTRACT

BACKGROUND: High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Additionally, most work in metagenomics has focused on bacteria and archaea, neglecting to study other key microbes such as viruses and eukaryotes. RESULTS: Here we present a method, MiCoP (Microbiome Community Profiling), that uses fast-mapping of reads to build a comprehensive reference database of full genomes from viruses and eukaryotes to achieve maximum read usage and enable the analysis of the virome and eukaryome in each sample. We demonstrate that mapping of metagenomic reads is feasible for the smaller viral and eukaryotic reference databases. We show that our method is accurate on simulated and mock community data and identifies many more viral and fungal species than previously-reported results on real data from the Human Microbiome Project. CONCLUSIONS: MiCoP is a mapping-based method that proves more effective than existing methods at abundance profiling of viruses and eukaryotes in metagenomic samples. MiCoP can be used to detect the full diversity of these communities. The code, data, and documentation are publicly available on GitHub at: https://github.com/smangul1/MiCoP .


Subject(s)
Computational Biology/methods , Fungi/genetics , Genetic Markers , Metagenomics/methods , Microbiota , Sequence Analysis, DNA/methods , Viruses/genetics , Algorithms , Fungi/classification , Genome, Fungal , Genome, Viral , High-Throughput Nucleotide Sequencing/methods , Humans , Viruses/classification
14.
Bioinformatics ; 34(9): 1514-1521, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29236975

ABSTRACT

Motivation: Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systematic way of achieving this goal. However, a reliable network motif discovery outcome requires generating random background networks which are the result of a uniform and independent graph sampling method. To date, there has been no method to numerically evaluate whether any network motif discovery algorithm performs as intended on realistically sized datasets; thus it was not possible to assess the validity of resulting network motifs. Results: In this work, we present IndeCut, the first method to date that characterizes network motif finding algorithm performance in terms of uniform sampling on realistically sized networks. We demonstrate that it is critical to use IndeCut prior to running any network motif finder for two reasons. First, IndeCut indicates the number of samples needed for a tool to produce an outcome that is both reproducible and accurate. Second, IndeCut allows users to choose the tool that generates samples in the most independent fashion for their network of interest among many available options. Availability and implementation: The open source software package is available at https://github.com/megrawlab/IndeCut. Contact: megrawm@science.oregonstate.edu or david.koslicki@math.oregonstate.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Software , Algorithms , Escherichia coli , Humans , Transcription Factors/metabolism
15.
J Math Biol ; 77(4): 935-949, 2018 10.
Article in English | MEDLINE | ID: mdl-29691633

ABSTRACT

Both the weighted and unweighted UniFrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the UniFrac metric is equivalent to the so-called earth mover's distance (also known as the Kantorovich-Rubinstein metric) to develop an algorithm that not only computes the UniFrac distance in linear time and space, but also simultaneously finds which operational taxonomic units are responsible for the observed differences between samples. This allows the algorithm, called EMDUniFrac, to determine why given samples are different, not just if they are different, and with no added computational burden. EMDUniFrac can be utilized on any distribution on a tree, and so is particularly suitable to analyzing both operational taxonomic units derived from amplicon sequencing, as well as community profiles resulting from classifying whole genome shotgun metagenomes. The EMDUniFrac source code (written in python) is freely available at: https://github.com/dkoslicki/EMDUniFrac .


Subject(s)
Algorithms , Microbiota , Models, Biological , Genome, Microbial , Humans , Linear Models , Mathematical Concepts , Metagenome , Microbiota/genetics , Phylogeny , Spatio-Temporal Analysis , Time Factors
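
The linear-time computation described in the abstract above can be sketched as an earth mover's distance on a tree: each node's surplus mass is pushed to its parent, and every branch contributes its length times the absolute mass crossing it. The Python below is an illustrative reconstruction under that reading, not the EMDUniFrac source code. The data layout (`parent`, `length`, `p`, `q` dictionaries) and the function name are assumptions.

```python
def emd_unifrac(parent, length, p, q):
    """Earth mover's distance between distributions p and q on a rooted tree.

    parent: dict node -> parent node (None for the root)
    length: dict node -> length of the branch from node to its parent
    p, q:   dict node -> mass (missing nodes count as 0.0)
    Returns (distance, flow), where flow[node] is the signed mass crossing the
    branch above node; large |flow| marks branches that drive the difference.
    """
    children = {node: [] for node in parent}
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    # Preorder visits each parent before its descendants; reversing it
    # therefore processes every child before its parent.
    preorder, stack = [], [root]
    while stack:
        node = stack.pop()
        preorder.append(node)
        stack.extend(children[node])

    diff = {node: p.get(node, 0.0) - q.get(node, 0.0) for node in parent}
    flow, distance = {}, 0.0
    for node in reversed(preorder):
        par = parent[node]
        if par is None:
            continue
        flow[node] = diff[node]
        distance += length[node] * abs(diff[node])
        diff[par] += diff[node]  # push the surplus up toward the root
    return distance, flow
```

Each node is visited a constant number of times, so the cost is linear in the size of the tree. The `flow` dictionary is what allows a reader to see which branches, and hence which taxa, account for the distance.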
16.
J Math Biol ; 76(4): 877-909, 2018 03.
Article in English | MEDLINE | ID: mdl-28735343

ABSTRACT

We consider the goal of predicting how complex networks respond to chronic (press) perturbations when characterizations of their network topology and interaction strengths are associated with uncertainty. Our primary result is the derivation of exact formulas for the expected number and probability of qualitatively incorrect predictions about a system's responses under uncertainties drawn from arbitrary distributions of error. Additional indices provide new tools for identifying which links in a network are most qualitatively and quantitatively sensitive to error, and for determining the volume of errors within which predictions will remain qualitatively determinate (i.e. sign insensitive). Together with recent advances in the empirical characterization of uncertainty in networks, these tools bridge a way towards probabilistic predictions of network dynamics.


Subject(s)
Models, Biological , Systems Biology/statistics & numerical data , Animals , Computational Biology , Ecosystem , Food Chain , Humans , Mathematical Concepts , Probability , Uncertainty
18.
Bioinformatics ; 30(17): 2423-31, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-24812337

ABSTRACT

MOTIVATION: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment. RESULTS: Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method. AVAILABILITY AND IMPLEMENTATION: A platform-independent Matlab implementation of the method is freely available at http://www.ee.kth.se/ctsoftware; source code that does not require access to Matlab is currently being tested and will be made available later through the above Web site.


Subject(s)
Bacteria/classification , Metagenomics/methods , Algorithms , Bacteria/genetics , High-Throughput Nucleotide Sequencing , Models, Statistical , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA
19.
J Math Biol ; 70(1-2): 45-69, 2015 Jan.
Article in English | MEDLINE | ID: mdl-24448658

ABSTRACT

We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/ .


Subject(s)
Genome, Human , Models, Genetic , Animals , Artificial Intelligence , Base Sequence , Drosophila melanogaster/genetics , Exons , Genomics/statistics & numerical data , Humans , Introns , Macaca mulatta/genetics , Mathematical Concepts , Mice , Sequence Analysis, DNA/statistics & numerical data , Software
20.
Bioinformatics ; 29(17): 2096-102, 2013 Sep 01.
Article in English | MEDLINE | ID: mdl-23786768

ABSTRACT

MOTIVATION: Many metagenomic studies compare hundreds to thousands of environmental and health-related samples by extracting and sequencing their 16S rRNA amplicons and measuring their similarity using beta-diversity metrics. However, one of the first steps--to classify the operational taxonomic units within the sample--can be a computationally time-consuming task because most methods rely on computing the taxonomic assignment of each individual read out of tens to hundreds of thousands of reads. RESULTS: We introduce Quikr: a QUadratic, K-mer-based, Iterative, Reconstruction method, which computes a vector of taxonomic assignments and their proportions in the sample using an optimization technique motivated from the mathematical theory of compressive sensing. On both simulated and actual biological data, we demonstrate that Quikr typically has less error and is typically orders of magnitude faster than the most commonly used taxonomic assignment technique (the Ribosomal Database Project's Naïve Bayesian Classifier). Furthermore, the technique is shown to be unaffected by the presence of chimeras, thereby allowing for the circumvention of the time-intensive step of chimera filtering. AVAILABILITY: The Quikr computational package (in MATLAB, Octave, Python and C) for the Linux and Mac platforms is available at http://sourceforge.net/projects/quikr/.


Subject(s)
Bacteria/classification , Sequence Analysis, DNA/methods , Algorithms , Bacteria/genetics , Bacteria/isolation & purification , Bayes Theorem , Classification/methods , Metagenomics , Microbiota , Phylogeny , RNA, Ribosomal, 16S/genetics , Software