Búsqueda | Portal Regional de la BVS

1.

Prokrustean Graph: A substring index supporting rapid enumeration across a range of k-mer sizes.

Park, Adam; Koslicki, David.

bioRxiv ; 2024 Jun 01.

Artículo en Inglés | MEDLINE | ID: mdl-38853857

RESUMEN

Despite the widespread adoption of k -mer-based methods in bioinformatics, a fundamental question persists: How can we quantify the influence of k sizes in applications? With no universal answer available, choosing an optimal k size or employing multiple k sizes remains application-specific, arbitrary, and computationally expensive. The assessment of the primary parameter k is typically empirical, based on the end products of applications which pass complex processes of genome analysis, comparison, assembly, alignment, and error correction. The elusiveness of the problem stems from a limited understanding of the transitions of k -mers with respect to k sizes. Indeed, there is considerable room for improving both practice and theory by exploring k -mer-specific quantities across multiple k sizes. This paper introduces an algorithmic framework built upon a novel substring representation: the Prokrustean graph. The primary functionality of this framework is to extract various k -mer-based quantities across a range of k sizes, but its computational complexity depends only on maximal repeats, not on the k range. For example, counting maximal unitigs of de Bruijn graphs for k = 10 , , 100 takes just a few seconds with a Prokrustean graph built on a read set of gigabases in size. This efficiency sets the graph apart from other substring indices, such as the FM-index, which are normally optimized for string pattern searching rather than for depicting the substring structure across varying lengths. However, the Prokrustean graph is expected to close this gap, as it can be built using the extended Burrows-Wheeler Transform (eBWT) in a space-efficient manner. The framework is particularly useful in pangenome and metagenome analyses, where the demand for precise multi- k approaches is increasing due to the complex and diverse nature of the information being managed. We introduce four applications implemented with the framework that extract key quantities actively utilized in modern pangenomics and metagenomics.

2.

Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation.

Hera, Mahmudur Rahman; Koslicki, David.

bioRxiv ; 2024 May 30.

Artículo en Inglés | MEDLINE | ID: mdl-38854044

RESUMEN

Motivation: The increasing number and volume of genomic and metagenomic data necessitates scalable and robust computational models for precise analysis. Sketching techniques utilizing k -mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics, such as the cosine similarity, are still lacking. Theoretical contributions: In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings. Practical contributions: We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.

3.

MetagenomicKG: a knowledge graph for metagenomic applications.

Ma, Chunyu; Liu, Shaopeng; Koslicki, David.

bioRxiv ; 2024 Mar 15.

Artículo en Inglés | MEDLINE | ID: mdl-38559251

RESUMEN

Motivation: The sheer volume and variety of genomic content within microbial communities makes metagenomics a field rich in biomedical knowledge. To traverse these complex communities and their vast unknowns, metagenomic studies often depend on distinct reference databases, such as the Genome Taxonomy Database (GTDB), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Bacterial and Viral Bioinformatics Resource Center (BV-BRC), for various analytical purposes. These databases are crucial for genetic and functional annotation of microbial communities. Nevertheless, the inconsistent nomenclature or identifiers of these databases present challenges for effective integration, representation, and utilization. Knowledge graphs (KGs) offer an appropriate solution by organizing biological entities and their interrelations into a cohesive network. The graph structure not only facilitates the unveiling of hidden patterns but also enriches our biological understanding with deeper insights. Despite KGs having shown potential in various biomedical fields, their application in metagenomics remains underexplored. Results: We present MetagenomicKG, a novel knowledge graph specifically tailored for metagenomic analysis. MetagenomicKG integrates taxonomic, functional, and pathogenesis-related information from widely used databases, and further links these with established biomedical knowledge graphs to expand biological connections. Through several use cases, we demonstrate its utility in enabling hypothesis generation regarding the relationships between microbes and diseases, generating sample-specific graph embeddings, and providing robust pathogen prediction. Availability and Implementation: The source code and technical details for constructing the MetagenomicKG and reproducing all analyses are available at Github: https://github.com/KoslickiLab/MetagenomicKG. We also host a Neo4j instance: http://mkg.cse.psu.edu:7474 for accessing and querying this graph.

4.

YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample.

Koslicki, David; White, Stephen; Ma, Chunyu; Novikov, Alexei.

Bioinformatics ; 40(2)2024 02 01.

Artículo en Inglés | MEDLINE | ID: mdl-38268451

RESUMEN

MOTIVATION: In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. Existing tools generally return point estimates, with no associated confidence or uncertainty associated with it. This has led to practitioners experiencing difficulty when interpreting the results from these tools, particularly for low-abundance organisms as these often reside in the "noisy tail" of incorrect predictions. Furthermore, few tools account for the fact that reference databases are often incomplete and rarely, if ever, contain exact replicas of genomes present in an environmentally derived metagenome. RESULTS: We present solutions for these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. This approach introduces a statistical framework that accounts for sequence divergence between the reference and sample genomes, in terms of ANI, as well as incomplete sequencing depth, thus providing a hypothesis test for determining the presence or absence of a reference genome in a sample. After introducing our approach, we quantify its statistical power and how this changes with varying parameters. Subsequently, we perform extensive experiments using both simulated and real data to confirm the accuracy and scalability of this approach. AVAILABILITY AND IMPLEMENTATION: The source code implementing this approach is available via Conda and at https://github.com/KoslickiLab/YACHT. We also provide the code for reproducing experiments at https://github.com/KoslickiLab/YACHT-reproducibles.

Asunto(s)

Metagenoma , Microbiota , Microbiota/genética , Algoritmos , Programas Informáticos , Análisis de Secuencia de ADN/métodos , Metagenómica/métodos

5.

An approach for collaborative development of a federated biomedical knowledge graph-based question-answering system: Question-of-the-Month challenges.

Fecho, Karamarie; Bizon, Chris; Issabekova, Tursynay; Moxon, Sierra; Thessen, Anne E; Abdollahi, Shervin; Baranzini, Sergio E; Belhu, Basazin; Byrd, William E; Chung, Lawrence; Crouse, Andrew; Duby, Marc P; Ferguson, Stephen; Foksinska, Aleksandra; Forero, Laura; Friedman, Jennifer; Gardner, Vicki; Glusman, Gwênlyn; Hadlock, Jennifer; Hanspers, Kristina; Hinderer, Eugene; Hobbs, Charlotte; Hyde, Gregory; Huang, Sui; Koslicki, David; Mease, Philip; Muller, Sandrine; Mungall, Christopher J; Ramsey, Stephen A; Roach, Jared; Rubin, Irit; Schurman, Shepherd H; Shalev, Anath; Smith, Brett; Soman, Karthik; Stemann, Sarah; Su, Andrew I; Ta, Casey; Watkins, Paul B; Williams, Mark D; Wu, Chunlei; Xu, Colleen H.

J Clin Transl Sci ; 7(1): e214, 2023.

Artículo en Inglés | MEDLINE | ID: mdl-37900350

RESUMEN

Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium, we have developed a knowledge graph-based question-answering system designed to augment human reasoning and accelerate translational scientific discovery: the Translator system. We have applied the Translator system to answer biomedical questions in the context of a broad array of diseases and syndromes, including Fanconi anemia, primary ciliary dyskinesia, multiple sclerosis, and others. A variety of collaborative approaches have been used to research and develop the Translator system. One recent approach involved the establishment of a monthly "Question-of-the-Month (QotM) Challenge" series. Herein, we describe the structure of the QotM Challenge; the six challenges that have been conducted to date on drug-induced liver injury, cannabidiol toxicity, coronavirus infection, diabetes, psoriatic arthritis, and ATP1A3-related phenotypes; the scientific insights that have been gleaned during the challenges; and the technical issues that were identified over the course of the challenges and that can now be addressed to foster further development of the prototype Translator system. We close with a discussion on Large Language Models such as ChatGPT and highlight differences between those models and the Translator system.

6.

Finding phylogeny-aware and biologically meaningful averages of metagenomic samples: L2UniFrac.

Wei, Wei; Millward, Andrew; Koslicki, David.

Bioinformatics ; 39(39 Suppl 1): i57-i65, 2023 06 30.

Artículo en Inglés | MEDLINE | ID: mdl-37387190

RESUMEN

MOTIVATION: Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be improved by finding the average, a.k.a. the barycenter, among the samples with respect to the UniFrac distance. However, it is possible that such a UniFrac-average includes negative entries, making it no longer a valid representation of a metagenomic community. RESULTS: To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L2UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which one can easily compute the average, producing biologically meaningful environment-specific "representative samples." We demonstrate the usefulness of such representative samples as well as the extended usage of L2UniFrac in efficient clustering of metagenomic samples, and provide mathematical characterizations and proofs to the desired properties of L2UniFrac. AVAILABILITY AND IMPLEMENTATION: A prototype implementation is provided at https://github.com/KoslickiLab/L2-UniFrac.git. All figures, data, and analysis can be reproduced at https://github.com/KoslickiLab/L2-UniFrac-Paper.

Asunto(s)

Metagenoma , Metagenómica , Filogenia , Análisis por Conglomerados

7.

Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash.

Rahman Hera, Mahmudur; Pierce-Ward, N Tessa; Koslicki, David.

Genome Res ; 33(7): 1061-1068, 2023 07.

Artículo en Inglés | MEDLINE | ID: mdl-37344105

RESUMEN

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool sourmash gather. Although experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that although FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases in which these analyses may fail to effectively warn the users of FracMinHash indicating the likelihood of such cases. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely compared with traditional MinHash, and the point estimates and confidence intervals perform significantly better in estimating mutation distances.

Asunto(s)

Evolución Biológica , Tasa de Mutación , Intervalos de Confianza , Metagenoma , Metagenómica/métodos

8.

YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample.

Koslicki, David; White, Stephen; Ma, Chunyu; Novikov, Alexei.

bioRxiv ; 2023 Apr 20.

Artículo en Inglés | MEDLINE | ID: mdl-37131762

RESUMEN

In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. While tools exist to answer this question, all existing approaches to date return point estimates, with no associated confidence or uncertainty associated with it. This has led to practitioners experiencing difficulty when interpreting the results from these tools, particularly for low abundance organisms as these often reside in the "noisy tail" of incorrect predictions. Furthermore, no tools to date account for the fact that reference databases are often incomplete and rarely, if ever, contain exact replicas of genomes present in an environmentally derived metagenome. In this work, we present solutions for these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. This approach introduces a statistical framework that accounts for sequence divergence between the reference and sample genomes, in terms of average nucleotide identity, as well as incomplete sequencing depth, thus providing a hypothesis test for determining the presence or absence of a reference genome in a sample. After introducing our approach, we quantify its statistical power as well as quantify theoretically how this changes with varying parameters. Subsequently, we perform extensive experiments using both simulated and real data to confirm the accuracy and scalability of this approach. Code implementing this approach, as well as all experiments performed, is available at https://github.com/KoslickiLab/YACHT.

9.

Finding phylogeny-aware and biologically meaningful averages of metagenomic samples: L ₂ UniFrac.

Wei, Wei; Millward, Andrew; Koslicki, David.

bioRxiv ; 2023 Feb 03.

Artículo en Inglés | MEDLINE | ID: mdl-36778267

RESUMEN

Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely-used metric for measuring the variability between metagenomic samples. We propose that the characterization of metagenomic environments can be achieved by finding the average, a.k.a. the barycenter, among the samples with respect to the UniFrac distance. However, it is possible that such a UniFrac-average includes negative entries, making it no longer a valid representation of a metagenomic community. To overcome this intrinsic issue, we propose a special version of the UniFrac metric, termed L 2 UniFrac, which inherits the phylogenetic nature of the traditional UniFrac and with respect to which one can easily compute the average, producing biologically meaningful environment-specific "representative samples". We demonstrate the usefulness of such representative samples as well as the extended usage of L 2 UniFrac in efficient clustering of metagenomic samples, and provide mathematical characterizations and proofs to the desired properties of L 2 UniFrac. A prototype implementation is provided at: https://github.com/KoslickiLab/L2-UniFrac.git .

10.

ARAX: a graph-based modular reasoning tool for translational biomedicine.

Glen, Amy K; Ma, Chunyu; Mendoza, Luis; Womack, Finn; Wood, E C; Sinha, Meghamala; Acevedo, Liliana; Kvarfordt, Lindsey G; Peene, Ross C; Liu, Shaopeng; Hoffman, Andrew S; Roach, Jared C; Deutsch, Eric W; Ramsey, Stephen A; Koslicki, David.

Bioinformatics ; 39(3)2023 03 01.

Artículo en Inglés | MEDLINE | ID: mdl-36752514

RESUMEN

MOTIVATION: With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been hindered by the lack of an expressive analysis workflow language for translational reasoning and by the lack of a reasoning engine-supporting that language-that federates semantically integrated knowledge-bases. RESULTS: We introduce ARAX, a new reasoning system for translational biomedicine that provides a web browser user interface and an application programming interface (API). ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user's query and facilitate exploration of results. For ARAX, we developed new approaches to query planning, knowledge-gathering, reasoning and result ranking and dynamically integrate knowledge providers for answering biomedical questions. To illustrate ARAX's application and utility in specific disease contexts, we present several use-case examples. AVAILABILITY AND IMPLEMENTATION: The source code and technical documentation for building the ARAX server-side software and its built-in knowledge database are freely available online (https://github.com/RTXteam/RTX). We provide a hosted ARAX service with a web browser interface at arax.rtx.ai and a web API endpoint at arax.rtx.ai/api/arax/v1.3/ui/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Bases del Conocimiento , Programas Informáticos , Bases de Datos Factuales , Lenguaje , Navegador Web

11.

RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine.

Wood, E C; Glen, Amy K; Kvarfordt, Lindsey G; Womack, Finn; Acevedo, Liliana; Yoon, Timothy S; Ma, Chunyu; Flores, Veronica; Sinha, Meghamala; Chodpathumwan, Yodsawalai; Termehchy, Arash; Roach, Jared C; Mendoza, Luis; Hoffman, Andrew S; Deutsch, Eric W; Koslicki, David; Ramsey, Stephen A.

BMC Bioinformatics ; 23(1): 400, 2022 Sep 29.

Artículo en Inglés | MEDLINE | ID: mdl-36175836

RESUMEN

BACKGROUND: Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions. Within that project and the broader field, there is a need for a framework that can efficiently and reproducibly build an integrated, standards-compliant, and comprehensive biomedical knowledge graph that can be downloaded in standard serialized form or queried via a public application programming interface (API). RESULTS: To create a knowledge provider system within the Translator project, we have developed RTX-KG2, an open-source software system for building-and hosting a web API for querying-a biomedical knowledge graph that uses an Extract-Transform-Load approach to integrate 70 knowledge sources (including the aforementioned core six sources) into a knowledge graph with provenance information including (where available) citations. The semantic layer and schema for RTX-KG2 follow the standard Biolink model to maximize interoperability. RTX-KG2 is currently being used by multiple Translator reasoning agents, both in its downloadable form and via its SmartAPI-registered interface. Serializations of RTX-KG2 are available for download in both the pre-canonicalized form and in canonicalized form (in which synonyms are merged). The current canonicalized version (KG2.7.3) of RTX-KG2 contains 6.4M nodes and 39.3M edges with a hierarchy of 77 relationship types from Biolink. CONCLUSION: RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema. RTX-KG2 is publicly available for querying via its API at arax.rtx.ai/api/rtxkg2/v1.2/openapi.json . The code to build RTX-KG2 is publicly available at github:RTXteam/RTX-KG2 .

Asunto(s)

Conocimiento , Reconocimiento de Normas Patrones Automatizadas , Semántica , Programas Informáticos , Ciencia Traslacional Biomédica

12.

The minimizer Jaccard estimator is biased and inconsistent.

Belbasi, Mahdi; Blanca, Antonio; Harris, Robert S; Koslicki, David; Medvedev, Paul.

Bioinformatics ; 38(Suppl 1): i169-i176, 2022 06 24.

Artículo en Inglés | MEDLINE | ID: mdl-35758786

RESUMEN

MOTIVATION: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. RESULTS: We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. AVAILABILITY AND IMPLEMENTATION: Scripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Programas Informáticos

13.

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.

Liu, Shaopeng; Koslicki, David.

Bioinformatics ; 38(Suppl 1): i28-i35, 2022 06 24.

Artículo en Inglés | MEDLINE | ID: mdl-35758788

RESUMEN

MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Asunto(s)

Algoritmos , Programas Informáticos , Biología Computacional/métodos , Metagenómica , Análisis de Secuencia de ADN/métodos

14.

Critical Assessment of Metagenome Interpretation: the second round of challenges.

Meyer, Fernando; Fritz, Adrian; Deng, Zhi-Luo; Koslicki, David; Lesker, Till Robin; Gurevich, Alexey; Robertson, Gary; Alser, Mohammed; Antipov, Dmitry; Beghini, Francesco; Bertrand, Denis; Brito, Jaqueline J; Brown, C Titus; Buchmann, Jan; Buluç, Aydin; Chen, Bo; Chikhi, Rayan; Clausen, Philip T L C; Cristian, Alexandru; Dabrowski, Piotr Wojciech; Darling, Aaron E; Egan, Rob; Eskin, Eleazar; Georganas, Evangelos; Goltsman, Eugene; Gray, Melissa A; Hansen, Lars Hestbjerg; Hofmeyr, Steven; Huang, Pingqin; Irber, Luiz; Jia, Huijue; Jørgensen, Tue Sparholt; Kieser, Silas D; Klemetsen, Terje; Kola, Axel; Kolmogorov, Mikhail; Korobeynikov, Anton; Kwan, Jason; LaPierre, Nathan; Lemaitre, Claire; Li, Chenhao; Limasset, Antoine; Malcher-Miranda, Fabio; Mangul, Serghei; Marcelino, Vanessa R; Marchet, Camille; Marijon, Pierre; Meleshko, Dmitry; Mende, Daniel R; Milanese, Alessio.

Nat Methods ; 19(4): 429-440, 2022 04.

Artículo en Inglés | MEDLINE | ID: mdl-35396482

RESUMEN

Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.

Asunto(s)

Metagenoma , Metagenómica , Archaea/genética , Metagenómica/métodos , Reproducibilidad de los Resultados , Análisis de Secuencia de ADN , Programas Informáticos

15.

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches.

Blanca, Antonio; Harris, Robert S; Koslicki, David; Medvedev, Paul.

J Comput Biol ; 29(2): 155-168, 2022 02.

Artículo en Inglés | MEDLINE | ID: mdl-35108101

RESUMEN

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

Asunto(s)

Mutación , Análisis de Secuencia de ADN/estadística & datos numéricos , Algoritmos , Secuencia de Bases , Biología Computacional , Intervalos de Confianza , Genómica/estadística & datos numéricos , Humanos , Modelos Genéticos , Alineación de Secuencia/estadística & datos numéricos , Programas Informáticos

16.

KGML-xDTD: a knowledge graph-based machine learning framework for drug treatment prediction and mechanism description.

Ma, Chunyu; Zhou, Zhihan; Liu, Han; Koslicki, David.

Gigascience ; 122022 12 28.

Artículo en Inglés | MEDLINE | ID: mdl-37602759

RESUMEN

BACKGROUND: Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery approaches. However, the underlying mechanisms of action (MOAs) between repurposed drugs and their target diseases remain largely unknown, which is still a main obstacle for computational drug repurposing methods to be widely adopted in clinical settings. RESULTS: In this work, we propose KGML-xDTD: a Knowledge Graph-based Machine Learning framework for explainably predicting Drugs Treating Diseases. It is a 2-module framework that not only predicts the treatment probabilities between drugs/compounds and diseases but also biologically explains them via knowledge graph (KG) path-based, testable MOAs. We leverage knowledge-and-publication-based information to extract biologically meaningful "demonstration paths" as the intermediate guidance in the Graph-based Reinforcement Learning (GRL) path-finding process. Comprehensive experiments and case study analyses show that the proposed framework can achieve state-of-the-art performance in both predictions of drug repurposing and recapitulation of human-curated drug MOA paths. CONCLUSIONS: KGML-xDTD is the first model framework that can offer KG path explanations for drug repurposing predictions by leveraging the combination of prediction outcomes and existing biological knowledge and publications. We believe it can effectively reduce "black-box" concerns and increase prediction confidence for drug repurposing based on predicted path-based explanations and further accelerate the process of drug discovery for emerging diseases.

Asunto(s)

Descubrimiento de Drogas , Reconocimiento de Normas Patrones Automatizadas , Humanos , Conocimiento , Aprendizaje Automático , Probabilidad

17.

TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles.

Sarwal, Varuni; Brito, Jaqueline; Mangul, Serghei; Koslicki, David.

Gigascience ; 122022 12 28.

Artículo en Inglés | MEDLINE | ID: mdl-36852763

RESUMEN

BACKGROUND: Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole-genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmarking datasets and platforms, standardized taxonomic profile formats, and a benchmarking platform to assess tool performance. While this standardization is essential, there is currently a lack of tools to visualize the standardized output of the many existing taxonomic profilers. Thus, benchmarking studies rely on a single-value metrics to compare performance of tools and compare to benchmarking datasets. This is one of the major problems in analyzing metagenomic profiling data, since single metrics, such as the F1 score, fail to capture the biological differences between the datasets. FINDINGS: Here we report the development of TAMPA (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods beyond the standard metrics used by the scientific community. We demonstrate the unique ability of TAMPA to generate a novel biological hypothesis by highlighting the taxonomic differences between samples otherwise missed by commonly utilized metrics. CONCLUSION: In this study, we show that TAMPA can help visualize the output of taxonomic profilers, enabling biologists to effectively choose the most appropriate profiling method to use on their metagenomics data. TAMPA is available on GitHub, Bioconda, and Galaxy Toolshed at https://github.com/dkoslicki/TAMPA and is released under the MIT license.

Asunto(s)

Benchmarking , Metagenómica , Metagenoma , Secuenciación Completa del Genoma

18.

Technology dictates algorithms: recent developments in read alignment.

Alser, Mohammed; Rotman, Jeremy; Deshpande, Dhrithi; Taraszka, Kodi; Shi, Huwenbo; Baykal, Pelin Icer; Yang, Harry Taegyun; Xue, Victor; Knyazev, Sergey; Singer, Benjamin D; Balliu, Brunilda; Koslicki, David; Skums, Pavel; Zelikovsky, Alex; Alkan, Can; Mutlu, Onur; Mangul, Serghei.

Genome Biol ; 22(1): 249, 2021 08 26.

Artículo en Inglés | MEDLINE | ID: mdl-34446078

RESUMEN

Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.

Asunto(s)

Algoritmos , Biología Computacional/métodos , Alineación de Secuencia , Genoma Humano , VIH/fisiología , Humanos , Metagenómica , Sulfitos

19.

Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit.

Meyer, Fernando; Lesker, Till-Robin; Koslicki, David; Fritz, Adrian; Gurevich, Alexey; Darling, Aaron E; Sczyrba, Alexander; Bremges, Andreas; McHardy, Alice C.

Nat Protoc ; 16(4): 1785-1801, 2021 04.

Artículo en Inglés | MEDLINE | ID: mdl-33649565

RESUMEN

Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices and common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized datasets, procedures and metrics for evaluation. In this tutorial, we describe emerging standards in computational meta-omics benchmarking derived and agreed upon by a larger community of researchers. Specifically, we outline recent efforts by the Critical Assessment of Metagenome Interpretation (CAMI) initiative, which supplies method developers and applied researchers with exhaustive quantitative data about software performance in realistic scenarios and organizes community-driven benchmarking challenges. We explain the most relevant evaluation metrics for assessing metagenome assembly, binning and profiling results, and provide step-by-step instructions on how to generate them. The instructions use simulated mouse gut metagenome data released in preparation for the second round of CAMI challenges and showcase the use of a repository of tool results for CAMI datasets. This tutorial will serve as a reference for the community and facilitate informative and reproducible benchmarking in microbiome research.

Asunto(s)

Benchmarking , Metagenómica/métodos , Programas Informáticos , Animales , Simulación por Computador , Bases de Datos Genéticas , Microbioma Gastrointestinal/genética , Metagenoma , Ratones , Filogenia , Estándares de Referencia , Reproducibilidad de los Resultados

20.

Application of MCAT questions as a testing tool and evaluation metric for knowledge graph-based reasoning systems.

Fecho, Karamarie; Balhoff, James; Bizon, Chris; Byrd, William E; Hang, Sui; Koslicki, David; Rensi, Stefano E; Schmitt, Patrick L; Wawer, Mathias J; Williams, Mark; Ahalt, Stanley C.

Clin Transl Sci ; 14(5): 1719-1724, 2021 09.

Artículo en Inglés | MEDLINE | ID: mdl-33742785

RESUMEN

"Knowledge graphs" (KGs) have become a common approach for representing biomedical knowledge. In a KG, multiple biomedical data sets can be linked together as a graph representation, with nodes representing entities, such as "chemical substance" or "genes," and edges representing predicates, such as "causes" or "treats." Reasoning and inference algorithms can then be applied to the KG and used to generate new knowledge. We developed three KG-based question-answering systems as part of the Biomedical Data Translator program. These systems are typically tested and evaluated using traditional software engineering tools and approaches. In this study, we explored a team-based approach to test and evaluate the prototype "Translator Reasoners" through the application of Medical College Admission Test (MCAT) questions. Specifically, we describe three "hackathons," in which the developers of each of the three systems worked together with a moderator to determine whether the applications could be used to solve MCAT questions. The results demonstrate progressive improvement in system performance, with 0% (0/5) correct answers during the first hackathon, 75% (3/4) correct during the second hackathon, and 100% (5/5) correct during the final hackathon. We discuss the technical and sociologic lessons learned and conclude that MCAT questions can be applied successfully in the context of moderated hackathons to test and evaluate prototype KG-based question-answering systems, identify gaps in current capabilities, and improve performance. Finally, we highlight several published clinical and translational science applications of the Translator Reasoners.

Asunto(s)

Reconocimiento de Normas Patrones Automatizadas/métodos , Ciencia Traslacional Biomédica/métodos , Algoritmos , Prueba de Admisión Académica/estadística & datos numéricos , Conjuntos de Datos como Asunto , Humanos

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA