Results 1 - 20 of 24
1.
Nat Methods ; 19(7): 845-853, 2022 07.
Article in English | MEDLINE | ID: mdl-35773532

ABSTRACT

16S ribosomal RNA-based analysis is the established standard for elucidating the composition of microbial communities. While short-read 16S rRNA analyses are largely confined to genus-level resolution at best, given that only a portion of the gene is sequenced, full-length 16S rRNA gene amplicon sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, an approach that uses an expectation-maximization algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from simulated datasets and mock communities show that Emu is capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of Emu by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow with those returned by full-length 16S rRNA gene sequences processed with Emu.


Subject(s)
Dromaiidae , Microbiota , Nanopore Sequencing , Animals , Bacteria/genetics , Dromaiidae/genetics , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics , Phylogeny , 16S Ribosomal RNA/genetics , DNA Sequence Analysis/methods
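
The expectation-maximization idea named in the Emu abstract above can be illustrated with a minimal Python sketch. This is not Emu's implementation: the function name `em_abundance`, the input layout (one dictionary of per-taxon read likelihoods per read), and the stopping tolerance are assumptions made purely for illustration.

```python
def em_abundance(likelihoods, n_taxa, n_iter=100, tol=1e-9):
    """Estimate relative abundances from per-read taxon likelihoods.

    likelihoods: list with one dict per read, mapping a taxon index to
    P(read | taxon). This input layout is hypothetical, not Emu's format.
    """
    abundance = [1.0 / n_taxa] * n_taxa  # start from a uniform profile
    for _ in range(n_iter):
        counts = [0.0] * n_taxa
        # E-step: split each read across taxa in proportion to
        # (current abundance) x (likelihood of the read under that taxon).
        for read in likelihoods:
            total = sum(abundance[t] * p for t, p in read.items())
            if total == 0.0:
                continue  # read is unexplained by the current profile
            for t, p in read.items():
                counts[t] += abundance[t] * p / total
        assigned = sum(counts)
        if assigned == 0.0:
            break
        # M-step: the new abundances are the normalized expected read counts.
        updated = [c / assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(abundance, updated))
        abundance = updated
        if delta < tol:
            break
    return abundance


# Three reads match taxon 0 unambiguously; one read is equally
# compatible with taxon 0 and taxon 2.
reads = [{0: 1.0}, {0: 1.0}, {0: 1.0}, {0: 0.5, 2: 0.5}]
profile = em_abundance(reads, n_taxa=3)
print(profile)
```

In this toy input, taxon 2 is supported only by an ambiguous read, so its estimated share shrinks at every iteration while taxon 0 absorbs the read. That is the sense in which an EM step can suppress false positives.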
2.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38724243

ABSTRACT

MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. AVAILABILITY AND IMPLEMENTATION: Parsnp v2 is available at https://github.com/marbl/parsnp.


Subject(s)
Bacterial Genome , Sequence Alignment , Software , Sequence Alignment/methods , Genomics/methods , Algorithms
3.
Bioinformatics ; 39(Suppl 1): i47-i56, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387148

ABSTRACT

MOTIVATION: Interactions among microbes within microbial communities have been shown to play crucial roles in human health. In spite of recent progress, low-level knowledge of the bacteria driving microbial interactions within microbiomes is still lacking, limiting our ability to fully decipher and control microbial communities. RESULTS: We present Bakdrive, a novel approach for identifying species driving interactions within microbiomes. Bakdrive infers ecological networks of given metagenomic sequencing samples and identifies minimum sets of driver species (MDS) using control theory. Bakdrive has three key innovations in this space: (i) it leverages inherent information from metagenomic sequencing samples to identify driver species, (ii) it explicitly takes host-specific variation into consideration, and (iii) it does not require a known ecological network. In extensive simulated data, we demonstrate that by identifying driver species from healthy donor samples and introducing them to the disease samples, we can restore the gut microbiome in recurrent Clostridioides difficile infection (rCDI) patients to a healthy state. We also applied Bakdrive to two real datasets, rCDI and Crohn's disease patients, uncovering driver species consistent with previous work. Bakdrive represents a novel approach for capturing microbial interactions. AVAILABILITY AND IMPLEMENTATION: Bakdrive is open-source and available at: https://gitlab.com/treangenlab/bakdrive.


Subject(s)
Crohn Disease , Gastrointestinal Microbiome , Microbiota , Humans , Metagenome , Bacteria/genetics
4.
BMC Genomics ; 21(1): 133, 2020 02 10.
Article in English | MEDLINE | ID: mdl-32039710

ABSTRACT

After publication of [1], the authors were informed by John A. Rhodes of a counterexample to Theorem 11 of [1].

5.
Syst Biol ; 68(3): 396-411, 2019 05 01.
Article in English | MEDLINE | ID: mdl-30329135

ABSTRACT

The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.].


Subject(s)
Classification/methods , Protein Databases , Statistical Models , Sequence Alignment/standards , Computer Simulation , Datasets as Topic
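
The precision and recall measures mentioned in the BAli-Phy benchmark abstract above are commonly computed over pairs of residues placed in the same alignment column. The sketch below shows one such calculation. It is an illustration under assumptions, not the study's evaluation code: alignments are given as equal-length gapped strings, `-` is the gap symbol, and the function names are hypothetical.

```python
from itertools import combinations


def homology_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, ungapped position).
    """
    positions = [0] * len(alignment)  # next ungapped position per sequence
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_index, char in enumerate(column):
            if char != gap:
                residues.append((seq_index, positions[seq_index]))
                positions[seq_index] += 1
        # Every pair of residues in one column is asserted to be homologous.
        pairs.update(combinations(residues, 2))
    return pairs


def precision_recall(estimated, reference):
    """Pairwise precision and recall of an estimated alignment."""
    est = homology_pairs(estimated)
    ref = homology_pairs(reference)
    shared = len(est & ref)
    precision = shared / len(est) if est else 0.0
    recall = shared / len(ref) if ref else 0.0
    return precision, recall


reference = ["AC-G", "ACTG"]
estimated = ["AC-G-", "ACT-G"]  # the two final G residues are left unaligned
print(precision_recall(estimated, reference))
```

The toy estimated alignment makes no wrong pairings but misses one true pairing, so its precision stays high while its recall drops. This is the pattern the abstract calls underalignment.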
6.
Syst Biol ; 68(2): 281-297, 2019 03 01.
Article in English | MEDLINE | ID: mdl-30247732

ABSTRACT

With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", which is when different genomic regions can evolve differently, is a common phenomenon in multi-locus data sets, and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.


Subject(s)
Classification/methods , Phylogeny , Likelihood Functions , Genetic Models
7.
BMC Genomics ; 19(Suppl 5): 286, 2018 May 08.
Article in English | MEDLINE | ID: mdl-29745854

ABSTRACT

BACKGROUND: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species. RESULTS: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data. CONCLUSIONS: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.


Subject(s)
Classification/methods , Genetic Speciation , Genetic Models , Phylogeny , Algorithms , Computer Simulation , Genes , Genomics , Species Specificity
8.
J Foot Ankle Surg ; 57(1): 60-64, 2018.
Article in English | MEDLINE | ID: mdl-29268903

ABSTRACT

Tendon transfers are often performed in the foot and ankle. Recently, interference screws have been a popular choice owing to their ease of use and fixation strength. Despite these benefits, one disadvantage of such devices is laceration of the soft tissues by the implant threads during placement, which potentially weakens the structural integrity of the grafts. A shape memory polyetheretherketone bullet-in-sheath tenodesis device uses circumferential compression, eliminating potential damage from thread rotation and maintaining the soft tissue orientation of the graft. The aim of this study was to determine the pullout strength and failure mode for this device in both synthetic bone analogue and porcine bone models. Thirteen mature bovine extensor tendons were secured into ten 4.0 × 4.0 × 4.0-cm cubes of 15-pound per cubic foot solid rigid polyurethane foam bone analogue models or 3 porcine femoral condyles using the 5 × 20-mm polyetheretherketone soft tissue anchor. The bullet-in-sheath device demonstrated a mean pullout strength of 280.84 N in the bone analogue models and 419.47 N in the porcine bone models (p = .001). The bullet-in-sheath design preserved the integrity of the tendon graft, and none of the implants dislodged from their original position.


Subject(s)
Ankle Joint/surgery , Foot/surgery , Ketones , Polyethylene Glycols , Suture Anchors , Tendon Transfer/methods , Animals , Benzophenones , Biomechanical Phenomena , Cattle , Anatomic Models , Polymers , Sensitivity and Specificity , Swine , Tensile Strength
9.
BMC Genomics ; 17(Suppl 10): 764, 2016 11 11.
Article in English | MEDLINE | ID: mdl-28185555

ABSTRACT

BACKGROUND: Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today. RESULTS: We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences. CONCLUSIONS: Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.


Subject(s)
Algorithms , Computational Biology/methods , Bayes Theorem , Genetic Databases , Sequence Alignment
10.
BMC Genomics ; 17(Suppl 10): 765, 2016 11 11.
Article in English | MEDLINE | ID: mdl-28185571

ABSTRACT

BACKGROUND: Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics. RESULTS: We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy. CONCLUSION: HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile hidden Markov models can better represent multiple sequence alignments than a single profile hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.


Subject(s)
Algorithms , Proteins/classification , Computational Biology , Protein Databases , Internet , Markov Chains , Proteins/chemistry , Proteins/metabolism , Sequence Alignment , User-Computer Interface
12.
bioRxiv ; 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38352342

ABSTRACT

Motivation: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. Results: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4x and reduce runtime by over 2x, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. Availability: Parsnp is available at https://github.com/marbl/parsnp.

13.
Curr Protoc ; 4(3): e978, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38511467

ABSTRACT

16S rRNA targeted amplicon sequencing is an established standard for elucidating microbial community composition. While high-throughput short-read sequencing can capture only a portion of the 16S rRNA gene due to its limited read length, third-generation sequencing can read the 16S rRNA gene in its entirety and thus provide more precise taxonomic classification. Here, we present a protocol for generating full-length 16S rRNA sequences with Oxford Nanopore Technologies (ONT) and a microbial community profile with Emu. We select Emu for analyzing ONT sequences as it leverages information from the entire community to overcome errors due to incomplete reference databases and hardware limitations to ultimately obtain species-level resolution. This pipeline provides a low-cost solution for characterizing microbiome composition by exploiting real-time, long-read ONT sequencing and tailored software for accurate characterization of microbial communities. © 2024 Wiley Periodicals LLC. Basic Protocol: Microbial community profiling with Emu. Support Protocol 1: Full-length 16S rRNA microbial sequences with Oxford Nanopore Technologies sequencing platform. Support Protocol 2: Building a custom reference database for Emu.


Subject(s)
Dromaiidae , Microbiota , Animals , 16S Ribosomal RNA/genetics , Dromaiidae/genetics , Bacteria/genetics , DNA Sequence Analysis/methods , Microbiota/genetics
14.
bioRxiv ; 2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38895276

ABSTRACT

Taxonomic profiling is a ubiquitous task in the analysis of clinical and environmental microbiomes. The advent of long-read sequencing of microbiomes necessitates the development of new taxonomic profilers tailored to long-read shotgun metagenomic datasets. Here, we introduce Lemur and Magnet, a pair of tools optimized for lightweight and accurate taxonomic profiling from long-read shotgun metagenomic datasets. Lemur is a marker-gene based method that leverages an EM algorithm to reduce false positive calls while preserving true positives; Magnet makes detailed presence/absence calls for bacterial genomes based on whole-genome read mapping. The tools work in sequence: Lemur estimates abundances conservatively, and Magnet operates on the genomes of identified organisms to filter out likely false positive taxa. The result is an increase in precision of as much as 70%, which far exceeds competing methods. By operating only on marker genes, Lemur is a comparatively lightweight software. We demonstrate that it can run in minutes to hours on a laptop with 32 GB of RAM, even for large inputs - a crucial feature given the portability of long-read sequencing machines. Furthermore, the marker gene database used by Lemur is only 4 GB and contains information from over 300,000 RefSeq genomes. The reference is available at https://zenodo.org/records/10802546, and the software is open-source and available at https://github.com/treangenlab/lemur.

15.
Genome Biol ; 23(1): 182, 2022 08 29.
Article in English | MEDLINE | ID: mdl-36038949

ABSTRACT

With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically-relevant insights.


Subject(s)
Human Genome , Genomics , Genomics/methods , Humans , Nucleotides , Telomere/genetics
16.
Comput Struct Biotechnol J ; 20: 3208-3222, 2022.
Article in English | MEDLINE | ID: mdl-35832621

ABSTRACT

Characterizing metagenomes via k-mer-based, database-dependent taxonomic classification has yielded key insights into underlying microbiome dynamics. However, novel approaches are needed to track community dynamics and genomic flux within metagenomes, particularly in response to perturbations. We describe KOMB, a novel method for tracking genome-level dynamics within microbiomes. KOMB utilizes K-core decomposition to identify structural variations (SVs), specifically population-level copy number variation (CNV), within microbiomes. K-core decomposition partitions the graph into shells containing nodes of induced degree at least K, yielding reduced computational complexity compared to prior approaches. Through validation on a synthetic community, we show that KOMB recovers and profiles repetitive genomic regions in the sample. KOMB is shown to identify functionally important regions in Human Microbiome Project datasets, and was used to analyze longitudinal data and identify keystone taxa in fecal microbiota transplantation (FMT) samples. In summary, KOMB represents a novel graph-based, taxonomy-oblivious, and reference-free approach for tracking CNV within microbiomes. KOMB is open source and available for download at https://gitlab.com/treangenlab/komb.
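
The K-core decomposition that the KOMB abstract above relies on can be sketched with the standard peeling procedure. This is a generic textbook version, not KOMB's code: the adjacency-dictionary input and the names `core_numbers` and `k_shell` are assumptions for illustration, and the graph is taken to be undirected with no self-loops.

```python
def core_numbers(adjacency):
    """Core number of every node, found by repeatedly peeling the
    remaining node of smallest degree.

    adjacency: dict mapping each node to the set of its neighbours
    (hypothetical input layout; symmetric, no self-loops).
    """
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # The smallest remaining degree can only raise the current level k.
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        # Removing the node lowers the induced degree of its neighbours.
        for nbr in adjacency[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


def k_shell(adjacency, k):
    """Nodes whose core number is exactly k."""
    return {n for n, c in core_numbers(adjacency).items() if c == k}


# A triangle (a, b, c) with one pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(core_numbers(graph))
```

This simple version rescans the remaining nodes at each step, which keeps the sketch short. A bucket-based implementation runs in time linear in the number of edges and would be the usual choice for large graphs.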

17.
Brain Behav Immun Health ; 21: 100438, 2022 May.
Article in English | MEDLINE | ID: mdl-35284846

ABSTRACT

Concussions, both single and repetitive, cause brain and body alterations in athletes during contact sports. The role of the brain-gut connection and changes in the microbiota have not been well established after sports-related concussions or repetitive subconcussive impacts. We recruited 33 Division I Collegiate football players and collected blood, stool, and saliva samples at three time points throughout the athletic season: mid-season, following the last competitive game (post-season), and after a resting period in the off-season. Additional samples were collected from four athletes who suffered a concussion. 16S rRNA sequencing of the gut microbiome revealed a decrease in abundance for two bacterial species, Eubacterium rectale and Anaerostipes hadrus, after a diagnosed concussion. No significant differences were found regarding the salivary microbiome. Serum biomarker analysis showed an increase in GFAP blood levels in athletes during the competitive season. Additionally, S100β and SAA blood levels were positively correlated with the abundance of Eubacterium rectale among the group of athletes that did not suffer a diagnosed concussion during the sports season. These findings provide initial evidence that detecting changes in the gut microbiome may help to improve concussion diagnosis following head injury.

18.
Nat Commun ; 13(1): 1728, 2022 04 01.
Article in English | MEDLINE | ID: mdl-35365602

ABSTRACT

Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.


Subject(s)
Deep Learning , Computational Biology , Phylogeny , Proteins , Systems Biology
19.
Emerg Top Life Sci ; 5(6): 815-827, 2021 12 21.
Article in English | MEDLINE | ID: mdl-34779841

ABSTRACT

Associations between the human gut microbiome and expression of host illness have been noted in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases including liver cirrhosis and irritable bowel disease, but have lacked efficacy when predicting other illnesses. Here, we review current ML methods designed for disease classification from microbiome data. We highlight the computational challenges these methods have effectively overcome and discuss the biological components that have been overlooked to offer perspectives on future work in this area.


Subject(s)
Gastrointestinal Microbiome , Microbiota , Humans , Machine Learning , Metagenome , Metagenomics/methods
20.
ArXiv ; 2021 May 07.
Article in English | MEDLINE | ID: mdl-33972927

ABSTRACT

With recent advances in sequencing technology, it has become affordable and practical to sequence genomes to very high depth of coverage, allowing researchers to discover low-frequency variants in the genome. However, because sequencing introduces errors, developing algorithms that can separate noise from true variants is an active area of research. LoFreq is a state-of-the-art algorithm for low-frequency variant detection but has a relatively long runtime compared to other tools. In addition, the interface for running in parallel could be simplified, allowing for multithreading as well as distributing jobs to a cluster. In this work we describe some specific contributions to LoFreq that remedy these issues.
