1.
Nucleic Acids Res ; 51(D1): D690-D699, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36263822

ABSTRACT

The Comprehensive Antibiotic Resistance Database (CARD; card.mcmaster.ca) combines the Antibiotic Resistance Ontology (ARO) with curated AMR gene (ARG) sequences and resistance-conferring mutations to provide an informatics framework for annotation and interpretation of resistomes. As of version 3.2.4, CARD encompasses 6627 ontology terms, 5010 reference sequences, 1933 mutations, 3004 publications, and 5057 AMR detection models that can be used by the accompanying Resistance Gene Identifier (RGI) software to annotate genomic or metagenomic sequences. Focused curation enhancements since 2020 include expanded β-lactamase curation, incorporation of likelihood-based AMR mutations for Mycobacterium tuberculosis, addition of disinfectants and antiseptics plus their associated ARGs, and systematic curation of resistance-modifying agents. This expanded curation includes 180 new AMR gene families, 15 new drug classes, 1 new resistance mechanism, and two new ontological relationships: evolutionary_variant_of and is_small_molecule_inhibitor. In silico prediction of resistomes and prevalence statistics of ARGs has been expanded to 377 pathogens, 21,079 chromosomes, 2,662 genomic islands, 41,828 plasmids and 155,606 whole-genome shotgun assemblies, resulting in collation of 322,710 unique ARG allele sequences. New features include the CARD:Live collection of community submitted isolate resistome data and the introduction of standardized 15 character CARD Short Names for ARGs to support machine learning efforts.


Subject(s)
Data Curation , Databases, Factual , Drug Resistance, Microbial , Machine Learning , Anti-Bacterial Agents/pharmacology , Genes, Bacterial , Likelihood Functions , Software , Molecular Sequence Annotation
2.
Environ Microbiol ; 26(1): e16566, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38149467

ABSTRACT

Trimming of sequencing reads is a pre-processing step that aims to discard sequence segments such as primers, adapters and low quality nucleotides that will interfere with clustering and classification steps. We evaluated the impact of trimming length of paired-end 16S and 18S rRNA amplicon reads on the ability to reconstruct the taxonomic composition and relative abundances of communities with a known composition in both even and uneven proportions. We found that maximizing read retention maximizes recall but reduces precision by increasing false positives. The presence of expected taxa was accurately predicted across broad trim length ranges but recovering original relative proportions remains a difficult challenge. We show that parameters that maximize taxonomic recovery do not simultaneously maximize relative abundance accuracy. Trim length represents one of several experimental parameters that have non-uniform impact across microbial clades, making it a difficult parameter to optimize. This study offers insights, guidelines, and helps researchers assess the significance of their decisions when trimming raw reads in a microbiome analysis based on overlapping or non-overlapping paired-end amplicons.


Subject(s)
Microbiota , RNA, Ribosomal, 16S/genetics , Microbiota/genetics , Sequence Analysis, DNA , RNA, Ribosomal, 18S , DNA Primers/genetics , High-Throughput Nucleotide Sequencing
3.
Syst Biol ; 72(3): 559-574, 2023 Jun 17.
Article in English | MEDLINE | ID: mdl-35904761

ABSTRACT

Organismal traits can evolve in a coordinated way, with correlated patterns of gains and losses reflecting important evolutionary associations. Discovering these associations can reveal important information about the functional and ecological linkages among traits. Phylogenetic profiles treat individual genes as traits distributed across sets of genomes and can provide a fine-grained view of the genetic underpinnings of evolutionary processes in a set of genomes. Phylogenetic profiling has been used to identify genes that are functionally linked and to identify common patterns of lateral gene transfer in microorganisms. However, comparative analysis of phylogenetic profiles and other trait distributions should take into account the phylogenetic relationships among the organisms under consideration. Here, we propose the Community Coevolution Model (CCM), a new coevolutionary model to analyze the evolutionary associations among traits, with a focus on phylogenetic profiles. In the CCM, traits are considered to evolve as a community with interactions, and the transition rate for each trait depends on the current states of other traits. Surpassing other comparative methods for pairwise trait analysis, CCM has the additional advantage of being able to examine multiple traits as a community to reveal more dependency relationships. We also develop a simulation procedure to generate phylogenetic profiles with correlated evolutionary patterns that can be used as benchmark data for evaluation purposes. A simulation study demonstrates that CCM is more accurate than other methods including the Jaccard Index and three tree-aware methods. The parameterization of CCM makes the interpretation of the relations between genes more direct, which leads to Darwin's scenario being identified easily based on the estimated parameters. We show that CCM is more efficient and fits real data better than other methods resulting in higher likelihood scores with fewer parameters. 
An examination of 3786 phylogenetic profiles across a set of 659 bacterial genomes highlights linkages between genes with common functions, including many patterns that would not have been identified under a nonphylogenetic model of common distribution. We also applied the CCM to 44 proteins in the well-studied Mitochondrial Respiratory Complex I and recovered associations that mapped well onto the structural associations that exist in the complex. [Coevolution; evolutionary rates; gene network; graphical models; phylogenetic profiles; phylogeny.].
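
The Jaccard Index named in this abstract as a comparison method can be illustrated on two presence/absence phylogenetic profiles. The sketch below is a minimal illustration only: the function name and the toy profiles are hypothetical, and it deliberately ignores the phylogeny, which is the limitation the abstract argues against. It is not the CCM itself.

```python
def jaccard_index(profile_a, profile_b):
    """Jaccard index of two presence/absence profiles over the same genomes.

    Each profile is a sequence of 0/1 values, one per genome, in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    # Genomes in which both genes are present.
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    # Genomes in which at least one of the genes is present.
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    # Convention chosen here: two all-absent profiles score 0.0.
    return both / either if either else 0.0


# Hypothetical toy profiles across five genomes.
gene_x = [1, 1, 0, 1, 0]
gene_y = [1, 0, 0, 1, 1]
print(jaccard_index(gene_x, gene_y))
```

Because every genome is counted as an independent observation, closely related genomes that share a gene by common descent inflate this score, which is why tree-aware methods are used for comparison.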


Subject(s)
Biological Evolution , Proteins , Phylogeny , Phenotype , Genome, Bacterial
4.
Clin Microbiol Rev ; 35(3): e0017921, 2022 09 21.
Article in English | MEDLINE | ID: mdl-35612324

ABSTRACT

Antimicrobial resistance (AMR) is a global health crisis that poses a great threat to modern medicine. Effective prevention strategies are urgently required to slow the emergence and further dissemination of AMR. Given the availability of data sets encompassing hundreds or thousands of pathogen genomes, machine learning (ML) is increasingly being used to predict resistance to different antibiotics in pathogens based on gene content and genome composition. A key objective of this work is to advocate for the incorporation of ML into front-line settings but also highlight the further refinements that are necessary to safely and confidently incorporate these methods. The question of what to predict is not trivial given the existence of different quantitative and qualitative laboratory measures of AMR. ML models typically treat genes as independent predictors, with no consideration of structural and functional linkages; they also may not be accurate when new mutational variants of known AMR genes emerge. Finally, to have the technology trusted by end users in public health settings, ML models need to be transparent and explainable to ensure that the basis for prediction is clear. We strongly advocate that the next set of AMR-ML studies should focus on the refinement of these limitations to be able to bridge the gap to diagnostic implementation.


Subject(s)
Anti-Bacterial Agents , Drug Resistance, Bacterial , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/therapeutic use , Drug Resistance, Bacterial/genetics , Machine Learning
5.
Bioinformatics ; 38(11): 3051-3061, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35536192

ABSTRACT

MOTIVATION: There is a plethora of measures to evaluate functional similarity (FS) of genes based on their co-expression, protein-protein interactions and sequence similarity. These measures are typically derived from hand-engineered and application-specific metrics to quantify the degree of shared information between two genes using their Gene Ontology (GO) annotations. RESULTS: We introduce deepSimDEF, a deep learning method to automatically learn FS estimation of gene pairs given a set of genes and their GO annotations. deepSimDEF's key novelty is its ability to learn low-dimensional embedding vector representations of GO terms and gene products and then calculate FS using these learned vectors. We show that deepSimDEF can predict the FS of new genes using their annotations: it outperformed all other FS measures by >5-10% on yeast and human reference datasets on protein-protein interactions, gene co-expression and sequence homology tasks. Thus, deepSimDEF offers a powerful and adaptable deep neural architecture that can benefit a wide range of problems in genomics and proteomics, and its architecture is flexible enough to support its extension to any organism. AVAILABILITY AND IMPLEMENTATION: Source code and data are available at https://github.com/ahmadpgh/deepSimDEF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Proteins , Humans , Gene Ontology , Computational Biology/methods , Molecular Sequence Annotation , Software , Saccharomyces cerevisiae , RNA
6.
BMC Microbiol ; 22(1): 270, 2022 11 10.
Article in English | MEDLINE | ID: mdl-36357861

ABSTRACT

BACKGROUND: Preterm birth is a global problem with about 12% of births in sub-Saharan Africa occurring before 37 weeks of gestation. Several studies have explored a potential association between vaginal microbiota and preterm birth, and some have found an association while others have not. We performed a study designed to determine whether there is an association with vaginal microbiota and/or placental microbiota and preterm birth in an African setting. METHODS: Women presenting to the study hospital in labor with a gestational age of 26 to 36 weeks plus six days were prospectively enrolled in a study of the microbiota in preterm labor along with controls matched for age and parity. A vaginal sample was collected at the time of presentation to the hospital in active labor. In addition, a placental sample was collected when available. Libraries were constructed using PCR primers to amplify the V6/V7/V8 variable regions of the 16S rRNA gene, followed by sequencing with an Illumina MiSeq machine and analysis using QIIME2 2022.2. RESULTS: Forty-nine women presenting with preterm labor and their controls were enrolled in the study of which 23 matched case-control pairs had sufficient sequence data for comparison. Lactobacillus was identified in all subjects, ranging in abundance from < 1% to > 99%, with Lactobacillus iners and Lactobacillus crispatus the most common species. Over half of the vaginal samples contained Gardnerella and/or Prevotella; both species were associated with preterm birth in previous studies. However, we found no significant difference in composition between mothers with preterm and those with full-term deliveries, with both groups showing roughly equal representation of different Lactobacillus species and dysbiosis-associated genera. Placental samples generally had poor DNA recovery, with a mix of probable sequencing artifacts, contamination, and bacteria acquired during passage through the birth canal. 
However, several placental samples showed strong evidence for the presence of Streptococcus species, which are known to infect the placenta. CONCLUSIONS: The current study showed no association of preterm birth with composition of the vaginal community. It does provide important information on the range of sequence types in African women and supports other data suggesting that women of African ancestry have an increased frequency of non-Lactobacillus types, but without evidence of associated adverse outcomes.


Subject(s)
Microbiota , Obstetric Labor, Premature , Premature Birth , Humans , Female , Infant, Newborn , Pregnancy , Infant , RNA, Ribosomal, 16S/genetics , Premature Birth/microbiology , Case-Control Studies , Kenya , Placenta , Vagina/microbiology , Obstetric Labor, Premature/microbiology , Microbiota/genetics
7.
Nucleic Acids Res ; 48(D1): D517-D525, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31665441

ABSTRACT

The Comprehensive Antibiotic Resistance Database (CARD; https://card.mcmaster.ca) is a curated resource providing reference DNA and protein sequences, detection models and bioinformatics tools on the molecular basis of bacterial antimicrobial resistance (AMR). CARD focuses on providing high-quality reference data and molecular sequences within a controlled vocabulary, the Antibiotic Resistance Ontology (ARO), designed by the CARD biocuration team to integrate with software development efforts for resistome analysis and prediction, such as CARD's Resistance Gene Identifier (RGI) software. Since 2017, CARD has expanded through extensive curation of reference sequences, revision of the ontological structure, curation of over 500 new AMR detection models, development of a new classification paradigm and expansion of analytical tools. Most notably, a new Resistomes & Variants module provides analysis and statistical summary of in silico predicted resistance variants from 82 pathogens and over 100 000 genomes. By adding these resistance variants to CARD, we are able to summarize predicted resistance using the information included in CARD, identify trends in AMR mobility and determine previously undescribed and novel resistance variants. Here, we describe updates and recent expansions to CARD and its biocuration process, including new resources for community biocuration of AMR molecular reference data.


Subject(s)
Databases, Genetic , Drug Resistance, Bacterial , Genes, Bacterial , Software , Bacteria/drug effects , Bacteria/genetics , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Bacterial Proteins/metabolism
8.
Bioinformatics ; 36(10): 3043-3048, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32108861

ABSTRACT

MOTIVATION: Many methods for microbial protein subcellular localization (SCL) prediction exist; however, none is readily available for analysis of metagenomic sequence data, despite growing interest from researchers studying microbial communities in humans, agri-food relevant organisms and in other environments (e.g. for identification of cell-surface biomarkers for rapid protein-based diagnostic tests). We wished to also identify new markers of water quality from freshwater samples collected from pristine versus pollution-impacted watersheds. RESULTS: We report PSORTm, the first bioinformatics tool designed for prediction of diverse bacterial and archaeal protein SCL from metagenomics data. PSORTm incorporates components of PSORTb, one of the most precise and widely used protein SCL predictors, with an automated classification by cell envelope. An evaluation using 5-fold cross-validation with in silico-fragmented sequences with known localization showed that PSORTm maintains PSORTb's high precision, while sensitivity increases proportionately with metagenomic sequence fragment length. PSORTm's read-based analysis was similar to PSORTb-based analysis of metagenome-assembled genomes (MAGs); however, the latter requires non-trivial manual classification of each MAG by cell envelope, and cannot make use of unassembled sequences. Analysis of the watershed samples revealed the importance of normalization and identified potential biomarkers of water quality. This method should be useful for examining a wide range of microbial communities, including human microbiomes, and other microbiomes of medical, environmental or industrial importance. AVAILABILITY AND IMPLEMENTATION: Documentation, source code and docker containers are available for running PSORTm locally at https://www.psort.org/psortm/ (freely available, open-source software under GNU General Public License Version 3). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Archaea , Metagenomics , Archaea/genetics , Bacteria/genetics , Humans , Metagenome , Software
9.
Microb Ecol ; 77(3): 713-725, 2019 Apr.
Article in English | MEDLINE | ID: mdl-30209585

ABSTRACT

Soil microorganisms are important mediators of carbon cycling in nature. Although cellulose- and hemicellulose-degrading bacteria have been isolated from Algerian ecosystems, the information on the composition of soil bacterial communities and thus the potential of their members to decompose plant residues is still limited. The objective of the present study was to describe and compare the bacterial community composition in Algerian soils (crop, forest, garden, and desert) and the activity of cellulose- and hemicellulose-degrading enzymes. Bacterial communities were characterized by high-throughput 16S amplicon sequencing followed by the in silico prediction of their functional potential. The highest lignocellulolytic activity was recorded in forest and garden soils whereas activities in the agricultural and desert soils were typically low. The bacterial phyla Proteobacteria (in particular classes α-proteobacteria, δ-proteobacteria, and γ-proteobacteria), Firmicutes, and Actinobacteria dominated in all soils. Forest and garden soils exhibited higher diversity than agricultural and desert soils. Endocellulase activity was elevated in forest and garden soils. In silico analysis predicted higher share of genes assigned to general metabolism in forest and garden soils compared with agricultural and arid soils, particularly in carbohydrate metabolism. The highest potential of lignocellulose decomposition was predicted for forest soils, which is in agreement with the highest activity of corresponding enzymes.


Subject(s)
Bacteria/enzymology , Bacterial Proteins/metabolism , Cellulase/metabolism , Glycoside Hydrolases/metabolism , Soil Microbiology , Soil/chemistry , Algeria , Bacteria/classification , Bacteria/genetics , Bacteria/isolation & purification , Bacterial Proteins/genetics , Cellulase/genetics , Ecosystem , Forests , Glycoside Hydrolases/genetics , Phylogeny
10.
Mol Ecol ; 27(20): 4026-4040, 2018 10.
Article in English | MEDLINE | ID: mdl-30152128

ABSTRACT

Conservation of exploited species requires an understanding of both genetic diversity and the dominant structuring forces, particularly near range limits, where climatic variation can drive rapid expansions or contractions of geographic range. Here, we examine population structure and landscape associations in Atlantic salmon (Salmo salar) across a heterogeneous landscape near the northern range limit in Labrador, Canada. Analysis of two amplicon-based data sets containing 101 microsatellites and 376 single nucleotide polymorphisms (SNPs) from 35 locations revealed clear differentiation between populations spawning in rivers flowing into a large marine embayment (Lake Melville) compared to coastal populations. The mechanisms influencing the differentiation of embayment populations were investigated using both multivariate and machine-learning landscape genetic approaches. We identified temperature as the strongest correlate with genetic structure, particularly warm temperature extremes and wider annual temperature ranges. The genomic basis of this divergence was further explored using a subset of locations (n = 17) and a 220K SNP array. SNPs associated with spatial structuring and temperature mapped to a diverse set of genes and molecular pathways, including regulation of gene expression, immune response, and cell development and differentiation. The results spanning molecular marker types and both novel and established methods clearly show climate-associated, fine-scale population structure across an environmental gradient in Atlantic salmon near its range limit in North America, highlighting valuable approaches for predicting population responses to climate change and managing species sustainability.


Subject(s)
Genetics, Population/methods , Microsatellite Repeats/genetics , Salmo salar/genetics , Animals , North America , Polymorphism, Single Nucleotide/genetics
11.
Nature ; 492(7427): 59-65, 2012 Dec 06.
Article in English | MEDLINE | ID: mdl-23201678

ABSTRACT

Cryptophyte and chlorarachniophyte algae are transitional forms in the widespread secondary endosymbiotic acquisition of photosynthesis by engulfment of eukaryotic algae. Unlike most secondary plastid-bearing algae, miniaturized versions of the endosymbiont nuclei (nucleomorphs) persist in cryptophytes and chlorarachniophytes. To determine why, and to address other fundamental questions about eukaryote-eukaryote endosymbiosis, we sequenced the nuclear genomes of the cryptophyte Guillardia theta and the chlorarachniophyte Bigelowiella natans. Both genomes have >21,000 protein genes and are intron rich, and B. natans exhibits unprecedented alternative splicing for a single-celled organism. Phylogenomic analyses and subcellular targeting predictions reveal extensive genetic and biochemical mosaicism, with both host- and endosymbiont-derived genes servicing the mitochondrion, the host cell cytosol, the plastid and the remnant endosymbiont cytosol of both algae. Mitochondrion-to-nucleus gene transfer still occurs in both organisms but plastid-to-nucleus and nucleomorph-to-nucleus transfers do not, which explains why a small residue of essential genes remains locked in each nucleomorph.


Subject(s)
Cell Nucleus/genetics , Cercozoa/genetics , Cryptophyta/genetics , Evolution, Molecular , Genome/genetics , Mosaicism , Symbiosis/genetics , Algal Proteins/genetics , Algal Proteins/metabolism , Alternative Splicing/genetics , Cercozoa/cytology , Cercozoa/metabolism , Cryptophyta/cytology , Cryptophyta/metabolism , Cytosol/metabolism , Gene Duplication/genetics , Gene Transfer, Horizontal/genetics , Genes, Essential/genetics , Genome, Mitochondrial/genetics , Genome, Plant/genetics , Genome, Plastid/genetics , Molecular Sequence Data , Phylogeny , Protein Transport , Proteome/genetics , Proteome/metabolism , Transcriptome/genetics
12.
Bioinformatics ; 32(9): 1380-7, 2016 05 01.
Article in English | MEDLINE | ID: mdl-26708333

ABSTRACT

MOTIVATION: Measures of protein functional similarity are essential tools for function prediction, evaluation of protein-protein interactions (PPIs) and other applications. Several existing methods perform comparisons between proteins based on the semantic similarity of their GO terms; however, these measures are highly sensitive to modifications in the topological structure of GO, tend to be focused on specific analytical tasks and concentrate on the GO terms themselves rather than considering their textual definitions. RESULTS: We introduce simDEF, an efficient method for measuring semantic similarity of GO terms using their GO definitions, which is based on the Gloss Vector measure commonly used in natural language processing. The simDEF approach builds optimized definition vectors for all relevant GO terms, and expresses the similarity of a pair of proteins as the cosine of the angle between their definition vectors. Relative to existing similarity measures, when validated on a yeast reference database, simDEF improves correlation with sequence homology by up to 50%, shows a correlation improvement >4% with gene expression in the biological process hierarchy of GO and increases PPI predictability by > 2.5% in F1 score for molecular function hierarchy. AVAILABILITY AND IMPLEMENTATION: Datasets, results and source code are available at http://kiwi.cs.dal.ca/Software/simDEF CONTACT: ahmad.pgh@dal.ca or beiko@cs.dal.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
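
The cosine comparison described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the vectors below are hypothetical toy values, and how simDEF actually builds and optimizes its definition vectors is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen here: an all-zero vector has similarity 0.0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "definition vectors" for three proteins.
protein_a = [1.0, 0.0, 2.0]
protein_b = [2.0, 0.0, 4.0]  # same direction as protein_a
protein_c = [0.0, 3.0, 0.0]  # orthogonal to protein_a

print(cosine_similarity(protein_a, protein_b))
print(cosine_similarity(protein_a, protein_c))
```

Cosine similarity depends only on the direction of the vectors, not their length, so two proteins whose vectors differ only in scale are treated as maximally similar.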


Subject(s)
Computational Biology , Gene Ontology , Algorithms , Animals , Humans , Proteins , Semantics
13.
BMC Genomics ; 16: 526, 2015 Jul 16.
Article in English | MEDLINE | ID: mdl-26173980

ABSTRACT

BACKGROUND: Lateral gene transfer (LGT) is an important evolutionary process in microbial evolution. In sewage treatment plants, LGT of antibiotic resistance and xenobiotic degradation-related proteins has been suggested, but the role of LGT outside these processes is unknown. Microbial communities involved in Enhanced Biological Phosphorus Removal (EBPR) have been used to treat wastewater in the last 50 years and may provide insights into adaptation to an engineered environment. We introduce two different types of analysis to identify LGT in EBPR sewage communities, based on identifying assembled sequences with more than one strong taxonomic match, and on unusual phylogenetic patterns. We applied these methods to investigate the role of LGT in six energy-related metabolic pathways. RESULTS: The analyses identified overlapping but non-identical sets of transferred enzymes. All of these were homologous with sequences from known mobile genetic elements, and many were also in close proximity to transposases and integrases in the EBPR data set. The taxonomic method had higher sensitivity than the phylogenetic method, identifying more potential LGTs. Both analyses identified the putative transfer of five enzymes within an Australian community, two in a Danish community, and none in a US-derived culture. CONCLUSIONS: Our methods were able to identify sequences with unusual phylogenetic or compositional properties as candidate LGT events. The association of these candidates with known mobile elements supports the hypothesis of transfer. The results of our analysis strongly suggest that LGT has influenced the development of functionally important energy-related pathways in EBPR systems, but transfers may be unique to each community due to different operating conditions or taxonomic composition.


Subject(s)
Gene Transfer, Horizontal , Phosphorus/metabolism , Bacteria/enzymology , Bacteria/genetics , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Contig Mapping , Energy Metabolism/genetics , Enzymes/genetics , Enzymes/metabolism , Sewage/microbiology
14.
Bioinformatics ; 30(21): 3123-4, 2014 Nov 01.
Article in English | MEDLINE | ID: mdl-25061070

ABSTRACT

STAMP is a graphical software package that provides statistical hypothesis tests and exploratory plots for analysing taxonomic and functional profiles. It supports tests for comparing pairs of samples or samples organized into two or more treatment groups. Effect sizes and confidence intervals are provided to allow critical assessment of the biological relevancy of test results. A user-friendly graphical interface permits easy exploration of statistical results and generation of publication-quality plots. AVAILABILITY AND IMPLEMENTATION: STAMP is licensed under the GNU GPL. Python source code and binaries are available from our website at: http://kiwi.cs.dal.ca/Software/STAMP.


Subject(s)
Bacteria/classification , Software , Classification/methods , Confidence Intervals , Cyanobacteria/classification , Cyanobacteria/genetics , Data Interpretation, Statistical , Genome, Bacterial , Humans
15.
Syst Biol ; 63(4): 566-81, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24695589

ABSTRACT

Supertree methods reconcile a set of phylogenetic trees into a single structure that is often interpreted as a branching history of species. A key challenge is combining conflicting evolutionary histories that are due to artifacts of phylogenetic reconstruction and phenomena such as lateral gene transfer (LGT). Many supertree approaches use optimality criteria that do not reflect underlying processes, have known biases, and may be unduly influenced by LGT. We present the first method to construct supertrees by using the subtree prune-and-regraft (SPR) distance as an optimality criterion. Although calculating the rooted SPR distance between a pair of trees is NP-hard, our new maximum agreement forest-based methods can reconcile trees with hundreds of taxa and >50 transfers in fractions of a second, which enables repeated calculations during the course of an iterative search. Our approach can accommodate trees in which uncertain relationships have been collapsed to multifurcating nodes. Using a series of benchmark datasets simulated under plausible rates of LGT, we show that SPR supertrees are more similar to correct species histories than supertrees based on parsimony or Robinson-Foulds distance criteria. We successfully constructed an SPR supertree from a phylogenomic dataset of 40,631 gene trees that covered 244 genomes representing several major bacterial phyla. Our SPR-based approach also allowed direct inference of highways of gene transfer between bacterial classes and genera. A small number of these highways connect genera in different phyla and can highlight specific genes implicated in long-distance LGT. [Lateral gene transfer; matrix representation with parsimony; phylogenomics; prokaryotic phylogeny; Robinson-Foulds; subtree prune-and-regraft; supertrees.].


Subject(s)
Bacteria/classification , Classification/methods , Computer Simulation , Phylogeny , Algorithms , Bacteria/genetics , Gene Transfer, Horizontal , Genome, Bacterial/genetics , Reproducibility of Results
16.
Bioinformatics ; 29(15): 1858-64, 2013 Aug 01.
Article in English | MEDLINE | ID: mdl-23732273

ABSTRACT

BACKGROUND: Homology-based taxonomic assignment is impeded by differences between the unassigned read and reference database, forcing a rank-specific classification to the closest (and possibly incorrect) reference lineage. This assignment may be correct only to a general rank (e.g. order) and incorrect below that rank (e.g. family and genus). Algorithms like LCA avoid this by varying the predicted taxonomic rank based on matches to a set of taxonomic references. LCA and related approaches can be conservative, especially if best matches are taxonomically widespread because of events such as lateral gene transfer (LGT). RESULTS: Our extension to LCA called SPANNER (similarity profile annotater) uses the set of best homology matches (the LCA Profile) for a given sequence and compares this profile with a set of profiles inferred from taxonomic reference organisms. SPANNER provides an assignment that is less sensitive to LGT and other confounding phenomena. In a series of trials on real and artificial datasets, SPANNER outperformed LCA-style algorithms in terms of taxonomic precision and outperformed best BLAST at certain levels of taxonomic novelty in the dataset. We identify examples where LCA made an overly conservative prediction, but SPANNER produced a more precise and correct prediction. CONCLUSIONS: By using profiles of homology matches to represent patterns of genomic similarity that arise because of vertical and lateral inheritance, SPANNER offers an effective compromise between taxonomic assignment based on best BLAST scores, and the conservative approach of LCA and similar approaches. AVAILABILITY: C++ source code and binaries are freely available at http://kiwi.cs.dal.ca/Software/SPANNER. CONTACT: beiko@cs.dal.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
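
The basic lowest-common-ancestor (LCA) idea that this abstract builds on can be sketched as the longest shared prefix of the lineages of a read's best matches. This is a generic illustration, not SPANNER and not any specific LCA tool: the function name is hypothetical, and the lineages are simplified toy examples ordered from domain to genus.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of rank-ordered lineages.

    Each lineage is a list of taxon names ordered from the most general
    rank to the most specific one.
    """
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the ranks in parallel and stops at the shortest lineage.
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            shared.append(names_at_rank[0])
        else:
            break
    return shared


# Simplified toy lineages for the best matches of one hypothetical read.
matches = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Vibrionales", "Vibrionaceae", "Vibrio"],
]
print(lowest_common_ancestor(matches))
```

The third match pulls the assignment back to class level even though two of the three matches agree down to family, which illustrates the conservative behaviour the abstract attributes to LCA when matches are taxonomically widespread.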


Subject(s)
Algorithms , Genome, Microbial , Sequence Alignment/methods , Genomics/methods , Metagenome , Phylogeny
17.
Nucleic Acids Res ; 40(14): e111, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22532608

ABSTRACT

Determining the taxonomic lineage of DNA sequences is an important step in metagenomic analysis. Short DNA fragments from next-generation sequencing projects and microbes that lack close relatives in reference sequenced genome databases pose significant problems to taxonomic attribution methods. Our new classification algorithm, RITA (Rapid Identification of Taxonomic Assignments), uses the agreement between composition and homology to accurately classify sequences as short as 50 nt in length by assigning them to different classification groups with varying degrees of confidence. RITA is much faster than the hybrid PhymmBL approach when comparable homology search algorithms are used, and achieves slightly better accuracy than PhymmBL on an artificial metagenome. RITA can also incorporate prior knowledge about taxonomic distributions to increase the accuracy of assignments in data sets with varying degrees of taxonomic novelty, and classified sequences with higher precision than the current best rank-flexible classifier. The accuracy on short reads can be increased by exploiting paired-end information, if available, which we demonstrate on a recently published bovine rumen data set. Finally, we develop a variant of RITA that incorporates accelerated homology search techniques, and generate predictions on a set of human gut metagenomes that were previously assigned to different 'enterotypes'. RITA is freely available in Web server and standalone versions.


Subject(s)
Algorithms , Metagenomics/methods , Sequence Analysis, DNA , Animals , Cattle , Classification/methods , Humans , Ice Cover/microbiology , Metagenome , Rumen/microbiology , Sequence Homology, Nucleic Acid , Stomach/microbiology
18.
Mol Biol Evol ; 29(12): 3947-58, 2012 Dec.
Article in English | MEDLINE | ID: mdl-22915830

ABSTRACT

Environmental drivers of biodiversity can be identified by relating patterns of community similarity to ecological factors. Community variation has traditionally been assessed by considering changes in species composition and more recently by incorporating phylogenetic information to account for the relative similarity of taxa. Here, we describe how an important class of measures including Bray-Curtis, Canberra, and UniFrac can be extended to allow community variation to be computed on a phylogenetic network. We focus on phylogenetic split systems, networks that are produced by the widely used median network and neighbor-net methods, which can represent incongruence in the evolutionary history of a set of taxa. Calculating β diversity over a split system provides a measure of community similarity averaged over uncertainty or conflict in the available phylogenetic signal. Our freely available software, Network Diversity, provides 11 qualitative (presence-absence, unweighted) and 14 quantitative (weighted) network-based measures of community similarity that model different aspects of community richness and evenness. We demonstrate the broad applicability of network-based diversity approaches by applying them to three distinct data sets: pneumococcal isolates from distinct geographic regions, human mitochondrial DNA data from the Indonesian island of Nias, and proteorhodopsin sequences from the Sargasso and Mediterranean Seas. Our results show that major expected patterns of variation for these data sets are recovered using network-based measures, which indicates that these patterns are robust to phylogenetic uncertainty and conflict. Nonetheless, network-based measures of community similarity can differ substantially from measures ignoring phylogenetic relationships or from tree-based measures when incongruent signals are present in the underlying data. Network-based measures provide a methodology for assessing the robustness of β-diversity results in light of incongruent phylogenetic signal and allow β diversity to be calculated over widely used network structures such as median networks.
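For orientation, the classical taxon-based form of one of the measures named above, Bray-Curtis dissimilarity, can be sketched as follows. This is only the ordinary abundance-vector version, not the split-system extension the abstract describes and not code from the Network Diversity software; the function name and the example counts are assumptions.

```python
from typing import Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors.

    Computed as sum(|u_i - v_i|) / sum(u_i + v_i); 0 means identical
    composition, 1 means no shared taxa. Assumes non-negative counts
    listed in the same taxon order for both communities.
    """
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # two empty communities are treated as identical
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical counts for three taxa at two sites.
site_a = [10, 0, 5]
site_b = [5, 5, 5]
print(round(bray_curtis(site_a, site_b), 4))
```

The network-based variants described in the abstract replace the per-taxon terms with terms defined over the splits of a phylogenetic network, so this sketch corresponds to the case where phylogenetic relationships are ignored.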


Subject(s)
Biodiversity , Biota , Genetic Variation , Models, Theoretical , Phylogeny , Software , DNA, Mitochondrial/genetics , Genetics, Population/methods , Humans , Indonesia , Multilocus Sequence Typing , Rhodopsin/genetics , Rhodopsins, Microbial , Streptococcus pneumoniae/genetics
19.
bioRxiv ; 2023 Aug 14.
Article in English | MEDLINE | ID: mdl-37609252

ABSTRACT

Lateral gene transfer (LGT) is an important mechanism for genome diversification in microbial populations, including the human microbiome. While prior work has surveyed LGT events in human-associated microbial isolate genomes, the scope and dynamics of novel LGT events arising in personal microbiomes are not well understood, as there are no widely adopted computational methods to detect, quantify, and characterize LGT from complex microbial communities. We addressed this by developing, benchmarking, and experimentally validating a computational method (WAAFLE) to profile novel LGT events from assembled metagenomes. Applying WAAFLE to >2K human metagenomes from diverse body sites, we identified >100K putative high-confidence but previously uncharacterized LGT events (~2 per assembled microbial genome-equivalent). These events were enriched for mobile elements (as expected), as well as restriction-modification and transport functions typically associated with the destruction of foreign DNA. LGT frequency was quantifiably influenced by biogeography, the phylogenetic similarity of the involved taxa, and the ecological abundance of the donor taxon. These forces manifest as LGT networks in which hub species abundant in a community type donate unequally with their close phylogenetic neighbors. Our findings suggest that LGT may be a more ubiquitous process in the human microbiome than previously described. The open-source WAAFLE implementation, documentation, and data from this work are available at http://huttenhower.sph.harvard.edu/waafle.

20.
Sci Rep ; 13(1): 5210, 2023 03 30.
Article in English | MEDLINE | ID: mdl-36997631

ABSTRACT

Using environmental DNA (eDNA) to monitor biodiversity in aquatic environments is becoming an efficient and cost-effective alternative to other methods such as visual and acoustic identification. Until recently, eDNA sampling was accomplished primarily through manual sampling methods; however, with technological advances, automated samplers are being developed to make sampling easier and more accessible. This paper describes a new eDNA sampler capable of self-cleaning and multi-sample capture and preservation, all within a single unit capable of being deployed by a single person. The first in-field test of this sampler took place in the Bedford Basin, Nova Scotia, Canada alongside parallel samples taken using the typical Niskin bottle collection and post-collection filtration method. Both methods were able to capture the same aquatic microbial community and counts of representative DNA sequences were well correlated between methods with R² values ranging from 0.71 to 0.93. The two collection methods returned the same top 10 families in near identical relative abundance, demonstrating that the sampler was able to capture the same community composition of common microbes as the Niskin. The presented eDNA sampler provides a robust alternative to manual sampling methods, is amenable to autonomous vehicle payload constraints, and will facilitate persistent monitoring of remote and inaccessible sites.
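A between-method agreement statistic of the kind reported above can be sketched as a squared Pearson correlation (R²) between paired sequence counts. The abstract does not say how the authors computed their values, so the formula choice, the function name, and the example counts are all assumptions.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two paired samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical counts of four representative sequences per method.
sampler_counts = [1, 2, 3, 4]
niskin_counts = [1, 3, 2, 4]
print(r_squared(sampler_counts, niskin_counts))
```

A value near 1 indicates that sequences abundant in one collection method are proportionally abundant in the other, which is the sense in which the abstract describes the two methods as well correlated.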


Subject(s)
DNA, Environmental , Microbiota , Humans , DNA, Environmental/genetics , Biodiversity , Filtration , Microbiota/genetics , Nova Scotia , Environmental Monitoring/methods , DNA Barcoding, Taxonomic/methods