1.
mSystems ; 6(2)2021 Mar 16.
Article in English | MEDLINE | ID: mdl-33727399

ABSTRACT

Standard workflows for analyzing microbiomes often include the creation and curation of phylogenetic trees. Here we present EMPress, an interactive web tool for visualizing trees in the context of microbiome, metabolome, and other community data, scalable to trees with well over 500,000 nodes. EMPress provides novel functionality, including ordination integration and animations, alongside many standard tree visualization features and thus simplifies exploratory analyses of many forms of 'omic data. IMPORTANCE: Phylogenetic trees are integral data structures for the analysis of microbial communities. Recent work has also shown the utility of trees constructed from certain metabolomic data sets, further highlighting their importance in microbiome research. The ever-growing scale of modern microbiome surveys has led to numerous challenges in visualizing these data. In this paper we used five diverse data sets to showcase the versatility and scalability of EMPress, an interactive web visualization tool. EMPress addresses the growing need for exploratory analysis tools that can accommodate large, complex multi-omic data sets.

2.
Mol Phylogenet Evol ; 151: 106892, 2020 10.
Article in English | MEDLINE | ID: mdl-32562819

ABSTRACT

Sabellida is a well-known clade containing tube-dwelling annelid worms with a radiolar crown. Iterative phylogenetic analyses over three decades have resulted in three main clades being recognized: Fabriciidae, Serpulidae and Sabellidae, with Fabriciidae proposed as the sister group to Serpulidae. However, relationships within Sabellidae have remained poorly understood, with a proliferation of genera. In order to obtain a robust phylogeny with optimal support, we conducted a large-scale phylogenomic analysis with 19 new sabellid transcriptomes for a total of 21 species. In contrast to earlier findings based on limited DNA data, our results support the position of Fabriciidae as sister taxon to a Sabellidae + Serpulidae clade. Our large sampling within Sabellidae also allows us to establish a stable phylogeny within this clade. We restrict Sabellinae to a subclade of Sabellidae and broaden the previously monotypic Myxicolinae to include Amphicorina and Chone. We tested the robustness of species tree reconstruction by subsampling increasing numbers of genes to uncover hidden support of alternative topologies. Our results show that inclusion of more genes leads to a more stable topology with higher support, and also that including higher divergence genes leads to stronger resolution.


Subject(s)
Annelida/genetics , Genetic Loci , Phylogeny , Animals , Annelida/classification , Data Analysis , Likelihood Functions , Species Specificity , Transcriptome/genetics
3.
Nat Commun ; 10(1): 5477, 2019 12 02.
Article in English | MEDLINE | ID: mdl-31792218

ABSTRACT

Rapid growth of genome data provides opportunities for updating microbial evolutionary relationships, but this is challenged by the discordant evolution of individual genes. Here we build a reference phylogeny of 10,575 evenly-sampled bacterial and archaeal genomes, based on a comprehensive set of 381 markers, using multiple strategies. Our trees indicate remarkably closer evolutionary proximity between Archaea and Bacteria than previous estimates that were limited to fewer "core" genes, such as the ribosomal proteins. The robustness of the results was tested with respect to several variables, including taxon and site sampling, amino acid substitution heterogeneity and saturation, non-vertical evolution, and the impact of exclusion of candidate phyla radiation (CPR) taxa. Our results provide an updated view of domain-level relationships.


Subject(s)
Archaea/classification , Bacteria/classification , Evolution, Molecular , Genome, Archaeal , Genome, Bacterial , Phylogeny , Archaea/genetics , Bacteria/genetics
4.
Bioinformatics ; 35(14): i31-i40, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510701

ABSTRACT

MOTIVATION: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from the microbiome. Data augmentation consists of building synthetic samples and adding them to the training data, and is a technique that has proved helpful for many machine learning tasks. RESULTS: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low sample size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. AVAILABILITY AND IMPLEMENTATION: TADA is available at https://github.com/tada-alg/TADA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Microbiota , Phylogeny , Machine Learning , Phenotype
5.
Mol Phylogenet Evol ; 130: 286-296, 2019 01.
Article in English | MEDLINE | ID: mdl-30393186

ABSTRACT

Genome-wide phylogeny reconstruction is becoming increasingly common, and one driving factor behind these phylogenomic studies is the promise that the potential discordance between gene trees and the species tree can be modeled. Incomplete lineage sorting is one cause of discordance that bridges population genetic and phylogenetic processes. ASTRAL is a species tree reconstruction method that seeks to find the tree with minimum quartet distance to an input set of inferred gene trees. However, the published ASTRAL algorithm only works with one sample per species. To account for polymorphisms in present-day species, one can sample multiple individuals per species to create multi-allele datasets. Here, we introduce how ASTRAL can handle multi-allele datasets. We show that the quartet-based optimization problem extends naturally, and we introduce heuristic methods for building the search space specifically for the case of multi-individual datasets. We study the accuracy and scalability of the multi-individual version of ASTRAL-III using extensive simulation studies and compare it to NJst, the only other scalable method that can handle these datasets. We do not find strong evidence that using multiple individuals dramatically improves accuracy. When we study the trade-off between sampling more genes versus more individuals, we find that sampling more genes is more effective than sampling more individuals, even under conditions that we study where trees are shallow (median length: ≈1Ne) and ILS is extremely high.


Subject(s)
Alleles , Genomics/methods , Phylogeny , Algorithms , Computer Simulation , Databases, Genetic , Species Specificity
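
The quartet-based objective mentioned in the abstract above (finding the species tree with minimum quartet distance to the gene trees, or equivalently the one sharing the most quartet topologies with them) can be illustrated with a brute-force sketch. This is not the ASTRAL algorithm, which uses dynamic programming over a constrained search space; it only scores one candidate tree. The nested-tuple tree representation and the function names `clades`, `quartet_topology`, and `quartet_score` are assumptions made for this example, and all trees are assumed to share the same leaf set with one sample per species.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf_set, clades) for a tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def quartet_topology(tree_clades, quartet):
    """Return the split of `quartet` induced by the tree as a frozenset of two
    pairs, or None if the tree leaves the quartet unresolved (a polytomy)."""
    q = frozenset(quartet)
    for clade in tree_clades:
        inside = q & clade
        if len(inside) == 2:
            # The branch above this clade separates two quartet leaves from the other two.
            return frozenset([inside, q - inside])
    return None


def quartet_score(species_tree, gene_trees):
    """Count quartet topologies shared between the species tree and each gene tree."""
    leaves, species_clades = clades(species_tree)
    gene_clades = [clades(g)[1] for g in gene_trees]
    score = 0
    for quartet in combinations(sorted(leaves), 4):
        expected = quartet_topology(species_clades, quartet)
        if expected is None:
            continue
        score += sum(quartet_topology(gc, quartet) == expected for gc in gene_clades)
    return score


species = (("A", "B"), ("C", ("D", "E")))
gene2 = (("A", "C"), ("B", ("D", "E")))
print(quartet_score(species, [species, gene2]))  # 5 shared quartets + 3 shared quartets = 8
```

Enumerating every quartet costs time proportional to the fourth power of the number of leaves, so this form is only useful for checking small examples by hand.
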
6.
BMC Bioinformatics ; 19(Suppl 6): 153, 2018 05 08.
Article in English | MEDLINE | ID: mdl-29745866

ABSTRACT

BACKGROUND: Evolutionary histories can be discordant across the genome, and such discordances need to be considered in reconstructing the species phylogeny. ASTRAL is one of the leading methods for inferring species trees from gene trees while accounting for gene tree discordance. ASTRAL uses dynamic programming to search for the tree that shares the maximum number of quartet topologies with input gene trees, restricting itself to a predefined set of bipartitions. RESULTS: We introduce ASTRAL-III, which substantially improves the running time of ASTRAL-II and guarantees polynomial running time as a function of both the number of species (n) and the number of genes (k). ASTRAL-III limits the bipartition constraint set (X) to grow at most linearly with n and k. Moreover, it handles polytomies more efficiently than ASTRAL-II, exploits similarities between gene trees better, and uses several techniques to avoid searching parts of the search space that are mathematically guaranteed not to include the optimal tree. The asymptotic running time of ASTRAL-III in the presence of polytomies is [Formula: see text] where D=O(nk) is the sum of degrees of all unique nodes in input trees. The running time improvements enable us to test whether contracting low support branches in gene trees improves the accuracy by reducing noise. In extensive simulations, we show that removing branches with very low support (e.g., below 10%) improves accuracy while overly aggressive filtering is harmful. We observe on a biological avian phylogenomic dataset of 14K genes that contracting low support branches greatly improves results. CONCLUSIONS: ASTRAL-III is a faster version of the ASTRAL method for phylogenetic reconstruction and can scale up to 10,000 species. With ASTRAL-III, low support branches can be removed, resulting in improved accuracy.


Subject(s)
Algorithms , Phylogeny , Animals , Birds/classification , Birds/genetics , Computer Simulation , Databases, Genetic , Models, Genetic , Species Specificity , Time Factors
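
The branch-contraction step described in the abstract above (collapsing gene tree branches whose support falls below a threshold such as 10%) can be sketched as follows. The tree representation (leaves as strings, internal nodes as dicts with `support` and `children` keys) and the function name `contract_low_support` are hypothetical choices for this example; the abstract does not say which tool or data structure the authors used.

```python
def contract_low_support(node, threshold):
    """Return a copy of the tree in which every internal branch whose support
    is below `threshold` is collapsed into its parent, creating a polytomy.

    Leaves are strings; internal nodes are dicts of the (assumed) form
    {"support": float or None, "children": [...]}.
    """
    if isinstance(node, str):
        return node
    new_children = []
    for child in node["children"]:
        child = contract_low_support(child, threshold)
        weak = (
            not isinstance(child, str)
            and child["support"] is not None
            and child["support"] < threshold
        )
        if weak:
            # Splice the grandchildren directly into this node.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    return {"support": node["support"], "children": new_children}


gene_tree = {
    "support": None,  # the root has no branch above it
    "children": [
        {"support": 5, "children": ["A", "B"]},
        {"support": 95, "children": ["C", {"support": 8, "children": ["D", "E"]}]},
    ],
}
contracted = contract_low_support(gene_tree, threshold=10)
```

With a threshold of 10, the two weakly supported branches (support 5 and 8) are collapsed while the branch with support 95 is kept, so the result still groups C, D and E together but no longer resolves the relationships inside that group.
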
7.
Genes (Basel) ; 9(3)2018 Feb 28.
Article in English | MEDLINE | ID: mdl-29495636

ABSTRACT

Phylogenetic species trees typically represent the speciation history as a bifurcating tree. Speciation events that simultaneously create more than two descendants, thereby creating polytomies in the phylogeny, are possible. Moreover, the inability to resolve relationships is often shown as a (soft) polytomy. Both types of polytomies have been traditionally studied in the context of gene tree reconstruction from sequence data. However, polytomies in the species tree cannot be detected or ruled out without considering gene tree discordance. In this paper, we describe a statistical test based on properties of the multi-species coalescent model to test the null hypothesis that a branch in an estimated species tree should be replaced by a polytomy. On both simulated and biological datasets, we show that the null hypothesis is rejected for all but the shortest branches, and in most cases, it is retained for true polytomies. The test, available as part of the Accurate Species TRee ALgorithm (ASTRAL) package, can help systematists decide whether their datasets are sufficient to resolve specific relationships of interest.
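
The null hypothesis in the abstract above has a simple consequence under the multi-species coalescent: if a branch has zero length (a true polytomy), the three possible quartet topologies around it are expected to appear equally often among gene trees. One way to test that, assumed here for illustration and not confirmed by the abstract as the procedure implemented in ASTRAL, is a chi-square goodness-of-fit test against equal frequencies. The function name `polytomy_p_value` is hypothetical, and the counts are treated as independent observations, which is a simplification.

```python
import math


def polytomy_p_value(q1, q2, q3):
    """P-value for the null hypothesis that three quartet topology counts
    around a branch are drawn with equal probability 1/3 each."""
    total = q1 + q2 + q3
    if total <= 0:
        raise ValueError("need at least one quartet observation")
    expected = total / 3.0
    chi2 = sum((q - expected) ** 2 / expected for q in (q1, q2, q3))
    # Three categories give 2 degrees of freedom, and the chi-square
    # survival function with 2 degrees of freedom is exactly exp(-x / 2).
    return math.exp(-chi2 / 2.0)


balanced = polytomy_p_value(34, 33, 33)  # close to 1: polytomy not rejected
skewed = polytomy_p_value(90, 5, 5)      # very small: polytomy rejected
```

A small p-value suggests the branch is supported by the gene trees; a large one means the data cannot rule out a polytomy, which matches the abstract's point that the test helps decide whether a dataset is sufficient to resolve a given relationship.
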

8.
Mol Phylogenet Evol ; 122: 110-115, 2018 05.
Article in English | MEDLINE | ID: mdl-29421312

ABSTRACT

Phylogenomics has ushered in an age of discordance. Analyses often reveal abundant discordances among phylogenies of different parts of genomes, as well as incongruences between species trees obtained using different methods or data partitions. Researchers are often left trying to make sense of such incongruences. Interpretive ways of measuring and visualizing discordance are needed, both among alternative species trees and gene trees, especially for specific focal branches of a tree. Here, we introduce DiscoVista, a publicly available tool that creates a suite of simple but interpretable visualizations. DiscoVista helps quantify the amount of discordance and some of its potential causes.


Subject(s)
Classification/methods , Software , Genome , Models, Genetic , Phylogeny
9.
Mol Biol Evol ; 34(12): 3279-3291, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-29029241

ABSTRACT

Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.


Subject(s)
Genomics/methods , Phylogeny , Sequence Analysis, Protein/methods , Algorithms , Animals , Computer Simulation , Genetic Speciation , Genome , Insecta/genetics , Models, Genetic , Peptide Fragments/genetics
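
The filtering strategy described in the abstract above (dropping individual fragmentary sequences from a gene alignment while keeping the rest of the gene) can be sketched as below. The abstract does not define "fragmentary", so the criterion used here, the fraction of non-gap characters relative to the alignment length, along with the 0.5 cutoff and the function name `drop_fragmentary`, are assumptions made for illustration.

```python
def drop_fragmentary(alignment, min_fraction=0.5, gap_chars="-?"):
    """Return a copy of one gene alignment without fragmentary sequences.

    `alignment` maps species names to aligned sequences of equal length.
    A sequence is kept when at least `min_fraction` of its columns hold a
    real character rather than a gap or missing-data symbol.
    """
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # an empty sequence carries no data at all
        occupied = sum(ch not in gap_chars for ch in seq)
        if occupied / len(seq) >= min_fraction:
            kept[name] = seq
    return kept


gene = {
    "sp1": "ACGTACGT",
    "sp2": "AC------",  # only 2 of 8 columns occupied
    "sp3": "ACGTAC--",
}
filtered = drop_fragmentary(gene)  # keeps sp1 and sp3
```

Applying the filter per gene rather than per species means a species is only lost from the genes where its sequence is poor, so it can still contribute to the species tree through its other genes.
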
10.
PLoS One ; 12(8): e0182238, 2017.
Article in English | MEDLINE | ID: mdl-28800608

ABSTRACT

Phylogenetic trees inferred using commonly-used models of sequence evolution are unrooted, but the root position matters both for interpretation and downstream applications. This issue has been long recognized; however, whether the potential for discordance between the species tree and gene trees impacts methods of rooting a phylogenetic tree has not been extensively studied. In this paper, we introduce a new method of rooting a tree based on its branch length distribution; our method, which minimizes the variance of root to tip distances, is inspired by the traditional midpoint rerooting and is justified when deviations from the strict molecular clock are random. Like midpoint rerooting, the method can be implemented in a linear time algorithm. In extensive simulations that consider discordance between gene trees and the species tree, we show that the new method is more accurate than midpoint rerooting, but its relative accuracy compared to using outgroups to root gene trees depends on the size of the dataset and levels of deviations from the strict clock. We show high levels of error for all methods of rooting estimated gene trees due to factors that include effects of gene tree discordance, deviations from the clock, and gene tree estimation error. Our simulations, however, did not reveal significant differences between two equivalent methods for species tree estimation that use rooted and unrooted input, namely, STAR and NJst. Nevertheless, our results point to limitations of existing scalable rooting methods.


Subject(s)
Algorithms , Phylogeny , Computer Simulation , Databases as Topic , Genes , Species Specificity
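
The rooting criterion in the abstract above, choosing the root position that minimizes the variance of root-to-tip distances, can be sketched with a simple quadratic-time search. The abstract states that a linear-time algorithm exists; this sketch does not reproduce it. For a root placed at distance x from node u along an edge (u, v), tips on u's side lie at distance d + x and tips on v's side at distance d - x (with d measured from u), so the variance is a quadratic function of x with a closed-form minimum. The edge-list representation and the name `min_variance_root` are assumptions for this example.

```python
from collections import defaultdict


def min_variance_root(edges):
    """Return (u, v, x, variance): the root lies on edge (u, v) at distance x
    from u and minimizes the variance of root-to-tip distances.

    `edges` is a list of (node, node, length) triples describing an
    unrooted tree; tips are the nodes with exactly one neighbor.
    """
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))

    def tip_distances(start, blocked):
        """Distances from `start` to every tip on its side of the edge to `blocked`."""
        dist, stack = {}, [(start, blocked, 0.0)]
        while stack:
            node, parent, d = stack.pop()
            if len(adj[node]) == 1:
                dist[node] = d
            for nxt, length in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, d + length))
        return dist

    best = None
    for u, v, length in edges:
        side_u = tip_distances(u, v)
        side_v = tip_distances(v, u)
        # Distance of every tip from u, and the sign with which x enters it.
        d = list(side_u.values()) + [b + length for b in side_v.values()]
        s = [1.0] * len(side_u) + [-1.0] * len(side_v)
        n = len(d)
        mean_d, mean_s = sum(d) / n, sum(s) / n
        cov = sum((di - mean_d) * (si - mean_s) for di, si in zip(d, s)) / n
        var_s = sum((si - mean_s) ** 2 for si in s) / n
        # Minimizer of Var(d + s*x), clamped so the root stays on the edge.
        x = min(max(-cov / var_s, 0.0), length)
        t = [di + si * x for di, si in zip(d, s)]
        mean_t = sum(t) / n
        variance = sum((ti - mean_t) ** 2 for ti in t) / n
        if best is None or variance < best[3]:
            best = (u, v, x, variance)
    return best


# A clock-like tree: the true root sits in the middle of the X-Y edge.
tree = [("A", "X", 1.0), ("B", "X", 1.0), ("X", "Y", 2.0), ("C", "Y", 1.0), ("D", "Y", 1.0)]
u, v, x, variance = min_variance_root(tree)
```

On this clock-like example every tip is equidistant from the midpoint of the X-Y edge, so the search returns that edge with x = 1 and zero variance.
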
11.
Mol Biol Evol ; 33(7): 1654-68, 2016 07.
Article in English | MEDLINE | ID: mdl-27189547

ABSTRACT

Species tree reconstruction is complicated by effects of incomplete lineage sorting, commonly modeled by the multi-species coalescent model (MSC). While there has been substantial progress in developing methods that estimate a species tree given a collection of gene trees, less attention has been paid to fast and accurate methods of quantifying support. In this article, we propose a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees. We then show how the quartet support can be used in the context of the MSC to compute (1) the local posterior probability (PP) that the branch is in the species tree and (2) the length of the branch in coalescent units. We evaluate the precision and recall of the local PP on a wide set of simulated and biological datasets, and show that it has very high precision and improved recall compared with multi-locus bootstrapping. The estimated branch lengths are highly accurate when gene tree estimation error is low, but are underestimated when gene tree estimation error increases. Computation of both the branch length and local PP is implemented as new features in ASTRAL.


Subject(s)
Computational Biology/methods , Genomics/methods , Models, Genetic , Algorithms , Bayes Theorem , Computer Simulation , Genetic Speciation , Phylogeny , Probability
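
The link between quartet support and branch length in coalescent units, mentioned in the abstract above, rests on a standard multi-species coalescent identity: for an internal branch of length d, a gene tree quartet matches the species tree with probability 1 - (2/3)·exp(-d). Inverting that identity gives a simple point estimate from the observed fraction of matching quartets. This is a sketch of the underlying relationship, stated from general knowledge rather than from the abstract, and it should not be read as the exact estimator implemented in ASTRAL; the function name `coalescent_branch_length` is hypothetical.

```python
import math


def coalescent_branch_length(matching, total):
    """Estimate an internal branch length in coalescent units from the number
    of gene tree quartets that match the species tree around that branch."""
    if total <= 0 or not 0 <= matching <= total:
        raise ValueError("need 0 <= matching <= total and total > 0")
    f = matching / total
    if f <= 1.0 / 3.0:
        return 0.0  # no more agreement than expected for a zero-length branch
    if f >= 1.0:
        return math.inf  # no discordance observed, so the length is unbounded
    return -math.log(1.5 * (1.0 - f))


length = coalescent_branch_length(2, 3)  # two thirds matching gives ln(2)
```

The formula also shows why the abstract reports underestimated branch lengths when gene tree estimation error is high: estimation error adds discordance, which lowers the matching fraction and therefore the inferred length.
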
12.
BMC Genomics ; 17(Suppl 10): 783, 2016 11 11.
Article in English | MEDLINE | ID: mdl-28185574

ABSTRACT

BACKGROUND: Inferring species trees from gene trees using the coalescent-based summary methods has been the subject of much attention, yet new scalable and accurate methods are needed. RESULTS: We introduce DISTIQUE, a new statistically consistent summary method for inferring species trees from gene trees under the coalescent model. We generalize our results to arbitrary phylogenetic inference problems; we show that two arbitrarily chosen leaves, called anchors, can be used to estimate relative distances between all other pairs of leaves by inferring relevant quartet trees. This results in a family of distance-based tree inference methods, with running times ranging from quadratic to quartic in the number of leaves. CONCLUSIONS: We show in simulated studies that DISTIQUE has comparable accuracy to leading coalescent-based summary methods and reduced running times.


Subject(s)
Algorithms , Animals , Birds/classification , Birds/genetics , Databases, Genetic , Mammals/classification , Mammals/genetics , Models, Genetic , Phylogeny