1.
Nature ; 629(8013): 851-860, 2024 May.
Article in English | MEDLINE | ID: mdl-38560995

ABSTRACT

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions1-3. Here we address these issues by analysing the genomes of 363 bird species4 (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.


Subject(s)
Birds , Evolution, Molecular , Genome , Phylogeny , Animals , Birds/genetics , Birds/classification , Birds/anatomy & histology , Brain/anatomy & histology , Extinction, Biological , Genome/genetics , Genomics , Population Density , Male , Female
2.
Mol Biol Evol ; 41(1)2024 Jan 03.
Article in English | MEDLINE | ID: mdl-38124381

ABSTRACT

MOTIVATION: Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical. RESULTS: Our results show that we can distinguish between empirical and simulated MSAs with high accuracy using two distinct and independently developed classification approaches across all tested models of sequence evolution. Our findings suggest that the current state-of-the-art models fail to accurately replicate several aspects of empirical MSAs, including site-wise rates as well as amino acid and nucleotide composition.


Subject(s)
Neural Networks, Computer , Proteins , Phylogeny , Sequence Alignment , Proteins/genetics , DNA/genetics , Software
3.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38514421

ABSTRACT

MOTIVATION: Genomes are a rich source of information on the pattern and process of evolution across biological scales. How best to make use of that information is an active area of research in phylogenetics. Ideally, phylogenetic methods should not only model substitutions along gene trees, which explain differences between homologous gene sequences, but also the processes that generate the gene trees themselves along a shared species tree. To conduct accurate inferences, one needs to account for uncertainty at both levels, that is, in gene trees estimated from inherently short sequences and in their diverse evolutionary histories along a shared species tree. RESULTS: We present AleRax, a software that can infer reconciled gene trees together with a shared species tree using a simple, yet powerful, probabilistic model of gene duplication, transfer, and loss. A key feature of AleRax is its ability to account for uncertainty in the gene tree and its reconciliation by using an efficient approximation to calculate the joint phylogenetic-reconciliation likelihood and sample reconciled gene trees accordingly. Simulations and analyses of empirical data show that AleRax is one order of magnitude faster than competing gene tree inference tools while attaining the same accuracy. It is consistently more robust than species tree inference methods such as SpeciesRax and ASTRAL-Pro 2 under gene tree uncertainty. Finally, AleRax can process multiple gene families in parallel thereby allowing users to compare competing phylogenetic hypotheses and estimate model parameters, such as duplication, transfer, and loss probabilities for genome-scale datasets with hundreds of taxa. AVAILABILITY AND IMPLEMENTATION: GNU GPL at https://github.com/BenoitMorel/AleRax and data are made available at https://cme.h-its.org/exelixis/material/alerax_data.tar.gz.


Subject(s)
Algorithms , Gene Duplication , Phylogeny , Software , Models, Statistical , Evolution, Molecular
4.
Mol Biol Evol ; 40(10)2023 10 04.
Article in English | MEDLINE | ID: mdl-37804116

ABSTRACT

Phylogenetic inferences under the maximum likelihood criterion deploy heuristic tree search strategies to explore the vast search space. Depending on the input dataset, searches from different starting trees might all converge to a single tree topology. Often, though, distinct searches infer multiple topologies with large log-likelihood score differences or yield topologically highly distinct, yet almost equally likely, trees. Recently, Haag et al. introduced an approach to quantify, and implemented machine learning methods to predict, the dataset difficulty with respect to phylogenetic inference. Easy multiple sequence alignments (MSAs) exhibit a single likelihood peak on their likelihood surface, associated with a single tree topology to which most, if not all, independent searches rapidly converge. As difficulty increases, multiple locally optimal likelihood peaks emerge, yet from highly distinct topologies. To make use of this information, we introduce and implement an adaptive tree search heuristic in RAxML-NG, which modifies the thoroughness of the tree search strategy as a function of the predicted difficulty. Our adaptive strategy is based upon three observations. First, on easy datasets, searches converge rapidly and can hence be terminated at an earlier stage. Second, overanalyzing difficult datasets is hopeless, and thus it suffices to quickly infer only one of the numerous almost equally likely topologies to reduce overall execution time. Third, more extensive searches are justified and required on datasets with intermediate difficulty. While the likelihood surface exhibits multiple locally optimal peaks in this case, a small proportion of them is significantly better. Our experimental results for the adaptive heuristic on 9,515 empirical and 5,000 simulated datasets with varying difficulty exhibit substantial speedups, especially on easy and difficult datasets (53% of total MSAs), where we observe average speedups of more than 10×. Further, approximately 94% of the inferred trees using the adaptive strategy are statistically indistinguishable from the trees inferred under the standard strategy (RAxML-NG).
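
Illustrative note: the three observations in this abstract amount to a mapping from predicted difficulty to search effort. The sketch below shows that mapping only. It assumes the difficulty is a score in [0, 1], and the function name and both thresholds are hypothetical placeholders, not values taken from the paper or from RAxML-NG.

```python
def choose_search_regime(difficulty, easy_max=0.3, hard_min=0.7):
    """Map a predicted difficulty score to a coarse search regime.

    Assumptions (not from the abstract): the score lies in [0, 1] and the
    two thresholds are arbitrary illustrative cut-offs.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    if difficulty < easy_max:
        # Easy: searches converge rapidly, so stop at an earlier stage.
        return "fast"
    if difficulty > hard_min:
        # Difficult: many near-equally likely trees, so infer just one quickly.
        return "fast"
    # Intermediate: a few peaks are significantly better, so search more widely.
    return "thorough"


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, choose_search_regime(score))
```

Both extremes map to the cheaper regime for different reasons, which matches the abstract's report that the speedups are concentrated on easy and difficult datasets.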


Subject(s)
Algorithms , Phylogeny , Likelihood Functions , Sequence Alignment
5.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36576010

ABSTRACT

MOTIVATION: Missing data and incomplete lineage sorting (ILS) are two major obstacles to accurate species tree inference. Gene tree summary methods such as ASTRAL and ASTRID have been developed to account for ILS. However, they can be severely affected by high levels of missing data. RESULTS: We present Asteroid, a novel algorithm that infers an unrooted species tree from a set of unrooted gene trees. We show on both empirical and simulated datasets that Asteroid is substantially more accurate than ASTRAL and ASTRID for very high proportions (>80%) of missing data. Asteroid is several orders of magnitude faster than ASTRAL for datasets that contain thousands of genes. It offers advanced features such as parallelization, support value computation and support for multi-copy and multifurcating gene trees. AVAILABILITY AND IMPLEMENTATION: Asteroid is freely available at https://github.com/BenoitMorel/Asteroid. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genetic Speciation , Genomics , Phylogeny , Computer Simulation , Algorithms , Models, Genetic
6.
Mol Phylogenet Evol ; 197: 108091, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38719080

ABSTRACT

Cryptic diversity poses a great obstacle in our attempts to assess the current biodiversity crisis and may hamper conservation efforts. The gekkonid genus Mediodactylus, a well-known case of hidden species and genetic diversity, has been taxonomically reclassified several times during the last decade. Focusing on the Mediterranean populations, a recent study within the M. kotschyi species complex using classic mtDNA/nuDNA markers suggested the existence of five distinct species, some being endemic and some possibly threatened, yet their relationships have not been fully resolved. Here, we generated genome-wide SNPs (using ddRADseq) and applied molecular species delimitation approaches and population genomic analyses to further disentangle these relationships. The most extensive nuclear dataset, so far, encompassing 2,360 loci and ∼699,000 bp from across the genome of Mediodactylus gecko, enabled us to resolve previously obscure phylogenetic relationships among the five, recently elevated, Mediodactylus species and to support the hypothesis that the taxon includes several new, undescribed species. Population genomic analyses within each of the proposed species showed strong genetic structure and high levels of genetic differentiation among populations.


Subject(s)
Lizards , Phylogeny , Phylogeography , Animals , Mediterranean Region , Lizards/genetics , Lizards/classification , Polymorphism, Single Nucleotide , Genetic Variation , Genetics, Population , DNA, Mitochondrial/genetics , Sequence Analysis, DNA
7.
Syst Biol ; 72(1): 242-248, 2023 05 19.
Article in English | MEDLINE | ID: mdl-36705582

ABSTRACT

Computing ancestral ranges via the Dispersal-Extinction-Cladogenesis (DEC) model of biogeography is characterized by an exponential number of states relative to the number of regions considered. This is because the DEC model requires computing a large matrix exponential, which typically accounts for up to 80% of overall runtime. Therefore, the kinds of biogeographical analyses that can be conducted under the DEC model are limited by the number of regions under consideration. In this work, we present a completely redesigned efficient version of the popular tool Lagrange which is up to 49 times faster with multithreading enabled, and is also 26 times faster when using only one thread. We call this new version Lagrange-NG (Lagrange-Next Generation). The increased computational efficiency allows Lagrange-NG to analyze datasets with a large number of regions in a reasonable amount of time, up to 12 regions in approximately 18 min. We achieve these speedups using a relatively new method of computing the matrix exponential based on Krylov subspaces. In order to validate the correctness of Lagrange-NG, we also introduce a novel metric on range distributions for trees so that researchers can assess the difference between any two range inferences. Finally, Lagrange-NG exhibits substantially higher adherence to coding quality standards. It improves a respective software quality indicator as implemented in the SoftWipe tool from average (5.5; Lagrange) to high (7.8; Lagrange-NG). Lagrange-NG is freely available under GPL2. [Biogeography; Phylogenetics; DEC Model.].
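
Illustrative note: the exponential growth mentioned in this abstract follows from treating every subset of the regions as a candidate range. The sketch below counts those subsets and the entries of a dense rate matrix over them. It is not Lagrange-NG code. Counting the empty range as a state and placing no cap on range size are assumptions of the sketch, and the function names are hypothetical.

```python
from itertools import combinations


def enumerate_ranges(regions):
    """Return every subset of `regions` as a tuple, the empty range included."""
    ranges = []
    for size in range(len(regions) + 1):
        ranges.extend(combinations(regions, size))
    return ranges


def dense_matrix_entries(n_regions):
    """Number of entries in a dense (2^n x 2^n) rate matrix over all ranges."""
    n_states = 2 ** n_regions
    return n_states * n_states


if __name__ == "__main__":
    for n in (3, 6, 12):
        print(n, "regions:", 2 ** n, "states,", dense_matrix_entries(n), "matrix entries")
```

Under these assumptions, 12 regions give 4,096 states and a matrix with about 16.8 million entries, which is why the cost of the matrix exponential dominates as regions are added.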


Subject(s)
Software , Phylogeny
8.
Mol Biol Evol ; 39(12)2022 12 05.
Article in English | MEDLINE | ID: mdl-36395091

ABSTRACT

Phylogenetic analyses under the Maximum-Likelihood (ML) model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a dataset and present Pythia, a Random Forest Regressor that accurately predicts this difficulty. Pythia predicts the degree of difficulty of analyzing a dataset prior to initiating ML-based tree inferences. Pythia can be used to increase user awareness with respect to the amount of signal and uncertainty to be expected in phylogenetic analyses, and hence inform an appropriate (post-)analysis setup. Further, it can be used to select appropriate search algorithms for easy-, intermediate-, and hard-to-analyze datasets.


Subject(s)
Models, Genetic , Phylogeny , Likelihood Functions , Random Forest
9.
Mol Biol Evol ; 39(2)2022 02 03.
Article in English | MEDLINE | ID: mdl-35021210

ABSTRACT

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.


Subject(s)
Algorithms , Gene Duplication , Models, Genetic , Pedigree , Phylogeny
10.
Bioinformatics ; 38(15): 3725-3733, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35713506

ABSTRACT

MOTIVATION: Phylogenetic networks can represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity and current tools can only analyze small datasets. RESULTS: We present NetRAX, a tool for maximum likelihood (ML) inference of phylogenetic networks in the absence of ILS. Our tool leverages state-of-the-art methods for efficiently computing the phylogenetic likelihood function on trees, and extends them to phylogenetic networks via the notion of 'displayed trees'. NetRAX can infer ML phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format. On simulated data, our results show a very low relative difference in Bayesian Information Criterion (BIC) score and a near-zero unrooted softwired cluster distance to the true, simulated networks. With NetRAX, a network inference on a partitioned alignment with 8000 sites, 30 taxa and 3 reticulations completes within a few minutes on a standard laptop. AVAILABILITY AND IMPLEMENTATION: Our implementation is available under the GNU General Public License v3.0 at https://github.com/lutteropp/NetRAX. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
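
Illustrative note: the Bayesian Information Criterion used in this abstract to compare inferred and true networks has the standard form BIC = k ln(n) − 2 ln(L). The sketch below computes it along with a relative difference between two scores. How NetRAX counts parameters and sample size, and the exact form of its relative difference, are not stated in the abstract, so the definitions here are assumptions.

```python
import math


def bic(log_likelihood, n_params, sample_size):
    """Standard BIC: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(sample_size) - 2.0 * log_likelihood


def relative_difference(inferred, reference):
    """Signed difference relative to the reference score (assumed definition)."""
    return (inferred - reference) / abs(reference)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    true_score = bic(log_likelihood=-52000.0, n_params=70, sample_size=8000)
    inferred_score = bic(log_likelihood=-52010.0, n_params=68, sample_size=8000)
    print(relative_difference(inferred_score, true_score))
```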


Subject(s)
Algorithms , Phylogeny , Bayes Theorem , Sequence Alignment , Likelihood Functions
11.
Bioinformatics ; 38(6): 1741-1742, 2022 03 04.
Article in English | MEDLINE | ID: mdl-34962976

ABSTRACT

SUMMARY: The assessment of novel phylogenetic models and inference methods is routinely being conducted via experiments on simulated as well as empirical data. When generating synthetic data it is often unclear how to set simulation parameters for the models and generate trees that appropriately reflect empirical model parameter distributions and tree shapes. As a solution, we present and make available a new database called 'RAxML Grove' currently comprising more than 60 000 inferred trees and respective model parameter estimates from fully anonymized empirical datasets that were analyzed using RAxML and RAxML-NG on two web servers. We also describe and make available two simple applications of RAxML Grove to exemplify its usage and highlight its utility for designing realistic simulation studies and analyzing empirical model parameter and tree shape distributions. AVAILABILITY AND IMPLEMENTATION: RAxML Grove is freely available at https://github.com/angtft/RAxMLGrove. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computers , Software , Phylogeny , Computer Simulation , Databases, Factual
12.
Bioinformatics ; 38(Suppl 1): i118-i124, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758778

ABSTRACT

MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. RESULTS: Here, we propose an artificial-intelligence-based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running-time substantially while retaining the same tree-search performance. AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Artificial Intelligence , Software , Likelihood Functions , Phylogeny
13.
J Eukaryot Microbiol ; 70(5): e12990, 2023.
Article in English | MEDLINE | ID: mdl-37448139

ABSTRACT

Taxonomic assignment of operational taxonomic units (OTUs) is an important bioinformatics step in analyzing environmental sequencing data. Pairwise alignment and phylogenetic-placement methods represent two alternative approaches to taxonomic assignments, but their results can differ. Here we used available colpodean ciliate OTUs from forest soils to compare the taxonomic assignments of VSEARCH (which performs pairwise alignments) and EPA-ng (which performs phylogenetic placements). We showed that when there are differences in taxonomic assignments between pairwise alignments and phylogenetic placements at the subtaxon level, there is a low pairwise similarity of the OTUs to the reference database. We then showcase how the output of EPA-ng can be further evaluated using GAPPA to assess the taxonomic assignments when there exist multiple equally likely placements of an OTU, by taking into account the sum over the likelihood weights of the OTU placements within a subtaxon, and the branch distances between equally likely placement locations. We also inferred the evolutionary and ecological characteristics of the colpodean OTUs using their placements within subtaxa. This study demonstrates how to fully analyze the output of EPA-ng, by using GAPPA in conjunction with knowledge of the taxonomic diversity of the clade of interest.
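
Illustrative note: the evaluation step described in this abstract sums the likelihood weights of an OTU's placements within each subtaxon. The sketch below shows that accumulation. It is not GAPPA code: the flat list of (subtaxon, weight) pairs is a hypothetical simplification of placement output, and the clade labels are invented.

```python
from collections import defaultdict


def accumulate_weights(placements):
    """Sum likelihood weights per subtaxon for a single OTU.

    `placements` is a list of (subtaxon_label, likelihood_weight) pairs,
    a hypothetical layout chosen for this sketch.
    """
    totals = defaultdict(float)
    for subtaxon, weight in placements:
        totals[subtaxon] += weight
    return dict(totals)


def best_subtaxon(placements):
    """Return the subtaxon with the largest accumulated weight and that weight."""
    totals = accumulate_weights(placements)
    label = max(totals, key=totals.get)
    return label, totals[label]


if __name__ == "__main__":
    otu = [("cladeA", 0.30), ("cladeB", 0.40), ("cladeA", 0.25)]
    print(best_subtaxon(otu))
```

In this invented example the single most likely placement falls in cladeB, yet cladeA collects more total weight, which is the situation that summing within subtaxa is meant to expose.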


Subject(s)
DNA, Environmental , Phylogeny
14.
Mol Biol Evol ; 38(5): 1744-1760, 2021 05 04.
Article in English | MEDLINE | ID: mdl-33226083

ABSTRACT

Anthozoan corals are an ecologically important group of cnidarians, which power the productivity of reef ecosystems. They are sessile, inhabit shallow, tropical oceans and are highly dependent on sun- and moonlight to regulate sexual reproduction, phototaxis, and photosymbiosis. However, their exposure to high levels of sunlight also imposes an increased risk of UV-induced DNA damage. How have these challenging photic environments influenced photoreceptor evolution and function in these animals? To address this question, we initially screened the cnidarian photoreceptor repertoire for Anthozoa-specific signatures by a broad-scale evolutionary analysis. We compared transcriptomic data of more than 36 cnidarian species and revealed a more diverse photoreceptor repertoire in the anthozoan subphylum than in the subphylum Medusozoa. We classified the three principal opsin classes into distinct subtypes and showed that Anthozoa retained all three classes, which diversified into at least six subtypes. In contrast, in Medusozoa, only one class with a single subtype persists. Similarly, in Anthozoa, we documented three photolyase classes and two cryptochrome (CRY) classes, whereas CRYs are entirely absent in Medusozoa. Interestingly, we also identified one anthozoan CRY class, which exhibited unique tandem duplications of the core functional domains. We next explored the functionality of anthozoan photoreceptors in the model species Exaiptasia diaphana (Aiptasia), which recapitulates key photo-behaviors of corals. We show that the diverse opsin genes are differentially expressed in important life stages common to reef-building corals and Aiptasia and that CRY expression is light regulated. We thereby provide important clues linking coral evolution with photoreceptor diversification.


Subject(s)
Anthozoa/genetics , Biological Evolution , Cryptochromes/genetics , Opsins/genetics , Photoreceptor Cells, Invertebrate/metabolism , Animals , Anthozoa/metabolism , Cryptochromes/metabolism , Opsins/metabolism
15.
Mol Biol Evol ; 38(5): 1777-1791, 2021 05 04.
Article in English | MEDLINE | ID: mdl-33316067

ABSTRACT

Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into subclasses using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.


Subject(s)
COVID-19/genetics , Evolution, Molecular , Genome, Viral , Mutation , Phylogeny , SARS-CoV-2/genetics , Humans
16.
Bioinformatics ; 37(22): 4056-4063, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34037680

ABSTRACT

MOTIVATION: Phylogenetic trees are now routinely inferred on large scale high performance computing systems with thousands of cores as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required and the performance penalties induced via enabling parallel fault tolerance by example of RAxML-NG, the successor of the widely used RAxML tool for maximum likelihood-based phylogenetic tree inference. RESULTS: We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 1.00 ± 0.04. The overall slowdown by using these recovery mechanisms in conjunction with a fault-tolerant Message Passing Interface implementation amounts to on average 1.7 ± 0.6 for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery and failures during checkpointing. Recoveries are automatic and transparent to the user. AVAILABILITY AND IMPLEMENTATION: The modified fault-tolerant RAxML-NG code is available under GNU GPL at https://github.com/lukashuebner/ft-raxml-ng. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Phylogeny , User-Computer Interface , Algorithms , Software
17.
Bioinformatics ; 38(1): 267-269, 2021 12 22.
Article in English | MEDLINE | ID: mdl-34244702

ABSTRACT

MOTIVATION: Previously we presented swarm, an open-source amplicon clustering programme that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here, we present swarm v3 to address issues of contemporary datasets that are growing towards terabyte sizes. RESULTS: When compared with previous swarm versions, swarm v3 has modernized C++ source code, reduced memory footprint by up to 50%, optimized CPU-usage and multithreading (more than 7 times faster with default parameters), and it has been extensively tested for its robustness and logic. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are available at https://github.com/torognes/swarm. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Cluster Analysis
19.
BMC Bioinformatics ; 22(1): 225, 2021 May 01.
Article in English | MEDLINE | ID: mdl-33932975

ABSTRACT

BACKGROUND: In phylogenetic analysis, it is common to infer unrooted trees. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as molecular clock analysis (including midpoint rooting) or rooting the tree using an outgroup. Non-reversible Markov models can also be used to compute the likelihood of a potential root position. RESULTS: We present a software called RootDigger which uses a non-reversible Markov model to compute the most likely root location on a given tree and to infer a confidence value for each possible root placement. We find that RootDigger is successful at finding roots when compared to similar tools such as IQ-TREE and MAD, and will occasionally outperform them. Additionally, we find that the exhaustive mode of RootDigger is useful in quantifying and explaining uncertainty in rooting positions. CONCLUSIONS: RootDigger can be used on an existing phylogeny to find a root, or to assess the uncertainty of the root placement. RootDigger is available under the MIT licence at https://www.github.com/computations/root_digger.


Subject(s)
Evolution, Molecular , Software , Models, Genetic , Phylogeny , Probability , Uncertainty
20.
Mol Biol Evol ; 37(9): 2763-2774, 2020 09 01.
Article in English | MEDLINE | ID: mdl-32502238

ABSTRACT

Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
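
Illustrative note: the relative Robinson-Foulds distance used as the accuracy measure in this abstract counts the bipartitions found in exactly one of two trees and divides by the maximum possible count. The sketch below is a minimal version, not the evaluation code of the paper. It assumes trees given as nested tuples of leaf names, identical leaf sets with at least four leaves, and normalization by 2(n − 3), which is the maximum for two fully resolved unrooted trees.

```python
def leaf_sets(tree, out):
    """Append the leaf set below every node of a nested-tuple tree to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial splits of the unrooted tree, oriented away from one leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # fixed reference leaf gives one canonical side
    splits = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def relative_rf(tree_a, tree_b):
    """Symmetric difference of split sets divided by 2 * (n - 3)."""
    n_leaves = len(leaf_sets(tree_a, []))
    differing = len(bipartitions(tree_a) ^ bipartitions(tree_b))
    return differing / (2 * (n_leaves - 3))


if __name__ == "__main__":
    first = (("A", "B"), "C", ("D", "E"))
    second = (("A", "C"), "B", ("D", "E"))
    print(relative_rf(first, second))
```

A value of 0 means the two topologies are identical, and 1 means they share no non-trivial bipartition.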


Subject(s)
Gene Duplication , Genetic Techniques , Phylogeny , Software , Cyanobacteria/genetics , Gene Deletion , Gene Transfer, Horizontal