ABSTRACT
Bayesian phylogenetic inference requires a tree prior, which models the underlying diversification process that gives rise to the phylogeny. Existing birth-death diversification models include a wide range of features, for instance, lineage-specific variations in speciation and extinction (SSE) rates. While across-lineage variation in SSE rates is widespread in empirical datasets, few heterogeneous rate models have been implemented as tree priors for Bayesian phylogenetic inference. As a consequence, rate heterogeneity is typically ignored when reconstructing phylogenies, and rate heterogeneity is usually investigated on fixed trees. In this paper, we present a new BEAST2 package implementing the cladogenetic diversification rate shift (ClaDS) model as a tree prior. ClaDS is a birth-death diversification model designed to capture small progressive variations in birth and death rates along a phylogeny. Unlike previous implementations of ClaDS, which were designed to be used with fixed, user-chosen phylogenies, our package is implemented in the BEAST2 framework and thus allows full phylogenetic inference, where the phylogeny and model parameters are co-estimated from a molecular alignment. Our package provides all necessary components of the inference, including a new tree object and operators to propose moves to the Monte-Carlo Markov chain. It also includes a graphical interface through BEAUti. We validate our implementation of the package by comparing the produced distributions to simulated data and show an empirical example of the full inference, using a dataset of cetaceans.
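The "small progressive variations" in rates that the abstract attributes to ClaDS can be pictured with a short sketch. It assumes one commonly described parametrisation, which the abstract itself does not state: at each speciation event a daughter lineage draws its new rate from a log-normal distribution centred on `alpha` times the parent rate, with spread `sigma`. The function names, parameter names, and values are hypothetical and this is not the package's code.

```python
import math
import random


def daughter_rate(parent_rate, alpha, sigma, rng):
    """Draw a daughter speciation rate at a branching event.

    Assumed rule (not taken from the abstract): log-normal around
    alpha * parent_rate, with log-scale standard deviation sigma.
    """
    return rng.lognormvariate(math.log(alpha * parent_rate), sigma)


def rates_along_lineage(initial_rate, n_events, alpha, sigma, seed):
    """Follow one lineage through n_events speciation events and record its rate."""
    rng = random.Random(seed)
    rates = [initial_rate]
    for _ in range(n_events):
        rates.append(daughter_rate(rates[-1], alpha, sigma, rng))
    return rates


if __name__ == "__main__":
    # Hypothetical values: a slight downward trend (alpha < 1) with small noise.
    for rate in rates_along_lineage(0.1, 5, alpha=0.9, sigma=0.1, seed=1):
        print(round(rate, 4))
```

With `sigma` set to 0 the rule becomes deterministic and each daughter rate is exactly `alpha` times the parent rate, which is a convenient sanity check.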
Subjects
Genetic Speciation, Phylogeny, Bayes Theorem, Monte Carlo Method, Markov Chains
ABSTRACT
Bayesian total-evidence approaches under the fossilized birth-death model enable biologists to combine fossil and extant data while accounting for uncertainty in the ages of fossil specimens, in an integrative phylogenetic analysis. Fossil age uncertainty is a key feature of the fossil record as many empirical data sets may contain a mix of precisely dated and poorly dated fossil specimens or deposits. In this study, we explore whether reliable age estimates for fossil specimens can be obtained from Bayesian total-evidence phylogenetic analyses under the fossilized birth-death model. Through simulations based on the example of the Baltic amber deposit, we show that estimates of fossil ages obtained through such an analysis are accurate, particularly when the proportion of poorly dated specimens remains low and the majority of fossil specimens have precise dates. We confirm our results using an empirical data set of living and fossil penguins by artificially increasing the age uncertainty around some fossil specimens and showing that the resulting age estimates overlap with the recorded age ranges. Our results are applicable to many empirical data sets where classical methods of establishing fossil ages have failed, such as the Baltic amber and the Gobi Desert deposits. [Bayesian phylogenetic inference; fossil age estimates; fossilized birth-death; Lagerstätte; total-evidence.].
Subjects
Amber, Fossils, Phylogeny, Bayes Theorem, Biological Models
ABSTRACT
For much of terrestrial biodiversity, the evolutionary pathways of adaptation from marine ancestors are poorly understood, and have usually been viewed as a binary trait. True crabs, the decapod crustacean infraorder Brachyura, comprise over 7,600 species representing a striking diversity of morphology and ecology, including repeated adaptation to non-marine habitats. Here, we reconstruct the evolutionary history of Brachyura using new and published sequences of 10 genes for 344 tips spanning 88 of 109 brachyuran families. Using 36 newly vetted fossil calibrations, we infer that brachyurans most likely diverged in the Triassic, with family-level splits in the late Cretaceous and early Paleogene. By contrast, the root age is underestimated with automated sampling of 328 fossil occurrences explicitly incorporated into the tree prior, suggesting such models are a poor fit under heterogeneous fossil preservation. We apply recently defined trait-by-environment associations to classify a gradient of transitions from marine to terrestrial lifestyles. We estimate that crabs left the marine environment at least seven and up to 17 times convergently, and returned to the sea from non-marine environments at least twice. Although the most highly terrestrial- and many freshwater-adapted crabs are concentrated in Thoracotremata, Bayesian threshold models of ancestral state reconstruction fail to identify shifts to higher terrestrial grades due to the degree of underlying change required. Lineages throughout our tree inhabit intertidal and marginal marine environments, corroborating the inference that the early stages of terrestrial adaptation have a lower threshold to evolve. Our framework and extensive new fossil and natural history datasets will enable future comparisons of non-marine adaptation at the morphological and molecular level. 
Crabs provide an important window into the early processes of adaptation to novel environments, and different degrees of evolutionary constraint that might help predict these pathways.
ABSTRACT
Phylodynamic models generally aim at jointly inferring phylogenetic relationships, model parameters, and more recently, the number of lineages through time, based on molecular sequence data. In the fields of epidemiology and macroevolution, these models can be used to estimate, respectively, the past number of infected individuals (prevalence) or the past number of species (paleodiversity) through time. Recent years have seen the development of "total-evidence" analyses, which combine molecular and morphological data from extant and past sampled individuals in a unified Bayesian inference framework. Even sampled individuals characterized only by their sampling time, that is, lacking morphological and molecular data, which we call occurrences, provide invaluable information to estimate the past number of lineages. Here, we present new methodological developments around the fossilized birth-death process enabling us to (i) incorporate occurrence data in the likelihood function; (ii) consider piecewise-constant birth, death, and sampling rates; and (iii) estimate the past number of lineages, with or without knowledge of the underlying tree. We implement our method in the RevBayes software environment, enabling its use along with a large set of models of molecular and morphological evolution, and validate the inference workflow using simulations under a wide range of conditions. We finally illustrate our new implementation using two empirical data sets stemming from the fields of epidemiology and macroevolution. In epidemiology, we infer the prevalence of the coronavirus disease 2019 outbreak on the Diamond Princess ship, by taking into account jointly the case count record (occurrences) along with viral sequences for a fraction of infected individuals. In macroevolution, we infer the diversity trajectory of cetaceans using molecular and morphological data from extant taxa, morphological data from fossils, as well as numerous fossil occurrences. 
The joint modeling of occurrences and trees holds the promise to further bridge the gap between traditional epidemiology and pathogen genomics, as well as paleontology and molecular phylogenetics. [Birth-death model; epidemiology; fossils; macroevolution; occurrences; phylogenetics; skyline.].
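Point (ii) of the abstract, piecewise-constant birth, death, and sampling rates, amounts to looking up a rate by time interval. The sketch below is a minimal illustration under stated assumptions: the function name, the time convention, and the numeric values are hypothetical and are not taken from RevBayes.

```python
from bisect import bisect_right


def piecewise_rate(t, change_times, rates):
    """Return the rate in force at time t.

    change_times must be sorted in increasing order, and rates must hold
    one more value than change_times: rates[i] applies on the interval
    between change_times[i - 1] and change_times[i].
    """
    if len(rates) != len(change_times) + 1:
        raise ValueError("need exactly one more rate than change times")
    return rates[bisect_right(change_times, t)]


if __name__ == "__main__":
    # Hypothetical birth rates with shifts at times 1.0 and 3.0.
    change_times = [1.0, 3.0]
    birth_rates = [0.5, 0.2, 0.1]
    for t in (0.0, 1.0, 2.0, 5.0):
        print(t, piecewise_rate(t, change_times, birth_rates))
```

Using `bisect_right` means a change time belongs to the interval that starts at it; the opposite convention would use `bisect_left`.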
Subjects
COVID-19, Animals, Bayes Theorem, Cetacea, Fossils, Humans, Paleontology, Phylogeny
ABSTRACT
Heterogeneous populations can lead to important differences in birth and death rates across a phylogeny. Taking this heterogeneity into account is necessary to obtain accurate estimates of the underlying population dynamics. We present a new multitype birth-death model (MTBD) that can estimate lineage-specific birth and death rates. This corresponds to estimating lineage-dependent speciation and extinction rates for species phylogenies, and lineage-dependent transmission and recovery rates for pathogen transmission trees. In contrast with previous models, we do not presume to know the trait driving the rate differences, nor do we prohibit the same rates from appearing in different parts of the phylogeny. Using simulated data sets, we show that the MTBD model can reliably infer the presence of multiple evolutionary regimes, their positions in the tree, and the birth and death rates associated with each. We also present a reanalysis of two empirical data sets and compare the results obtained by MTBD and by the existing software BAMM. We compare two implementations of the model, one exact and one approximate (assuming that no rate changes occur in the extinct parts of the tree), and show that the approximation only slightly affects results. The MTBD model is implemented as a package in the Bayesian inference software BEAST 2 and allows joint inference of the phylogeny and the model parameters. [Birth-death; lineage-specific rates; multi-type model.].
Subjects
Birth Rate, Classification/methods, Biological Models, Mortality, Bayes Theorem, Computer Simulation, Phylogeny, Software
ABSTRACT
Elaboration of Bayesian phylogenetic inference methods has continued at pace in recent years with major new advances in nearly all aspects of the joint modelling of evolutionary data. It is increasingly appreciated that some evolutionary questions can only be adequately answered by combining evidence from multiple independent sources of data, including genome sequences, sampling dates, phenotypic data, radiocarbon dates, fossil occurrences, and biogeographic range information among others. Including all relevant data into a single joint model is very challenging both conceptually and computationally. Advanced computational software packages that allow robust development of compatible (sub-)models which can be composed into a full model hierarchy have played a key role in these developments. Developing such software frameworks is increasingly a major scientific activity in its own right, and comes with specific challenges, from practical software design, development and engineering challenges to statistical and conceptual modelling challenges. BEAST 2 is one such computational software platform, and was first announced over 4 years ago. Here we describe a series of major new developments in the BEAST 2 core platform and model hierarchy that have occurred since the first release of the software, culminating in the recent 2.5 release.
Subjects
Bayes Theorem, Biological Evolution, Phylogeny, Software, Animals, Computational Biology, Computer Simulation, Molecular Evolution, Humans, Markov Chains, Genetic Models, Monte Carlo Method
ABSTRACT
BACKGROUND: Gene and protein interaction data are often represented as interaction networks, where nodes stand for genes or gene products and each edge stands for a relationship between a pair of gene nodes. Commonly, that relationship within a pair is specified by high similarity between profiles (vectors) of experimentally defined interactions of each of the two genes with all other genes in the genome; only gene pairs that interact with similar sets of genes are linked by an edge in the network. The tight groups of genes/gene products that work together in a cell can be discovered by the analysis of those complex networks. RESULTS: We show that the choice of the similarity measure between pairs of gene vectors impacts the properties of networks and of gene modules detected within them. We re-analyzed well-studied data on yeast genetic interactions, constructed four genetic networks using four different similarity measures, and detected gene modules in each network using the same algorithm. The four networks induced different numbers of putative functional gene modules, and each similarity measure induced some unique modules. In an example of a putative functional connection suggested by comparing genetic interaction vectors, we predict a link between SUN-domain proteins and protein glycosylation in the endoplasmic reticulum. CONCLUSIONS: The discovery of molecular modules in genetic networks is sensitive to the way of measuring similarity between profiles of gene interactions in a cell. In the absence of a formal way to choose the "best" measure, it is advisable to explore the measures with different mathematical properties, which may identify different sets of connections between genes.
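The sensitivity of a network to the similarity measure can be seen on a toy example. The abstract does not name the four measures used, so the sketch below assumes two common ones, Pearson correlation and cosine similarity. The interaction profiles, the gene labels, and the 0.9 threshold are hypothetical.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation between two equal-length profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def cosine(u, v):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def edges(profiles, measure, threshold):
    """Link every pair of genes whose profile similarity reaches the threshold."""
    return {
        (g, h)
        for g, h in combinations(sorted(profiles), 2)
        if measure(profiles[g], profiles[h]) >= threshold
    }


# Hypothetical interaction-score profiles for three genes.
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 1.0, 4.0, 3.0],
    "geneC": [11.0, 12.0, 13.0, 14.0],
}

if __name__ == "__main__":
    print("Pearson:", sorted(edges(PROFILES, pearson, 0.9)))
    print("Cosine: ", sorted(edges(PROFILES, cosine, 0.9)))
```

On these made-up profiles the Pearson network keeps only the geneA-geneC edge, while the cosine network links all three genes, so any module detection run afterwards starts from different graphs.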
Subjects
Computational Biology/methods, Genetic Epistasis, Algorithms, Gene Regulatory Networks, Fungal Genes, Glycosylation, Molecular Sequence Annotation, Protein Domains, Saccharomyces cerevisiae/genetics, Statistics as Topic
ABSTRACT
Fossil information is essential for estimating species divergence times, and can be integrated into Bayesian phylogenetic inference using the fossilized birth-death (FBD) process. An important aspect of palaeontological data is the uncertainty surrounding specimen ages, which can be handled in different ways during inference. The most common approach is to fix fossil ages to a point estimate within the known age interval. Alternatively, age uncertainty can be incorporated by using priors, and fossil ages are then directly sampled as part of the inference. This study presents a comparison of alternative approaches for handling fossil age uncertainty in analysis using the FBD process. Based on simulations, we find that fixing fossil ages to the midpoint or a random point drawn from within the stratigraphic age range leads to biases in divergence time estimates, while sampling fossil ages leads to estimates that are similar to inferences that employ the correct ages of fossils. Second, we show a comparison using an empirical dataset of extant and fossil cetaceans, which confirms that different methods of handling fossil age uncertainty lead to large differences in estimated node ages. Stratigraphic age uncertainty should thus not be ignored in divergence time estimation and instead should be incorporated explicitly.
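The treatments of fossil age compared in the abstract differ in whether the age is fixed once or redrawn during inference. The sketch below assumes a uniform prior over the stratigraphic range; the function names and the age interval are hypothetical. In a real FBD analysis the sampled ages would be updated by MCMC moves and weighted by the posterior, not drawn independently from the prior as here.

```python
import random


def midpoint_age(min_age, max_age):
    """Fix the fossil age to the midpoint of its stratigraphic range."""
    return 0.5 * (min_age + max_age)


def draw_age(min_age, max_age, rng):
    """Draw one age from a uniform prior over the range (assumed prior)."""
    return rng.uniform(min_age, max_age)


def ages_per_iteration(min_age, max_age, n_iterations, seed):
    """Redraw the age at every iteration, mimicking an age that is sampled
    during inference (simplified: prior draws only, no likelihood)."""
    rng = random.Random(seed)
    return [draw_age(min_age, max_age, rng) for _ in range(n_iterations)]


if __name__ == "__main__":
    # Hypothetical fossil dated to between 33.9 and 37.7 million years ago.
    lo, hi = 33.9, 37.7
    print("midpoint:", midpoint_age(lo, hi))
    print("fixed random point:", draw_age(lo, hi, random.Random(42)))
    print("sampled ages:", [round(a, 2) for a in ages_per_iteration(lo, hi, 5, 42)])
```

The first two treatments commit to a single value for the whole analysis, while the third lets the whole interval contribute.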
Subjects
Biological Evolution, Fossils, Phylogeny, Animals, Cetacea/classification, Computer Simulation, Biological Extinction, Genetic Speciation, Paleontology/methods, Time Factors
ABSTRACT
Phylogenetics and phylodynamics are central topics in modern evolutionary biology. Phylogenetic methods reconstruct the evolutionary relationships among organisms, whereas phylodynamic approaches reveal the underlying diversification processes that lead to the observed relationships. These two fields have many practical applications in disciplines as diverse as epidemiology, developmental biology, palaeontology, ecology, and linguistics. The combination of increasingly large genetic data sets and increases in computing power is facilitating the development of more sophisticated phylogenetic and phylodynamic methods. Big data sets allow us to answer complex questions. However, since the required analyses are highly specific to the particular data set and question, a black-box method is not sufficient anymore. Instead, biologists are required to be actively involved with modeling decisions during data analysis. The modular design of the Bayesian phylogenetic software package BEAST 2 enables, and in fact enforces, this involvement. At the same time, the modular design enables computational biology groups to develop new methods at a rapid rate. A thorough understanding of the models and algorithms used by inference software is a critical prerequisite for successful hypothesis formulation and assessment. In particular, there is a need for more readily available resources aimed at helping interested scientists equip themselves with the skills to confidently use cutting-edge phylogenetic analysis software. These resources will also benefit researchers who do not have access to similar courses or training at their home institutions. Here, we introduce the "Taming the Beast" (https://taming-the-beast.github.io/) resource, which was developed as part of a workshop series bearing the same name, to facilitate the usage of the Bayesian phylogenetic software package BEAST 2.
Subjects
Computational Biology/education, Computational Biology/methods, Phylogeny, Software, Teaching Materials, Algorithms
ABSTRACT
Phylogenetic estimation is, and has always been, a complex endeavor. Estimating a phylogenetic tree involves evaluating many possible solutions and possible evolutionary histories that could explain a set of observed data, typically by using a model of evolution. Values for all model parameters need to be evaluated as well. Modern statistical methods involve not just the estimation of a tree, but also solutions to more complex models involving fossil record information and other data sources. Markov chain Monte Carlo (MCMC) is a leading method for approximating the posterior distribution of parameters in a mathematical model. It is deployed in all Bayesian phylogenetic tree estimation software. While many researchers use MCMC in phylogenetic analyses, interpreting results and diagnosing problems with MCMC remain vexing issues to many biologists. In this manuscript, we will offer an overview of how MCMC is used in Bayesian phylogenetic inference, with a particular emphasis on complex hierarchical models, such as the fossilized birth-death (FBD) model. We will discuss strategies to diagnose common MCMC problems and troubleshoot difficult analyses, in particular convergence issues. We will show how the study design, the choice of models and priors, but also technical features of the inference tools themselves can all be adjusted to obtain the best results. Finally, we will also discuss the unique challenges created by the incorporation of fossil information in phylogenetic inference, and present tips to address them.
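The basic MCMC loop described here can be sketched on a toy problem. The example below is a random-walk Metropolis sampler targeting a standard normal density, not a phylogenetic posterior. The function names, step size, and chain lengths are illustrative assumptions. Running two chains from distant starting points and comparing their means is one simple convergence check of the kind the abstract alludes to.

```python
import math
import random


def log_target(x):
    """Unnormalised log-density of a standard normal (toy target)."""
    return -0.5 * x * x


def metropolis(n_steps, step_size, seed, start=0.0):
    """Random-walk Metropolis sampler; returns samples and acceptance rate."""
    rng = random.Random(seed)
    x = start
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


def mean_after_burn_in(samples, burn_in):
    """Mean of the chain after discarding the first burn_in samples."""
    kept = samples[burn_in:]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Two chains started far apart should agree once they have converged.
    chain_a, rate_a = metropolis(20000, 1.0, seed=1, start=-10.0)
    chain_b, rate_b = metropolis(20000, 1.0, seed=2, start=10.0)
    print("acceptance rates:", round(rate_a, 2), round(rate_b, 2))
    print("means:", mean_after_burn_in(chain_a, 2000), mean_after_burn_in(chain_b, 2000))
```

If the two means disagree markedly, or the acceptance rate is close to 0 or 1, the step size or chain length is worth revisiting before interpreting the samples.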
ABSTRACT
The multi-type birth-death model with sampling is a phylodynamic model which enables the quantification of past population dynamics in structured populations based on phylogenetic trees. The BEAST 2 package bdmm implements an algorithm for numerically computing the probability density of a phylogenetic tree given the population dynamic parameters under this model. In the initial release of bdmm, analyses were computationally limited to trees consisting of up to approximately 250 genetic samples. We implemented important algorithmic changes to bdmm which dramatically increased the number of genetic samples that could be analyzed and which improved the numerical robustness and efficiency of the calculations. Including more samples led to improved precision of parameter estimates, particularly for structured models with a high number of inferred parameters. Furthermore, we report on several model extensions to bdmm, inspired by properties common to empirical datasets. We applied this improved algorithm to two partly overlapping datasets of Influenza A virus HA sequences sampled around the world, one with 500 samples and the other with only 175, for comparison. We report and compare the global migration patterns and seasonal dynamics inferred from each dataset. In this way, we show the information that is gained by analyzing the bigger dataset, which became possible with the presented algorithmic changes to bdmm. In summary, bdmm allows for robust, faster, and more general phylodynamic inference of larger datasets.
Subjects
Algorithms, Phylogeny, Population Dynamics
ABSTRACT
HIV patients form clusters in HIV transmission networks. Accurate identification of these transmission clusters is essential to effectively target public health interventions. One reason for clustering is that the underlying contact network contains many local communities. We present a new maximum-likelihood method for identifying transmission clusters caused by community structure, based on phylogenetic trees. The method employs a multi-state birth-death (MSBD) model which detects changes in transmission rate, which are interpreted as the introduction of the epidemic into a new susceptible community, i.e. the formation of a new cluster. We show that the MSBD method is able to reliably infer the clusters and the transmission parameters from a pathogen phylogeny based on our simulations. In contrast to existing cutpoint-based methods for cluster identification, our method does not require that clusters be monophyletic nor is it dependent on the selection of a difficult-to-interpret cutpoint parameter. We present an application of our method to data from the Swiss HIV Cohort Study. The method is available as an easy-to-use R package.