Expectation-Maximization enables Phylogenetic Dating under a Categorical Rate Model.
Mai, Uyen; Charvel, Eduardo; Mirarab, Siavash.
Affiliation
  • Mai U; Department of Computer Science and Engineering, UC San Diego, CA 92093, USA.
  • Charvel E; Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA.
  • Mirarab S; Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA.
Syst Biol; 2024 Jul 05.
Article in En | MEDLINE | ID: mdl-38970346
ABSTRACT
Dating phylogenetic trees to obtain branch lengths in units of time is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume that information is available for a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification, where the assumed parametric statistical distribution of branch rates differs vastly from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate rate categories and branch lengths in time units. Our model makes fewer assumptions about the true distribution of branch rates than parametric models such as the Gamma or LogNormal distributions. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.
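The idea of fitting k discrete rate categories with EM can be illustrated with a toy sketch. This is not the MD-Cat implementation: it assumes branch durations are already known (so it estimates only the rate categories, not the dates), assumes a Poisson model for substitution counts, and gives every category equal weight 1/k. The function name `em_categorical_rates`, its parameters, and the example data are all hypothetical.

```python
import math

def em_categorical_rates(counts, times, seq_len, k=2, iters=50):
    """Toy EM for a k-category rate mixture with known branch durations.

    counts[i]: substitutions observed on branch i (hypothetical input)
    times[i]:  duration of branch i in time units (assumed known, > 0)
    seq_len:   alignment length; the Poisson mean is seq_len * rate * time
    """
    # Initialise the categories evenly between the smallest and largest
    # per-branch naive rate estimates.
    naive = [n / (seq_len * t) for n, t in zip(counts, times)]
    lo, hi = min(naive), max(naive)
    rates = [max(lo + (hi - lo) * j / max(k - 1, 1), 1e-12) for j in range(k)]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each category for each branch.
        # Equal category weights (1/k) and the log(n!) term cancel out.
        resp = []
        for n, t in zip(counts, times):
            logp = [n * math.log(seq_len * r * t) - seq_len * r * t
                    for r in rates]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            z = sum(w)
            resp.append([v / z for v in w])
        # M-step: each rate is a responsibility-weighted count per unit
        # of exposure (sites * time).
        for j in range(k):
            num = sum(q[j] * n for q, n in zip(resp, counts))
            den = seq_len * sum(q[j] * t for q, t in zip(resp, times))
            if den > 0:
                rates[j] = max(num / den, 1e-12)
    return rates, resp

# Made-up example: branches alternate between a slow and a fast rate.
counts = [2, 10, 4, 30, 6, 50, 8, 70, 10, 90]
times = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
rates, resp = em_categorical_rates(counts, times, seq_len=1000, k=2)
```

In the method described in the abstract, branch lengths in time units are co-estimated together with the rate categories, subject to constraints from fossil or sampling-time calibrations; the sketch above omits that part entirely and only shows how a discretized rate distribution lends itself to EM updates.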

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article
