Results 1 - 17 of 17
1.
Biometrika ; 111(1): 171-193, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38352626

ABSTRACT

Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remain challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively little literature exists for unlabelled trees, which are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, or trees equipped with branch lengths, to define the Fréchet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix.
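
As a rough illustration of the Fréchet mean and variance named in this abstract, the sketch below does a brute-force search over a finite candidate set. It is not the combinatorial optimization algorithm the authors describe. The function name `frechet_mean`, the tuple encoding of objects, and the Euclidean distance are all assumptions that stand in for the tree metrics.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def frechet_mean(
    candidates: Sequence[T],
    sample: Sequence[T],
    dist: Callable[[T, T], float],
) -> Tuple[T, float]:
    """Return the candidate minimizing the mean squared distance to the sample,
    together with that minimum (the sample Frechet variance)."""
    if not candidates or not sample:
        raise ValueError("candidates and sample must be non-empty")
    best, best_value = candidates[0], math.inf
    for c in candidates:
        # Mean squared distance from candidate c to every sampled object.
        value = sum(dist(c, x) ** 2 for x in sample) / len(sample)
        if value < best_value:
            best, best_value = c, value
    return best, best_value


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    # Hypothetical stand-in for a metric on ranked tree shapes.
    return math.dist(a, b)


# Toy data: three points on a line; the middle one minimizes the objective.
points = [(0, 0), (1, 1), (2, 2)]
mean, variance = frechet_mean(points, points, euclidean)
```

Any function that satisfies the metric axioms could be passed as `dist`; the exhaustive loop is only practical for small candidate sets.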

2.
PLoS Comput Biol ; 19(3): e1010897, 2023 03.
Article in English | MEDLINE | ID: mdl-36940209

ABSTRACT

The coalescent is a powerful statistical framework that allows us to infer past population dynamics leveraging the ancestral relationships reconstructed from sampled molecular sequence data. In many biomedical applications, such as in the study of infectious diseases, cell development, and tumorigenesis, several distinct populations share evolutionary history and therefore become dependent. The inference of such dependence is a highly important yet challenging problem. With advances in sequencing technologies, we are well positioned to exploit the wealth of high-resolution biological data for tackling this problem. Here, we present adaPop, a probabilistic model to estimate past population dynamics of dependent populations and to quantify their degree of dependence. An essential feature of our approach is the ability to track the time-varying association between the populations while making minimal assumptions on their functional shapes via Markov random field priors. We provide nonparametric estimators, extensions of our base model that integrate multiple data sources, and fast scalable inference algorithms. We test our method using simulated data under various dependent population histories and demonstrate the utility of our model in shedding light on evolutionary histories of different variants of SARS-CoV-2.


Subject(s)
COVID-19 , Humans , Bayes Theorem , COVID-19/epidemiology , SARS-CoV-2/genetics , Population Dynamics , Models, Statistical , Algorithms , Models, Genetic , Genetics, Population
3.
Stat Sci ; 37(2): 162-182, 2022 May.
Article in English | MEDLINE | ID: mdl-36034090

ABSTRACT

Genomic surveillance of SARS-CoV-2 has been instrumental in tracking the spread and evolution of the virus during the pandemic. The availability of SARS-CoV-2 molecular sequences isolated from infected individuals, coupled with phylodynamic methods, has provided insights into the origin of the virus, its evolutionary rate, the timing of introductions, the patterns of transmission, and the rise of novel variants that have spread through populations. Despite enormous global efforts of governments, laboratories, and researchers to collect and sequence molecular data, many challenges remain in analyzing and interpreting the data collected. Here, we describe the models and methods currently used to monitor the spread of SARS-CoV-2, discuss long-standing and new statistical challenges, and propose a method for tracking the rise of novel variants during the epidemic.

4.
J Comput Graph Stat ; 31(2): 541-552, 2022.
Article in English | MEDLINE | ID: mdl-36035966

ABSTRACT

Longitudinal molecular data of rapidly evolving viruses and pathogens provide information about disease spread and complement traditional surveillance approaches based on case count data. The coalescent is used to model the genealogy that represents the sample ancestral relationships. The basic assumption is that coalescent events occur at a rate inversely proportional to the effective population size $N_{e}(t)$, a time-varying measure of genetic diversity. When the sampling process (collection of samples over time) depends on $N_{e}(t)$, the coalescent and the sampling processes can be jointly modeled to improve estimation of $N_{e}(t)$. Failing to do so can lead to bias due to model misspecification. However, the way that the sampling process depends on the effective population size may vary over time. We introduce an approach where the sampling process is modeled as an inhomogeneous Poisson process with rate equal to the product of $N_{e}(t)$ and a time-varying coefficient, making minimal assumptions on their functional shapes via Markov random field priors. We provide efficient algorithms for inference, show the model performance vis-a-vis alternative methods in a simulation study, and apply our model to SARS-CoV-2 sequences from Los Angeles and Santa Clara counties. The methodology is implemented and available in the R package adapref. Supplementary files for this article are available online.
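
The sampling model in this abstract, an inhomogeneous Poisson process whose rate is the product of the effective population size and a time-varying coefficient, can be simulated with the standard thinning method. The sketch below is only an illustration and does not reproduce the adapref package. The names `simulate_sampling_times`, `ne`, `beta`, and `rate_bound`, and the example trajectories, are assumptions.

```python
import math
import random
from typing import Callable, List


def simulate_sampling_times(
    ne: Callable[[float], float],
    beta: Callable[[float], float],
    t_max: float,
    rate_bound: float,
    rng: random.Random,
) -> List[float]:
    """Draw event times on [0, t_max] from a Poisson process with
    rate(t) = beta(t) * ne(t), using thinning with a constant upper bound."""
    times: List[float] = []
    t = 0.0
    while True:
        # Propose the next event from a homogeneous process at the bound.
        t += rng.expovariate(rate_bound)
        if t > t_max:
            break
        rate = beta(t) * ne(t)
        if rate > rate_bound:
            raise ValueError("rate_bound must dominate beta(t) * ne(t)")
        # Accept the proposal with probability rate / rate_bound.
        if rng.random() < rate / rate_bound:
            times.append(t)
    return times


# Hypothetical trajectories: oscillating population size, decaying coefficient.
example_times = simulate_sampling_times(
    ne=lambda t: 10.0 + 5.0 * math.sin(t),
    beta=lambda t: 0.5 * math.exp(-0.1 * t),
    t_max=20.0,
    rate_bound=7.5,
    rng=random.Random(1),
)
```

The constant bound keeps the sketch short; a piecewise bound would waste fewer proposals when the rate varies strongly over time.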

5.
J Math Biol ; 84(6): 54, 2022 05 12.
Article in English | MEDLINE | ID: mdl-35552538

ABSTRACT

Evolutionary models used for describing molecular sequence variation suppose that at a non-recombining genomic segment, sequences share ancestry that can be represented as a genealogy: a rooted, binary, timed tree, with tips corresponding to individual sequences. Under the infinitely-many-sites mutation model, mutations are randomly superimposed along the branches of the genealogy, so that every mutation occurs at a chromosomal site that has not previously mutated; if a mutation occurs at an interior branch, then all individuals descending from that branch carry the mutation. The implication is that observed patterns of molecular variation from this model impose combinatorial constraints on the hidden state space of genealogies. In particular, observed molecular variation can be represented in the form of a perfect phylogeny, a tree structure that fully encodes the mutational differences among sequences. For a sample of n sequences, a perfect phylogeny might not possess n distinct leaves, and hence might be compatible with many possible binary tree structures that could describe the evolutionary relationships among the n sequences. Here, we investigate enumerative properties of the set of binary ranked and unranked tree shapes that are compatible with a perfect phylogeny, and hence, the binary ranked and unranked tree shapes conditioned on an observed pattern of mutations under the infinitely-many-sites mutation model. We provide a recursive enumeration of these shapes. We consider both perfect phylogenies that can be represented as binary and those that are multifurcating. The results have implications for computational aspects of the statistical inference of evolutionary parameters that underlie sets of molecular sequences.


Subject(s)
Biological Evolution , Models, Genetic , Algorithms , Humans , Mutation , Phylogeny
6.
Int J Infect Dis ; 116: 11-13, 2022 Mar.
Article in English | MEDLINE | ID: mdl-34902583

ABSTRACT

OBJECTIVE: We quantify the impact of COVID-19-related control measures on the spread of human influenza virus H1N1 and H3N2. METHODS: We analyzed case numbers to estimate the end of the 2019-2020 influenza season and compared it with the median of the previous 9 seasons. In addition, we used influenza molecular data to compare within-region and between-region genetic diversity and effective population size from 2019 to 2020. Finally, we analyzed personal behavior and policy stringency data for each region. RESULTS: The 2019-2020 influenza season ended earlier than the median of the previous 9 seasons in all regions. For H1N1 and H3N2, there was an increase in between-region genetic diversity in most pairs of regions between 2019 and 2020. There was a decrease in within-region genetic diversity for 12 of 14 regions for H1N1 and 9 of 12 regions for H3N2. There was a decrease in effective population size for 10 of 13 regions for H1N1 and 3 of 7 regions for H3N2. CONCLUSIONS: We found consistent evidence of a decrease in influenza incidence after the introduction of preventive measures due to COVID-19 emergence.


Subject(s)
COVID-19 , Influenza A Virus, H1N1 Subtype , Influenza, Human , COVID-19/epidemiology , COVID-19/prevention & control , Humans , Influenza A Virus, H1N1 Subtype/genetics , Influenza A Virus, H3N2 Subtype/genetics , Influenza, Human/epidemiology , Influenza, Human/prevention & control , SARS-CoV-2/genetics , Seasons
7.
Proc Natl Acad Sci U S A ; 117(46): 28876-28886, 2020 11 17.
Article in English | MEDLINE | ID: mdl-33139566

ABSTRACT

Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.


Subject(s)
Genetics, Population/methods , Sequence Analysis, DNA/methods , Biological Evolution , Computer Simulation , Models, Genetic , Pedigree , Phylogeny
8.
ArXiv ; 2020 Sep 04.
Article in English | MEDLINE | ID: mdl-32908947

ABSTRACT

Longitudinal molecular data of rapidly evolving viruses and pathogens provide information about disease spread and complement traditional surveillance approaches based on case count data. The coalescent is used to model the genealogy that represents the sample ancestral relationships. The basic assumption is that coalescent events occur at a rate inversely proportional to the effective population size $N_{e}(t)$, a time-varying measure of genetic diversity. When the sampling process (collection of samples over time) depends on $N_{e}(t)$, the coalescent and the sampling processes can be jointly modeled to improve estimation of $N_{e}(t)$. Failing to do so can lead to bias due to model misspecification. However, the way that the sampling process depends on the effective population size may vary over time. We introduce an approach where the sampling process is modeled as an inhomogeneous Poisson process with rate equal to the product of $N_{e}(t)$ and a time-varying coefficient, making minimal assumptions on their functional shapes via Markov random field priors. We provide scalable algorithms for inference, show the model performance vis-a-vis alternative methods in a simulation study, and apply our model to SARS-CoV-2 sequences from Los Angeles and Santa Clara counties. The methodology is implemented and available in the R package adapref.

10.
Ann Appl Stat ; 14(2): 727-751, 2020 Jun.
Article in English | MEDLINE | ID: mdl-33995755

ABSTRACT

Statistical inference of evolutionary parameters from molecular sequence data relies on coalescent models to account for the shared genealogical ancestry of the samples. However, inferential algorithms do not scale to available data sets. A strategy to improve computational efficiency is to rely on simpler coalescent and mutation models, resulting in smaller hidden state spaces. An estimate of the cardinality of the state space of genealogical trees at different resolutions is essential to decide the best modeling strategy for a given dataset. To our knowledge, there is neither an exact nor an approximate method to determine these cardinalities. We propose a sequential importance sampling algorithm to estimate the cardinality of the sample space of genealogical trees under different coalescent resolutions. Our sampling scheme proceeds sequentially across the set of combinatorial constraints imposed by the data, which in this work are completely linked sequences of DNA at a non-recombining segment. We analyze the cardinality of different genealogical tree spaces on simulations to study the settings that favor coarser resolutions. We apply our method to estimate the cardinality of genealogical tree spaces from mtDNA data from the 1000 Genomes Project and a sample from a Melanesian population at the β-globin locus.
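
The general idea of estimating a cardinality by sequential importance sampling can be shown on a toy counting problem. The sketch below is not the authors' algorithm and does not involve genealogies. It counts binary strings of length n with no two adjacent 1s, a problem chosen only because the exact answer is easy to check. Each string is built one position at a time by choosing uniformly among the allowed extensions, and the product of the numbers of choices is an unbiased estimate of the count. All function names are assumptions.

```python
import random


def sis_count_estimate(n: int, num_samples: int, rng: random.Random) -> float:
    """Estimate the number of length-n binary strings without adjacent 1s."""
    total = 0.0
    for _sample in range(num_samples):
        weight = 1  # inverse of the probability of the string being built
        prev = 0
        for _position in range(n):
            # The constraint: a 1 may not follow a 1.
            choices = (0,) if prev == 1 else (0, 1)
            weight *= len(choices)
            prev = rng.choice(choices)
        total += weight
    return total / num_samples


def exact_count(n: int) -> int:
    """Exact count by the Fibonacci-type recursion, for comparison."""
    a, b = 1, 2  # counts for lengths 0 and 1
    for _ in range(n):
        a, b = b, a + b
    return a


estimate = sis_count_estimate(10, 20000, random.Random(0))
truth = exact_count(10)
```

In this toy problem the proposal is uniform over the allowed extensions at every step; the quality of such an estimator in general depends on how closely the proposal tracks the uniform distribution over the constrained set.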

11.
Genetics ; 213(3): 967-986, 2019 11.
Article in English | MEDLINE | ID: mdl-31511299

ABSTRACT

The large state space of gene genealogies is a major hurdle for inference methods based on Kingman's coalescent. Here, we present a new Bayesian approach for inferring past population sizes, which relies on a lower-resolution coalescent process that we refer to as "Tajima's coalescent." Tajima's coalescent has a drastically smaller state space, and hence it is a computationally more efficient model, than the standard Kingman coalescent. We provide a new algorithm for efficient and exact likelihood calculations for data without recombination, which exploits a directed acyclic graph and a correspondingly tailored Markov chain Monte Carlo method. We compare the performance of our Bayesian Estimation of population size changes by Sampling Tajima's Trees (BESTT) with a popular implementation of coalescent-based inference in BEAST using simulated and human data. We empirically demonstrate that BESTT can accurately infer effective population sizes, and it further provides an efficient alternative to Kingman's coalescent. The algorithms described here are implemented in the R package phylodyn, which is available for download at https://github.com/JuliaPalacios/phylodyn.


Subject(s)
Genetics, Population/methods , Models, Genetic , Software , Bayes Theorem
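
To give a sense of the state-space reduction claimed in entry 11, the sketch below compares two counts for n sampled sequences. It relies on standard combinatorial results that are not stated in the abstract: ranked labelled binary trees number n!(n-1)!/2^(n-1), and ranked unlabelled tree shapes are counted by the Euler zigzag numbers. The function names are assumptions, and the sketch counts tree topologies only, ignoring branch lengths.

```python
from math import comb, factorial


def kingman_count(n: int) -> int:
    """Number of ranked, labelled binary trees on n leaves."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def euler_zigzag(m: int) -> int:
    """m-th Euler zigzag number, via 2*E[k+1] = sum_j C(k, j) * E[j] * E[k-j]."""
    e = [1, 1]
    for k in range(1, m):
        e.append(sum(comb(k, j) * e[j] * e[k - j] for j in range(k + 1)) // 2)
    return e[m]


def tajima_count(n: int) -> int:
    """Number of ranked, unlabelled binary tree shapes on n leaves."""
    return euler_zigzag(n - 1)


# Side-by-side counts for small samples: (n, labelled, unlabelled).
table = [(n, kingman_count(n), tajima_count(n)) for n in range(2, 11)]
```

The labelled count grows much faster than the unlabelled one, which is the gap the lower-resolution model exploits.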
12.
Theor Popul Biol ; 125: 75-93, 2019 02.
Article in English | MEDLINE | ID: mdl-30571959

ABSTRACT

Recovery of population size history from molecular sequence data is an important problem in population genetics. Inference commonly relies on a coalescent model linking the population size history to genealogies. The high computational cost of estimating parameters from these models usually compels researchers to select a subset of the available data or to rely on insufficient summary statistics for statistical inference. We consider the problem of recovering the true population size history from two possible alternatives on the basis of coalescent time data previously considered by Kim et al. (2015). We improve upon previous results by giving exact expressions for the probability of correctly distinguishing between the two hypotheses as a function of the separation between the alternative size histories, the number of individuals, loci, and the sampling times. In more complicated settings we estimate the exact probability of correct recovery by Monte Carlo simulation. Our results give considerably more pessimistic inferential limits than those previously reported. We also extend our analyses to pairwise SMC and SMC' models of recombination. This work is relevant for optimal design when the inference goal is to test scientific hypotheses about population size trajectories in coalescent models with and without recombination.


Subject(s)
Bayes Theorem , Genetics, Population , Genetic Variation , Genetics, Population/statistics & numerical data , Markov Chains , Molecular Sequence Data , Population Density
13.
Mol Ecol Resour ; 17(1): 96-100, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27801980

ABSTRACT

We introduce phylodyn, an R package for phylodynamic analysis based on gene genealogies. The package's main functionality is Bayesian nonparametric estimation of effective population size fluctuations over time. Our implementation includes several Markov chain Monte Carlo-based methods and an integrated nested Laplace approximation-based approach for phylodynamic inference that have been developed in recent years. Genealogical data describe the timed ancestral relationships of individuals sampled from a population of interest. Here, individuals are assumed to be sampled at the same point in time (isochronous sampling) or at different points in time (heterochronous sampling); in addition, sampling events can be modelled with preferential sampling, which means that the intensity of sampling events is allowed to depend on the effective population size trajectory. We assume the coalescent and the sequentially Markov coalescent processes as generative models of genealogies. We include several coalescent simulation functions that are useful for testing our phylodynamics methods via simulation studies. We compare the performance and outputs of various methods implemented in phylodyn and outline their strengths and weaknesses. The R package phylodyn is available at https://github.com/mdkarcher/phylodyn.


Subject(s)
Biostatistics/methods , Computational Biology/methods , Computer Simulation , Genetics, Population/methods , Population Dynamics , Software
14.
PLoS Comput Biol ; 12(3): e1004789, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26938243

ABSTRACT

Phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from a population of interest. One way to accomplish this task formulates an observed sequence data likelihood exploiting a coalescent model for the sampled individuals' genealogy and then integrating over all possible genealogies via Monte Carlo or, less efficiently, by conditioning on one genealogy estimated from the sequence data. However, when analyzing sequences sampled serially through time, current methods implicitly assume either that sampling times are fixed deterministically by the data collection protocol or that their distribution does not depend on the size of the population. Through simulation, we first show that, when sampling times do probabilistically depend on effective population size, estimation methods may be systematically biased. To correct for this deficiency, we propose a new model that explicitly accounts for preferential sampling by modeling the sampling times as an inhomogeneous Poisson process dependent on effective population size. We demonstrate that in the presence of preferential sampling our new model not only reduces bias, but also improves estimation precision. Finally, we compare the performance of the currently used phylodynamic methods with our proposed model through clinically-relevant, seasonal human influenza examples.


Subject(s)
Genetics, Population , Hemagglutinins/genetics , Influenza A Virus, H3N2 Subtype/genetics , Models, Genetic , Models, Statistical , Biological Evolution , Computer Simulation , Data Interpretation, Statistical , Genetic Variation/genetics , Phylogeny , Sample Size
15.
Genetics ; 201(1): 281-304, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26224734

ABSTRACT

Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model that allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum-likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method's credible intervals for population size as a function of time cover 90% of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.


Subject(s)
Black People/genetics , Computational Biology/methods , Pedigree , White People/genetics , Bayes Theorem , Genome, Human , Human Migration , Humans , Markov Chains , Models, Genetic , Population Density
16.
Bioinformatics ; 31(20): 3282-9, 2015 Oct 15.
Article in English | MEDLINE | ID: mdl-26093147

ABSTRACT

MOTIVATION: The field of phylodynamics focuses on the problem of reconstructing population size dynamics over time using current genetic samples taken from the population of interest. This technique has been extensively used in many areas of biology but is particularly useful for studying the spread of quickly evolving infectious disease agents, e.g. influenza virus. Phylodynamic inference uses a coalescent model that defines a probability density for the genealogy of randomly sampled individuals from the population. When we assume that such a genealogy is known, the coalescent model, equipped with a Gaussian process prior on population size trajectory, allows for nonparametric Bayesian estimation of population size dynamics. Although this approach is quite powerful, large datasets collected during infectious disease surveillance challenge the state-of-the-art of Bayesian phylodynamics and demand inferential methods with relatively low computational cost. RESULTS: To satisfy this demand, we provide a computationally efficient Bayesian inference framework based on Hamiltonian Monte Carlo for coalescent process models. Moreover, we show that by splitting the Hamiltonian function, we can further improve the efficiency of this approach. Using several simulated and real datasets, we show that our method provides accurate estimates of population size dynamics and is substantially faster than alternative methods based on the elliptical slice sampler and the Metropolis-adjusted Langevin algorithm. AVAILABILITY AND IMPLEMENTATION: The R code for all simulation studies and real data analysis conducted in this article is publicly available at http://www.ics.uci.edu/~slan/lanzi/CODES.html and in the R package phylodyn available at https://github.com/mdkarcher/phylodyn. CONTACT: S.Lan@warwick.ac.uk or babaks@uci.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genetics, Population/methods , Algorithms , Bayes Theorem , Humans , Influenza, Human/epidemiology , Models, Statistical , Monte Carlo Method , Orthomyxoviridae/genetics , Population Density , Population Dynamics , Software , Statistics, Nonparametric
17.
Biometrics ; 69(1): 8-18, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23409705

ABSTRACT

Changes in population size influence genetic diversity of the population and, as a result, leave a signature of these changes in individual genomes in the population. We are interested in the inverse problem of reconstructing past population dynamics from genomic data. We start with a standard framework based on the coalescent, a stochastic process that generates genealogies connecting randomly sampled individuals from the population of interest. These genealogies serve as a glue between the population demographic history and genomic sequences. It turns out that only the times of genealogical lineage coalescences contain information about population size dynamics. Viewing these coalescent times as a point process, estimating population size trajectories is equivalent to estimating a conditional intensity of this point process. Therefore, our inverse problem is similar to estimating an inhomogeneous Poisson process intensity function. We demonstrate how recent advances in Gaussian process-based nonparametric inference for Poisson processes can be extended to Bayesian nonparametric estimation of population size dynamics under the coalescent. We compare our Gaussian process (GP) approach to one of the state-of-the-art Gaussian Markov random field (GMRF) methods for estimating population trajectories. Using simulated data, we demonstrate that our method has better accuracy and precision. Next, we analyze two genealogies reconstructed from real sequences of hepatitis C and human influenza A viruses. In both cases, we recover more of the accepted aspects of the viral demographic histories than the GMRF approach. We also find that our GP method produces more reasonable uncertainty estimates than the GMRF method.


Subject(s)
Bayes Theorem , Genetic Variation , Models, Genetic , Models, Statistical , Population Density , Population Dynamics , Computer Simulation , Hepacivirus/genetics , Hepatitis C/epidemiology , Humans , Influenza A Virus, H3N2 Subtype/genetics , Influenza, Human/epidemiology
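
Entry 17 treats coalescent times as a point process whose intensity depends on population size. A minimal sketch of the forward direction, assuming the standard Kingman coalescent with a constant effective population size, is given below: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 divided by the population size. The function name and the constant-size simplification are assumptions. The paper itself concerns the inverse problem with a time-varying size.

```python
import random
from typing import List


def simulate_coalescent_times(n: int, ne: float, rng: random.Random) -> List[float]:
    """Simulate the n - 1 coalescent times for n lineages sampled at time 0
    under a constant effective population size ne."""
    if n < 2 or ne <= 0:
        raise ValueError("need n >= 2 and ne > 0")
    times: List[float] = []
    t = 0.0
    for k in range(n, 1, -1):
        # With k lineages there are k*(k-1)/2 pairs that can coalesce.
        rate = k * (k - 1) / 2.0 / ne
        t += rng.expovariate(rate)
        times.append(t)
    return times


coalescent_times = simulate_coalescent_times(n=10, ne=100.0, rng=random.Random(7))
```

Intervals between the returned times tend to lengthen toward the root because the pair count shrinks as lineages merge, which is why recent population sizes are usually better resolved than ancient ones.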