1.
Bull Math Biol ; 86(9): 106, 2024 Jul 12.
Article in English | MEDLINE | ID: mdl-38995457

ABSTRACT

Maximum likelihood estimation is among the most widely used methods for inferring phylogenetic trees from sequence data. This paper solves the problem of computing solutions to the maximum likelihood problem for 3-leaf trees under the 2-state symmetric mutation model (CFN model). Our main result is a closed-form solution to the maximum likelihood problem for unrooted 3-leaf trees, given generic data; this result characterizes all of the ways that a maximum likelihood estimate can fail to exist for generic data and provides theoretical validation for predictions made in Parks and Goldman (Syst Biol 63(5):798-811, 2014). Our proof makes use of both classical tools for studying group-based phylogenetic models, such as Hadamard conjugation and reparameterization in terms of Fourier coordinates, and more recent results concerning the semi-algebraic constraints of the CFN model. To put these results into practice, we also give a complete characterization for testing genericity.


Subjects
Mathematical Concepts, Genetic Models, Mutation, Phylogeny, Likelihood Functions, Algorithms
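As an illustration of the Fourier-coordinate viewpoint mentioned in the abstract of entry 1, the following is a minimal moment-matching sketch for a 3-leaf CFN tree. It relies on the standard fact that, under the CFN model, the expected correlation between two leaves is the product of the edge parameters theta = 1 - 2p along the path joining them (p being the probability of a state change on the edge). The function name `cfn_three_leaf_thetas` and the binary-string input format are assumptions made for this example; this is not the closed-form result of the paper, and it simply returns `None` when no interior solution exists.

```python
import math


def cfn_three_leaf_thetas(sites):
    """Moment-matching estimate of edge parameters on a 3-leaf CFN tree.

    sites: iterable of 3-character strings over {'0', '1'}, one per
    alignment column (hypothetical input format for this sketch).
    Returns (theta_0, theta_1, theta_2), where theta_i = 1 - 2 * p_i for
    the edge leading to leaf i, or None if the estimates do not all lie
    strictly between 0 and 1.
    """
    sites = list(sites)
    n = len(sites)

    def corr(i, j):
        # Empirical correlation: +1 when leaves agree, -1 when they differ.
        return sum(1 if s[i] == s[j] else -1 for s in sites) / n

    c01, c02, c12 = corr(0, 1), corr(0, 2), corr(1, 2)
    if min(c01, c02, c12) <= 0:
        return None  # the products of thetas cannot be matched
    # Solve c01 = t0*t1, c02 = t0*t2, c12 = t1*t2 for t0, t1, t2.
    thetas = (
        math.sqrt(c01 * c02 / c12),
        math.sqrt(c01 * c12 / c02),
        math.sqrt(c02 * c12 / c01),
    )
    if any(t >= 1 for t in thetas):
        return None  # would correspond to a non-positive branch length
    return thetas
```

For example, a data set with 7 constant columns and 3 columns for each single-leaf difference has pairwise correlation 0.25 for every pair, so each edge parameter is estimated as 0.5.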
2.
J Comput Biol ; 30(11): 1146-1181, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37902986

ABSTRACT

We address the problem of rooting an unrooted species tree given a set of unrooted gene trees, under the assumption that gene trees evolve within the model species tree under the multispecies coalescent (MSC) model. Quintet Rooting (QR) is a polynomial time algorithm that was recently proposed for this problem, which is based on the theory developed by Allman, Degnan, and Rhodes that proves the identifiability of rooted 5-taxon trees from unrooted gene trees under the MSC. However, although QR had good accuracy in simulations, its statistical consistency was left as an open problem. We present QR-STAR, a variant of QR with an additional step and a different cost function, and prove that it is statistically consistent under the MSC. Moreover, we derive sample complexity bounds for QR-STAR and show that a particular variant of it based on "short quintets" has polynomial sample complexity. Finally, our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open-source form on GitHub.


Subjects
Algorithms, Genetic Models, Phylogeny, Computer Simulation
3.
J Comput Biol ; 29(11): 1173-1197, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36048557

ABSTRACT

We consider species tree estimation from multiple loci subject to intralocus recombination. We focus on R∗, a summary coalescent-based method using rooted triplets, as well as a related quartet-based inference method. We demonstrate analytically that in both cases, intralocus recombination gives rise to an inconsistency zone, in which correct inference is not assured even in the limit of an infinite amount of data. In addition, we validate and characterize this inconsistency zone through a simulation study, which suggests that differential rates of recombination between closely related taxa can amplify the effect of incomplete lineage sorting and contribute to inconsistency.


Subjects
Genetic Speciation, Genetic Recombination, Phylogeny, Computer Simulation, Genetic Models
4.
J Math Biol ; 84(5): 36, 2022 Apr 8.
Article in English | MEDLINE | ID: mdl-35394192

ABSTRACT

Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated with each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or that avoid gene tree estimation altogether. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we generalize this result beyond that assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.


Subjects
Genetic Speciation, Genetic Models, Phylogeny
5.
J Comput Biol ; 28(5): 452-468, 2021 May.
Article in English | MEDLINE | ID: mdl-33325781

ABSTRACT

Phylogenomics, the estimation of species trees from multilocus data sets, is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this article, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL.


Subjects
Computational Biology/methods, Gene Deletion, Gene Duplication, Algorithms, Genetic Speciation, Genetic Models, Phylogeny
6.
Bull Math Biol ; 82(9): 123, 2020 Sep 13.
Article in English | MEDLINE | ID: mdl-32920679

ABSTRACT

We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that no consistent distance estimation is possible from sequence lengths alone. More formally, we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.


Subjects
Molecular Evolution, Genetic Models, Base Sequence, Mathematical Concepts, Phylogeny, Probability
7.
Bull Math Biol ; 82(2): 21, 2020 Jan 22.
Article in English | MEDLINE | ID: mdl-31970502

ABSTRACT

In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenetic analyses often rely on molecular sequences, such as DNA sequences, collected from the species of interest, and it is common in this context to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, insertions and deletions of nucleotides, also known as indels, are commonly omitted. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here, we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the "big bang" condition, a necessary and sufficient condition for statistical consistency in this setting.


Subjects
DNA/genetics, Genetic Models, Algorithms, Base Sequence, Computational Biology, Computer Simulation, Molecular Evolution, INDEL Mutation, Likelihood Functions, Markov Chains, Mathematical Concepts, Statistical Models, Phylogeny, Sequence Alignment/statistics & numerical data
8.
Syst Biol ; 68(2): 281-297, 2019 Mar 1.
Article in English | MEDLINE | ID: mdl-30247732

ABSTRACT

With advances in sequencing technologies, there are now massive amounts of genomic data from across all life, leading to the possibility that a robust Tree of Life can be constructed. However, "gene tree heterogeneity", in which different genomic regions evolve differently, is a common phenomenon in multi-locus data sets and reduces the accuracy of standard methods for species tree estimation that do not take this heterogeneity into account. New methods have been developed for species tree estimation that specifically address gene tree heterogeneity, and that have been proven to converge to the true species tree when the number of loci and number of sites per locus both increase (i.e., the methods are said to be "statistically consistent"). Yet, little is known about the biologically realistic condition where the number of sites per locus is bounded. We show that when the sequence length of each locus is bounded (by any arbitrarily chosen value), the most common approaches to species tree estimation that take heterogeneity into account (i.e., traditional fully partitioned concatenated maximum likelihood and newer approaches, called summary methods, that estimate the species tree by combining estimated gene trees) are not statistically consistent, even when the heterogeneity is extremely constrained. The main challenge is the presence of conditions such as long branch attraction that create biased tree estimation when the number of sites is restricted. Hence, our study uncovers a fundamental challenge to species tree estimation using both traditional and new methods.


Subjects
Classification/methods, Phylogeny, Likelihood Functions, Genetic Models
9.
Proc Natl Acad Sci U S A ; 115(41): 10299-10304, 2018 Oct 9.
Article in English | MEDLINE | ID: mdl-30254152

ABSTRACT

To sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like [Formula: see text], where n is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is [Formula: see text]. We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from a random walk sample of the nodes. These theoretical results point the way to entirely different classes of estimators that account for the network structure beyond node degree. Diagnostic plots help to identify situations where feasible GLS estimators are more appropriate. The computational experiments show the potential benefits and also indicate that there is room to further develop these estimators in practical settings.

10.
Genetics ; 210(2): 665-682, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30064984

ABSTRACT

The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to zero or diverge to infinity, and show undesirable sensitivity to perturbations in the data. The goal of this article is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographies and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model, and generalize our intuition to arbitrary sample sizes using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size n under an arbitrary population history can be recapitulated by a piecewise-constant demography with only [Formula: see text] epochs, where [Formula: see text] is between [Formula: see text] and [Formula: see text]. The set of expected SFS for piecewise-constant demographies with fewer than [Formula: see text] epochs is open and nonconvex, which causes the above phenomena for inference from data.


Subjects
Gene Frequency, Genetic Models, Population/genetics, Humans
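To make the summary statistic discussed in entry 10 concrete, here is a minimal sketch of how an observed SFS can be tabulated from a sample. It assumes the data are given as n haplotypes coded 0/1, with 1 marking the mutant allele; the function name `sample_frequency_spectrum` and this input format are choices made for the example, not something taken from the article, and the sketch computes the empirical SFS only, not the expected SFS under a demographic model.

```python
def sample_frequency_spectrum(haplotypes):
    """Tabulate the sample frequency spectrum of a 0/1 haplotype matrix.

    haplotypes: list of n equal-length sequences of 0/1 values, where 1
    marks the mutant allele (hypothetical input format for this sketch).
    Returns a list of length n - 1 whose entry k - 1 is the number of
    sites at which exactly k of the n sampled sequences are mutant.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        mutant_count = sum(h[site] for h in haplotypes)
        # Sites with 0 or n mutant copies are not segregating in the sample.
        if 0 < mutant_count < n:
            sfs[mutant_count - 1] += 1
    return sfs
```

With three haplotypes `[1, 0, 1, 0]`, `[1, 0, 0, 0]` and `[0, 0, 1, 1]`, one site carries a single mutant copy and two sites carry two, so the function returns `[1, 2]`.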
11.
IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1738-1747, 2018.
Article in English | MEDLINE | ID: mdl-28976320

ABSTRACT

Species tree reconstruction from genomic data is increasingly performed using methods that account for sources of gene tree discordance such as incomplete lineage sorting. One popular method for reconstructing species trees from unrooted gene tree topologies is ASTRAL. In this paper, we derive theoretical sample complexity results for the number of genes required by ASTRAL to guarantee reconstruction of the correct species tree with high probability. We also validate those theoretical bounds in a simulation study. Our results indicate that ASTRAL requires [Formula: see text] gene trees to reconstruct the species tree correctly with high probability, where [Formula: see text] is the number of species and [Formula: see text] is the length of the shortest branch in the species tree. Our simulations, some under the anomaly zone, show trends consistent with the theoretical bounds and also provide some practical insights on the conditions where ASTRAL works well.


Subjects
Computational Biology/methods, Genetic Speciation, Genetic Models, Phylogeny, Algorithms, Computer Simulation, Software
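The role of the shortest branch length in entry 11 can be illustrated with the standard multispecies coalescent formula for four taxa: an unrooted gene tree quartet matches the species tree with probability 1 - (2/3)e^(-f), where f is the internal branch length in coalescent units, and each of the two alternative topologies has probability (1/3)e^(-f). The sketch below encodes that formula together with a simple majority vote over observed quartet topologies. The function names and the string labels for topologies are assumptions made for the example; this is not ASTRAL itself, which optimizes a quartet score over whole trees.

```python
import math
from collections import Counter


def quartet_probabilities(f):
    """Quartet topology probabilities under the multispecies coalescent.

    f: internal branch length of the 4-taxon species tree, in coalescent
    units. Returns the probability of the topology matching the species
    tree and of each of the two alternative topologies.
    """
    alt = math.exp(-f) / 3.0
    return {"match": 1.0 - 2.0 * alt, "alt1": alt, "alt2": alt}


def dominant_quartet(observed_topologies):
    """Return the most frequent quartet topology among the gene trees.

    observed_topologies: iterable of hashable labels such as "ab|cd"
    (hypothetical encoding for this sketch).
    """
    counts = Counter(observed_topologies)
    return counts.most_common(1)[0][0]
```

The gap between the matching topology and each alternative is 1 - e^(-f), which shrinks to zero as f does; this is why more gene trees are needed to resolve short branches.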
12.
J Math Biol ; 74(1-2): 355-385, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27241727

ABSTRACT

Diffusion processes on trees are commonly used in evolutionary biology to model the joint distribution of continuous traits, such as body mass, across species. Estimating the parameters of such processes from tip values presents challenges because of the intrinsic correlation between the observations produced by the shared evolutionary history, which violates the standard independence assumption of large-sample theory. For instance, Ho and Ané (Ann Stat 41:957-981, 2013) recently proved that the mean (also known in this context as selection optimum) of an Ornstein-Uhlenbeck process on a tree cannot be estimated consistently from an increasing number of tip observations if the tree height is bounded. Here, using a fruitful connection to the so-called reconstruction problem in probability theory, we study the convergence rate of parameter estimation in the unbounded height case. For the mean of the process, we provide a necessary and sufficient condition for the consistency of the maximum likelihood estimator (MLE) and establish a phase transition on its convergence rate in terms of the growth of the tree. In particular, we show that a loss of [Formula: see text]-consistency (i.e., the variance of the MLE becomes [Formula: see text], where n is the number of tips) occurs when the tree growth is larger than a threshold related to the phase transition of the reconstruction problem. For the covariance parameters, we give a novel, efficient estimation method which achieves [Formula: see text]-consistency under natural assumptions on the tree. Our theoretical results provide practical suggestions for the design of comparative data collection.


Subjects
Biological Models, Phylogeny, Phenotype, Probability
13.
Article in English | MEDLINE | ID: mdl-26357228

ABSTRACT

We consider the problem of estimating the evolutionary history of a set of species (phylogeny or species tree) from several genes. It is known that the evolutionary history of individual genes (gene trees) might be topologically distinct from each other and from the underlying species tree, possibly confounding phylogenetic analysis. A further complication in practice is that one has to estimate gene trees from molecular sequences of finite length. We provide the first full data-requirement analysis of a species tree reconstruction method that takes into account estimation errors at the gene level. Under that criterion, we also devise a novel reconstruction algorithm that provably improves over all previous methods in a regime of interest.


Subjects
Computational Biology/methods, Molecular Evolution, Genetic Loci/genetics, Phylogeny, Algorithms, Genetic Databases, Genetic Speciation
14.
Syst Biol ; 64(4): 663-76, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25813358

ABSTRACT

The estimation of species trees using multiple loci has become increasingly common. Because different loci can have different phylogenetic histories (reflected in different gene tree topologies) for multiple biological causes, new approaches to species tree estimation have been developed that take gene tree heterogeneity into account. Among these multiple causes, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is potentially the most common cause of gene tree heterogeneity, and much of the focus of the recent literature has been on how to estimate species trees in the presence of ILS. Despite progress in developing statistically consistent techniques for estimating species trees when gene trees can differ due to ILS, there is substantial controversy in the systematics community as to whether to use the new coalescent-based methods or the traditional concatenation methods. One of the key issues that has been raised is understanding the impact of gene tree estimation error on coalescent-based methods that operate by combining gene trees. Here we explore the mathematical guarantees of coalescent-based methods when analyzing estimated rather than true gene trees. Our results provide some insight into the differences between the promise of coalescent-based methods in theory and their performance in practice.


Subjects
Classification/methods, Phylogeny, Computer Simulation, Genes/genetics
15.
Theor Popul Biol ; 100C: 56-62, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25545843

ABSTRACT

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multispecies coalescent with a standard site substitution model, maximum likelihood estimation on sequence data that has been concatenated across genes and performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).

16.
J Comput Biol ; 20(2): 93-112, 2013 Feb.
Article in English | MEDLINE | ID: mdl-23383996

ABSTRACT

Lateral gene transfer (LGT) is a common mechanism of nonvertical evolution, during which genetic material is transferred between two more or less distantly related organisms. It is particularly common in bacteria where it contributes to adaptive evolution with important medical implications. In evolutionary studies, LGT has been shown to create widespread discordance between gene trees as genomes become mosaics of gene histories. In particular, the Tree of Life has been questioned as an appropriate representation of bacterial evolutionary history. Nevertheless, a common hypothesis is that prokaryotic evolution is primarily treelike, but that the underlying trend is obscured by LGT. Extensive empirical work has sought to extract a common treelike signal from conflicting gene trees. Here we give a probabilistic perspective on the problem of recovering the treelike trend despite LGT. Under a model of randomly distributed LGT, we show that the species phylogeny can be reconstructed even in the presence of surprisingly many (an almost linear number of) LGT events per gene tree. Our results, which are optimal up to logarithmic factors, are based on the analysis of a robust, computationally efficient reconstruction method and provide insight into the design of such methods. Finally, we show that our results have implications for the discovery of highways of gene sharing.


Subjects
Bacteria/classification, Bacteria/genetics, Horizontal Gene Transfer, Bacterial Genome, Statistical Models, Phylogeny, Biological Evolution, Genetic Models
17.
Pac Symp Biocomput ; : 297-306, 2013.
Article in English | MEDLINE | ID: mdl-23424134

ABSTRACT

Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. Numerous approaches have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. The analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths. It suggests that maximum likelihood (and some equivalents) can be significantly more accurate in this setting than other methods, especially as ILS gets more pronounced.


Subjects
Genetic Models, Phylogeny, Computational Biology, Computer Simulation, Likelihood Functions, Statistical Models
18.
J Math Biol ; 67(4): 767-97, 2013 Oct.
Article in English | MEDLINE | ID: mdl-22875145

ABSTRACT

Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction. Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.


Subjects
Statistical Data Interpretation, Molecular Evolution, Genetic Models, Mutation, Phylogeny, Algorithms, Base Sequence/genetics, Cluster Analysis
19.
Bull Math Biol ; 73(7): 1627-44, 2011 Jul.
Article in English | MEDLINE | ID: mdl-20931293

ABSTRACT

The accurate reconstruction of phylogenies from short molecular sequences is an important problem in computational biology. Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In Daskalakis et al. (in Probab. Theory Relat. Fields 2010), building on the work of Mossel (Trans. Am. Math. Soc. 356(6):2379-2404, 2004), a tight sequence-length requirement was obtained for the simple CFN model of substitution, that is, the case of a two-state symmetric rate matrix Q. In particular, the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(log n) to poly(n), where n is the number of leaves) at the "critical" branch length g_ML(Q) (if it exists) of the ancestral reconstruction problem, defined roughly as follows: below g_ML(Q) the sequence at the root can be accurately estimated from sequences at the leaves on deep trees, whereas above g_ML(Q) information decays exponentially quickly down the tree. Here, we consider a more general evolutionary model, the GTR model, where the q×q rate matrix Q is reversible with q≥2. For this model, recent results of Roch (Preprint, 2009) show that the tree can be accurately reconstructed with sequences of length O(log n) when the branch lengths are below g_Lin(Q), known as the Kesten-Stigum (KS) bound, up to which ancestral sequences can be accurately estimated using simple linear estimators. Although for the CFN model g_ML(Q)=g_Lin(Q) (in other words, linear ancestral estimators are in some sense best possible), it is known that for the more general GTR models one has g_ML(Q)≥g_Lin(Q) with a strict inequality in many cases. We show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm which recovers the tree from O(log n)-length sequences for some branch lengths in the range (g_Lin(Q), g_ML(Q)). We also prove that phylogenetic reconstruction under GTR models requires a polynomial sequence length for branch lengths above g_ML(Q).


Subjects
Base Sequence, Molecular Evolution, Genetic Models, Phylogeny, DNA/chemistry, DNA/genetics, Markov Chains
20.
Science ; 327(5971): 1376-9, 2010 Mar 12.
Article in English | MEDLINE | ID: mdl-20223986

ABSTRACT

The matrix of evolutionary distances is a model-based statistic, derived from molecular sequences, summarizing the pairwise phylogenetic relations between a collection of species. Phylogenetic tree reconstruction methods relying on this matrix are relatively fast and thus widely used in molecular systematics. However, because of their intrinsic reliance on summary statistics, distance-matrix methods are assumed to be less accurate than likelihood-based approaches. In this paper, pairwise sequence comparisons are shown to be more powerful than previously hypothesized. A statistical analysis of certain distance-based techniques indicates that their data requirement for large evolutionary trees essentially matches the conjectured performance of maximum likelihood methods, challenging the idea that summary statistics lead to suboptimal analyses. On the basis of a connection between ancestral state reconstruction and distance averaging, the critical role played by the covariances of the distance matrix is identified.


Subjects
Algorithms, Computational Biology, Molecular Evolution, Phylogeny, Base Sequence, Biological Evolution, DNA/genetics, Likelihood Functions, Statistical Models, Mutation, Sequence Alignment, Software