1.
Syst Biol ; 72(5): 1171-1179, 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37254872

ABSTRACT

We consider the evolution of phylogenetic gene trees along phylogenetic species networks, according to the network multispecies coalescent process, and introduce a new network coalescent model with correlated inheritance of gene flow. This model generalizes two traditional versions of the network coalescent: with independent or common inheritance. At each reticulation, multiple lineages of a given locus are inherited from parental populations chosen at random, either independently across lineages or with positive correlation according to a Dirichlet process. This process may account for locus-specific probabilities of inheritance, for example. We implemented the simulation of gene trees under these network coalescent models in the Julia package PhyloCoalSimulations, which depends on PhyloNetworks and its powerful network manipulation tools. Input species phylogenies can be read in extended Newick format, either in numbers of generations or in coalescent units. Simulated gene trees can be written in Newick format, and in a way that preserves information about their embedding within the species network. This embedding can be used for downstream purposes, such as to simulate species-specific processes like rate variation across species, or for other scenarios as illustrated in this note. This package should be useful for simulation studies and simulation-based inference methods. The software is available open source with documentation and a tutorial at https://github.com/cecileane/PhyloCoalSimulations.jl.


Subjects
Gene Flow , Software , Phylogeny , Computer Simulation , Probability , Genetic Models
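
As a rough illustration of the correlated inheritance described in this abstract, the sketch below assigns the lineages arriving at a single reticulation to one of two parental populations, using a Chinese-restaurant-style draw from a Dirichlet process. The function name `assign_parents`, the inheritance probability `gamma`, and the mapping from a correlation `r` to a concentration `alpha = (1 - r) / r` are assumptions made for illustration. They are not taken from PhyloCoalSimulations and should be checked against its documentation.

```python
import random

def assign_parents(n_lineages, gamma, r, rng=None):
    """Assign each lineage at one reticulation to parent 0 or parent 1.

    gamma: probability that a fresh draw picks parent 0 (assumed name).
    r: correlation in [0, 1]; r = 0 gives independent draws and
       r = 1 gives common inheritance (all lineages share one parent).
    """
    rng = rng or random.Random()
    choices = []
    for i in range(n_lineages):
        # Probability of a fresh draw from the base distribution:
        # alpha / (alpha + i) with alpha = (1 - r) / r, rewritten as
        # (1 - r) / (1 - r + r * i) to avoid dividing by zero when r = 0.
        p_fresh = 1.0 if i == 0 else (1.0 - r) / (1.0 - r + r * i)
        if rng.random() < p_fresh:
            choices.append(0 if rng.random() < gamma else 1)
        else:
            # Otherwise copy the parent of a uniformly chosen earlier lineage.
            choices.append(rng.choice(choices))
    return choices
```

With `r = 0` every lineage draws its parent independently, and with `r = 1` all lineages follow the first one, which matches the two traditional versions that the abstract says the model generalizes.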
2.
Bull Math Biol ; 86(9): 110, 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39052074

ABSTRACT

When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors, the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network. Addressing these identifiability issues is essential for designing statistically consistent inference methods.


Subjects
Horizontal Gene Transfer , Mathematical Concepts , Genetic Models , Phylogeny , Molecular Evolution , Genetic Speciation , Gene Regulatory Networks , Computer Simulation , Genetic Hybridization
3.
J Math Biol ; 88(3): 29, 2024 Feb 19.
Article in English | MEDLINE | ID: mdl-38372830

ABSTRACT

Reticulations in a phylogenetic network represent processes such as gene flow, admixture, recombination and hybrid speciation. Extending definitions from the tree setting, an anomalous network is one in which some unrooted tree topology displayed in the network appears in gene trees with a lower frequency than a tree not displayed in the network. We investigate anomalous networks under the Network Multispecies Coalescent Model with possible correlated inheritance at reticulations. Focusing on subsets of 4 taxa, we describe a new algorithm to calculate quartet concordance factors on networks of any level, faster than previous algorithms because of its focus on 4 taxa. We then study topological properties required for a 4-taxon network to be anomalous, uncovering the key role of 3_2-cycles: cycles of 3 edges parent to a sister group of 2 taxa. Under the model of common inheritance, that is, when each gene tree coalesces within a species tree displayed in the network, we prove that 4-taxon networks are never anomalous. Under independent and various levels of correlated inheritance, we use simulations under realistic parameters to quantify the prevalence of anomalous 4-taxon networks, finding that truly anomalous networks are rare. At the same time, however, we find a significant fraction of networks close enough to the anomaly zone to appear anomalous, when considering the quartet concordance factors observed from a few hundred genes. These apparent anomalies may challenge network inference methods.


Subjects
Algorithms , Prevalence , Phylogeny
4.
Syst Biol ; 71(4): 929-942, 2022 Jun 16.
Article in English | MEDLINE | ID: mdl-33560348

ABSTRACT

A simple graphical device, the simplex plot of quartet concordance factors, is introduced to aid in the exploration of a collection of gene trees on a common set of taxa. A single plot summarizes all gene tree discord and allows for visual comparison to the expected discord from the multispecies coalescent model (MSC) of incomplete lineage sorting on a species tree. A formal statistical procedure is described that can quantify the deviation from expectation for each subset of four taxa, suggesting when the data are not in accord with the MSC, and thus that either gene tree inference error is substantial or a more complex model such as that on a network may be required. If the collection of gene trees is in accord with the MSC, the plots reveal when substantial incomplete lineage sorting is present. Applications to both simulated and empirical multilocus data sets illustrate the insights provided. [Gene tree discordance; hypothesis test; multispecies coalescent model; quartet concordance factor; simplex plot; species tree].


Subjects
Genetic Speciation , Genetic Models , Computer Simulation , Phylogeny
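
To make the simplex plot in this abstract concrete, the sketch below computes the quartet concordance factors expected under the MSC on a four-taxon species tree with internal branch length `t` in coalescent units, and maps any triple of concordance factors to planar coordinates inside an equilateral triangle. The function names and the placement of the triangle's vertices are illustrative assumptions, not the layout used by any particular package.

```python
import math

def expected_quartet_cfs(t):
    """Expected CFs under the MSC for a quartet species tree ab|cd.

    t is the internal branch length in coalescent units. The topology
    matching the species tree has probability 1 - (2/3) exp(-t); the
    two discordant topologies share the remainder equally.
    """
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)

def simplex_xy(cf):
    """Map a CF triple to 2D coordinates by barycentric combination.

    Vertices (assumed layout): (0, 0), (1, 0), and (0.5, sqrt(3)/2).
    The triple is renormalized so that counts can be passed directly.
    """
    c1, c2, c3 = cf
    total = c1 + c2 + c3
    x = (c2 + 0.5 * c3) / total
    y = (math.sqrt(3.0) / 2.0) * c3 / total
    return (x, y)
```

Because the two minor concordance factors are equal under the MSC on a tree, expected points fall on the segments joining the centroid to a vertex. Observed points far from those segments are the ones a formal test would be expected to flag.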
5.
Bioinformatics ; 37(12): 1766-1768, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33031510

ABSTRACT

SUMMARY: MSCquartets is an R package for species tree hypothesis testing, inference of species trees and inference of species networks under the Multispecies Coalescent model of incomplete lineage sorting and its network analog. Input for these analyses are collections of metric or topological locus trees which are then summarized by the quartets displayed on them. Results of hypothesis tests at user-supplied levels are displayed in a simplex plot by color-coded points. The package implements the QDC and WQDC algorithms for topological and metric species tree inference, and the NANUQ algorithm for level-1 topological species network inference, all of which give statistically consistent estimators under the model. AVAILABILITY AND IMPLEMENTATION: MSCquartets is available through the Comprehensive R Archive Network: https://CRAN.R-project.org/package=MSCquartets.


Subjects
Algorithms , Genetic Models , Phylogeny
6.
J Math Biol ; 84(5): 35, 2022 Apr 07.
Article in English | MEDLINE | ID: mdl-35385988

ABSTRACT

Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees. It applies to both unlinked site data, such as for SNPs, and to sequence data in which many contiguous sites may have evolved on a common tree, such as concatenated gene sequences. Thus under standard stochastic models statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.


Subjects
Genomics , Genetic Models , Phylogeny
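
For readers unfamiliar with logDet distances, the sketch below computes one for a pair of aligned DNA sequences from the 4x4 matrix of joint base frequencies. The normalization shown is one common choice and is an assumption here; the abstract does not state which variant the authors use. The helper names are hypothetical, and the code assumes that every base occurs in both sequences and that the frequency matrix is nonsingular.

```python
import math

BASES = "ACGT"

def det(m):
    """Determinant by cofactor expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def logdet_distance(seq_a, seq_b):
    """LogDet distance between two aligned sequences over A, C, G, T.

    Uses -1/4 * (ln|det F| - 1/2 * ln(g_a * g_b)), where F holds the joint
    base frequencies and g_a, g_b are the products of the marginal base
    frequencies of each sequence. Other normalizations exist.
    """
    idx = {b: i for i, b in enumerate(BASES)}
    n = len(seq_a)
    f = [[0.0] * 4 for _ in range(4)]
    for x, y in zip(seq_a, seq_b):
        f[idx[x]][idx[y]] += 1.0 / n
    g_a = math.prod(sum(row) for row in f)
    g_b = math.prod(sum(f[i][j] for i in range(4)) for j in range(4))
    return -0.25 * (math.log(abs(det(f))) - 0.5 * math.log(g_a * g_b))
```

Only pairwise site-pattern frequencies enter the computation, which is why such distances can be taken from concatenated or unlinked site data without first estimating individual gene trees.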
7.
J Math Biol ; 86(1): 10, 2022 Dec 06.
Article in English | MEDLINE | ID: mdl-36472708

ABSTRACT

Inference of species networks from genomic data under the Network Multispecies Coalescent Model is currently severely limited by heavy computational demands. It also remains unclear how complicated networks can be for consistent inference to be possible. As a step toward inferring a general species network, this work considers its tree of blobs, in which non-cut edges are contracted to nodes, so only tree-like relationships between the taxa are shown. An identifiability theorem, that most features of the unrooted tree of blobs can be determined from the distribution of gene quartet topologies, is established. This depends upon an analysis of gene quartet concordance factors under the model, together with a new combinatorial inference rule. The arguments for this theoretical result suggest a practical algorithm for tree of blobs inference, to be fully developed in a subsequent work.


Subjects
Genomics
8.
Syst Biol ; 66(4): 620-636, 2017 Jul 01.
Article in English | MEDLINE | ID: mdl-28123114

ABSTRACT

Detecting variation in the evolutionary process along chromosomes is increasingly important as whole-genome data become more widely available. For example, factors such as incomplete lineage sorting, horizontal gene transfer, and chromosomal inversion are expected to result in changes in the underlying gene trees along a chromosome, while changes in selective pressure and mutational rates for different genomic regions may lead to shifts in the underlying mutational process. We propose the split score as a general method for quantifying support for a particular phylogenetic relationship within a genomic data set. Because the split score is based on algebraic properties of a matrix of site pattern frequencies, it can be rapidly computed, even for data sets that are large in the number of taxa and/or in the length of the alignment, providing an advantage over other methods (e.g., maximum likelihood) that are often used to assess such support. Using simulation, we explore the properties of the split score, including its dependence on sequence length, branch length, size of a split and its ability to detect true splits in the underlying tree. Using a sliding window analysis, we show that split scores can be used to detect changes in the underlying evolutionary process for genome-scale data from primates, mosquitoes, and viruses in a computationally efficient manner. Computation of the split score has been implemented in the software package SplitSup.


Subjects
Classification/methods , Phylogeny , Animals , Culicidae/classification , Culicidae/genetics , Molecular Evolution , Horizontal Gene Transfer , Genome/genetics , Primates/classification , Primates/genetics , Software , Viruses/classification , Viruses/genetics
9.
Bull Math Biol ; 80(1): 64-103, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29127546

ABSTRACT

Using topological summaries of gene trees as a basis for species tree inference is a promising approach to obtain acceptable speed on genomic-scale datasets, and to avoid some undesirable modeling assumptions. Here we study the probabilities of splits on gene trees under the multispecies coalescent model, and how their features might inform species tree inference. After investigating the behavior of split consensus methods, we investigate split invariants, that is, polynomial relationships between split probabilities. These invariants are then used to show that, even though a split is an unrooted notion, split probabilities retain enough information to identify the rooted species tree topology for trees of 5 or more taxa, with one possible 6-taxon exception.


Subjects
Genetic Models , Phylogeny , Molecular Evolution , Genetic Speciation , Linear Models , Mathematical Concepts , Probability
10.
bioRxiv ; 2024 Apr 24.
Article in English | MEDLINE | ID: mdl-38712257

ABSTRACT

The tree of blobs of a species network shows only the tree-like aspects of relationships of taxa on a network, omitting information on network substructures where hybridization or other types of lateral transfer of genetic information occur. By isolating such regions of a network, inference of the tree of blobs can serve as a starting point for a more detailed investigation, or indicate the limit of what may be inferrable without additional assumptions. Building on our theoretical work on the identifiability of the tree of blobs from gene quartet distributions under the Network Multispecies Coalescent model, we develop an algorithm, TINNiK, for statistically consistent tree of blobs inference. We provide examples of its application to both simulated and empirical datasets, utilizing an implementation in the MSCquartets 2.0 R package.

11.
ArXiv ; 2024 Jan 11.
Article in English | MEDLINE | ID: mdl-38259350

ABSTRACT

When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors, the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network.

12.
Syst Biol ; 61(6): 1049-59, 2012 Dec 01.
Article in English | MEDLINE | ID: mdl-22798332

ABSTRACT

Phylogenetic mixture models, in which the sites in sequences undergo different substitution processes along the same or different trees, allow the description of heterogeneous evolutionary processes. As data sets consisting of longer sequences become available, it is important to understand such models, for both theoretical insights and use in statistical analyses. Some recent articles have highlighted disturbing "mimicking" behavior in which a distribution from a mixture model is identical to one arising on a different tree or trees. Other works have indicated such problems are unlikely to occur in practice, as they require very special parameter choices. After surveying some of these works on mixture models, we give several new results. In general, if the number of components in a generating mixture is not too large and we disallow zero or infinite branch lengths, then it cannot mimic the behavior of a nonmixture on a different tree. On the other hand, if the mixture model is locally overparameterized, it is possible for a phylogenetic mixture model to mimic distributions of another tree model. Although theoretical questions remain, these sorts of results can serve as a guide to when the use of mixture models in either maximum likelihood or Bayesian frameworks is likely to lead to statistically consistent inference, and when mimicking due to heterogeneity should be considered a realistic possibility. [Phylogenetic mixture models; parameter identifiability; heterogeneous sequence evolution.].


Subjects
Biological Models , Phylogeny , Algorithms , Computer Simulation , Statistical Data Interpretation
13.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1613-1618, 2023.
Article in English | MEDLINE | ID: mdl-35617176

ABSTRACT

As genomic-scale datasets motivate research on species tree inference, simulators of the multispecies coalescent (MSC) process have become essential for the testing and evaluation of new inference methods. However, the simulators themselves must be tested to ensure that they give valid samples. This work develops methods for checking whether a collection of gene trees is in accord with the MSC model on a given species tree. When applied to well-known simulators, we find that several give flawed samples. The tests presented are capable of validating both topological and metric properties of gene tree samples, and are implemented in a freely available R package MSCsimtester so that developers and users may easily apply them.

14.
J Comput Biol ; 30(3): 277-292, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36745414

ABSTRACT

Diversification models describe the random growth of evolutionary trees, modeling the historical relationships of species through speciation and extinction events. One class of such models allows for independently changing traits, or types, of the species within the tree, upon which speciation and extinction rates depend. Although identifiability of parameters is necessary to justify parameter estimation with a model, it has not been formally established for these models, despite their adoption for inference. This work establishes generic identifiability up to label swapping for the parameters of one of the simpler forms of such a model, a multitype pure birth model of speciation, from an asymptotic distribution derived from a single tree observation as its depth goes to infinity. Crucially for applications to available data, no observation of types is needed at any internal points in the tree, nor even at the leaves.


Subjects
Biological Evolution , Genetic Speciation , Phylogeny , Biological Models
15.
ArXiv ; 2023 Mar 13.
Article in English | MEDLINE | ID: mdl-36994151

ABSTRACT

A model of genomic sequence evolution on a species tree should include not only a sequence substitution process, but also a coalescent process, since different sites may evolve on different gene trees due to incomplete lineage sorting. Chifman and Kubatko initiated the study of such models, leading to the development of the SVDquartets methods of species tree inference. A key observation was that symmetries in an ultrametric species tree led to symmetries in the joint distribution of bases at the taxa. In this work, we explore the implications of such symmetry more fully, defining new models incorporating only the symmetries of this distribution, regardless of the mechanism that might have produced them. The models are thus supermodels of many standard ones with mechanistic parameterizations. We study phylogenetic invariants for the models, and establish identifiability of species tree topologies using them.

16.
bioRxiv ; 2023 Aug 21.
Article in English | MEDLINE | ID: mdl-37662314

ABSTRACT

Reticulations in a phylogenetic network represent processes such as gene flow, admixture, recombination and hybrid speciation. Extending definitions from the tree setting, an anomalous network is one in which some unrooted tree topology displayed in the network appears in gene trees with a lower frequency than a tree not displayed in the network. We investigate anomalous networks under the Network Multispecies Coalescent Model with possible correlated inheritance at reticulations. Focusing on subsets of 4 taxa, we describe a new algorithm to calculate quartet concordance factors on networks of any level, faster than previous algorithms because of its focus on 4 taxa. We then study topological properties required for a 4-taxon network to be anomalous, uncovering the key role of 3_2-cycles: cycles of 3 edges parent to a sister group of 2 taxa. Under the model of common inheritance, that is, when each gene tree coalesces within a species tree displayed in the network, we prove that 4-taxon networks are never anomalous. Under independent and various levels of correlated inheritance, we use simulations under realistic parameters to quantify the prevalence of anomalous 4-taxon networks, finding that truly anomalous networks are rare. At the same time, however, we find a significant fraction of networks close enough to the anomaly zone to appear anomalous, when considering the quartet concordance factors observed from a few hundred genes. These apparent anomalies may challenge network inference methods.

17.
J Theor Biol ; 289: 96-106, 2011 Nov 21.
Article in English | MEDLINE | ID: mdl-21867714

ABSTRACT

One approach to estimating a species tree from a collection of gene trees is to first estimate probabilities of clades from the gene trees, and then to construct the species tree from the estimated clade probabilities. While a greedy consensus algorithm, which consecutively accepts the most probable clades compatible with previously accepted clades, can be used for this second stage, this method is known to be statistically inconsistent under the multispecies coalescent model. This raises the question of whether it is theoretically possible to reconstruct the species tree from known probabilities of clades on gene trees. We investigate clade probabilities arising from the multispecies coalescent model, with an eye toward identifying features of the species tree. Clades on gene trees with probability greater than 1/3 are shown to reflect clades on the species tree, while those with smaller probabilities may not. Linear invariants of clade probabilities are studied both computationally and theoretically, with certain linear invariants giving insight into the clade structure of the species tree. For species trees with generic edge lengths, these invariants can be used to identify the species tree topology. These theoretical results both confirm that clade probabilities contain full information on the species tree topology and suggest future directions of study for developing statistically consistent inference methods from clade frequencies on gene trees.


Subjects
Biological Evolution , Genetic Speciation , Genetic Models , Algorithms , Animals , Phylogeny
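
The greedy consensus step mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes clades are given as frozensets of taxon labels with estimated probabilities; the function name and input format are hypothetical. As the abstract notes, the procedure is statistically inconsistent under the multispecies coalescent, so the code shows the mechanics only.

```python
def greedy_consensus(clade_probs):
    """Greedy consensus from a dict mapping frozenset clades to probabilities.

    Clades are visited in order of decreasing probability and kept when
    compatible (nested or disjoint) with every clade accepted so far.
    """
    accepted = []
    ranked = sorted(clade_probs.items(), key=lambda kv: kv[1], reverse=True)
    for clade, _prob in ranked:
        if all(clade <= c or c <= clade or not (clade & c) for c in accepted):
            accepted.append(clade)
    return accepted
```

For example, with clades {a,b,c}, {a,b}, {d,e} and {b,c} ranked in that order, the first three are accepted and {b,c} is rejected because it overlaps {a,b} without either containing the other.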
18.
J Math Biol ; 62(6): 833-62, 2011 Jun.
Article in English | MEDLINE | ID: mdl-20652704

ABSTRACT

Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals, each with many genes, splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are four species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendant branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.


Subjects
Biological Evolution , Genetic Speciation , Genetic Models , Phylogeny
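
The four-species case in this abstract can be illustrated with a short calculation. Under the multispecies coalescent with one gene per species, the unrooted gene tree topology matching the species tree has probability 1 - (2/3) exp(-t), where t is the internal branch length of the unrooted species tree in coalescent units, so that length can be recovered from the probability. The function name below is hypothetical, and the sketch ignores sampling error in an observed frequency.

```python
import math

def internal_branch_length(p_match):
    """Invert p = 1 - (2/3) exp(-t) for t, in coalescent units.

    p_match is the probability (or observed frequency) of the unrooted
    gene quartet topology that matches the species tree; it must lie in
    [1/3, 1) for the result to be a finite nonnegative length.
    """
    if not (1.0 / 3.0 <= p_match < 1.0):
        raise ValueError("p_match must be in [1/3, 1)")
    return -math.log(1.5 * (1.0 - p_match))
```

Only this single length is recovered from the three topology probabilities, which is consistent with the abstract's statement that some, but not all, branch length information is identifiable for four species.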
19.
J Comput Biol ; 28(6): 570-586, 2021 Jun.
Article in English | MEDLINE | ID: mdl-33960831

ABSTRACT

A profile mixture (PM) model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend, in part, on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here, using algebraic methods, we show that a PM model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.


Subjects
Computational Biology/methods , Molecular Evolution , Protein Sequence Analysis/methods , Animals , Humans , Markov Chains , Proteins/chemistry , Proteins/genetics
20.
J Theor Biol ; 263(1): 108-19, 2010 Mar 07.
Article in English | MEDLINE | ID: mdl-20004210

ABSTRACT

As an alternative to parsimony analyses, stochastic models have been proposed (Lewis, 2001; Nylander et al., 2004) for morphological characters, so that maximum likelihood or Bayesian analyses may be used for phylogenetic inference. A key feature of these models is that they account for ascertainment bias, in that only varying, or parsimony-informative characters are observed. However, statistical consistency of such model-based inference requires that the model parameters be identifiable from the joint distribution they entail, and this issue has not been addressed. Here we prove that parameters for several such models, with finite state spaces of arbitrary size, are identifiable, provided the tree has at least eight leaves. If the tree topology is already known, then seven leaves suffice for identifiability of the numerical parameters. The method of proof involves first inferring a full distribution of both parsimony-informative and non-informative pattern joint probabilities from the parsimony-informative ones, using phylogenetic invariants. The failure of identifiability of the tree parameter for four-taxon trees is also investigated.


Subjects
Computational Biology/methods , Phylogeny , Algorithms , Bayes Theorem , Classification , Likelihood Functions , Markov Chains , Genetic Models , Statistical Models , Theoretical Models , Probability , Terminology as Topic