Results 1 - 20 of 48
1.
Bull Math Biol ; 86(8): 99, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38954147

ABSTRACT

Classification of gene trees is an important task both in the analysis of multi-locus phylogenetic data and in the assessment of the convergence of Markov Chain Monte Carlo (MCMC) analyses used in Bayesian phylogenetic tree reconstruction. The logistic regression model is one of the most popular classification models in statistical learning, thanks to its computational speed and interpretability. However, it is not appropriate to directly apply the standard logistic regression model to a set of phylogenetic trees, as the space of phylogenetic trees is non-Euclidean and thus contradicts the standard assumptions on covariates. It is well-known in tropical geometry and phylogenetics that the space of phylogenetic trees is a tropical linear space in terms of the max-plus algebra. Therefore, in this paper, we propose an analogue of the logistic regression model in the setting of tropical geometry. Our proposed method outperforms classical logistic regression in terms of Area under the ROC Curve in numerical examples, including on data generated by the multi-species coalescent model. Theoretical properties such as statistical consistency have been proved and generalization error rates have been derived. Finally, our classification algorithm is proposed as an MCMC convergence criterion for MrBayes. Unlike the convergence metric used by MrBayes, which depends only on tree topologies, our method is sensitive to branch lengths and therefore provides a more robust metric for convergence. In a test case, it is illustrated that the tropical logistic regression can differentiate between two independently run MCMC chains, even when the standard metric cannot.
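
As a rough illustration of the max-plus setting mentioned in this abstract, the sketch below implements tropical addition and multiplication together with the tropical metric d_tr(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i), a standard distance in tropical geometry. The abstract does not say which quantities the proposed model is built from, so the function names and the choice of this metric are assumptions made for illustration, not the authors' implementation.

```python
def tropical_add(a, b):
    """Max-plus 'addition': the maximum of the two values."""
    return max(a, b)


def tropical_mul(a, b):
    """Max-plus 'multiplication': ordinary addition."""
    return a + b


def tropical_distance(u, v):
    """Tropical metric between two equal-length vectors.

    Computed as max_i(u_i - v_i) - min_i(u_i - v_i). Adding the same
    constant to every coordinate of u (or of v) leaves the value unchanged.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


# Hypothetical example vectors (not data from the paper).
example = tropical_distance([0, 0, 0], [0, 3, 1])
```

Because the metric ignores a common shift of all coordinates, `[0, 0, 0]` and `[5, 5, 5]` are at the same distance from any third vector.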


Subjects
Algorithms, Bayes Theorem, Markov Chains, Mathematical Concepts, Genetic Models, Monte Carlo Method, Phylogeny, Logistic Models, ROC Curve, Computer Simulation
2.
Methods Mol Biol ; 2802: 267-345, 2024.
Article in English | MEDLINE | ID: mdl-38819564

ABSTRACT

Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes. Phylogenomics has significant applications in fields such as evolutionary biology, systematics, comparative genomics, and conservation genetics, providing valuable insights into the origins and relationships of species and contributing to our understanding of biological diversity and evolution. This chapter surveys phylogenetic concepts and methods aimed at both gene tree and species tree reconstruction while also addressing common pitfalls, providing references to relevant computer programs. A practical phylogenomic analysis example including bacterial genomes is presented at the end of the chapter.


Subjects
Genomics, Phylogeny, Genomics/methods, Software, Molecular Evolution, Bacterial Genome, Computational Biology/methods, Bacteria/genetics, Bacteria/classification
3.
Discrete Appl Math ; 343: 65-81, 2024 Jan 30.
Article in English | MEDLINE | ID: mdl-38078045

ABSTRACT

To a given gene tree topology G and species tree topology S with leaves labeled bijectively from a fixed set X, one can associate a set of ancestral configurations, each of which encodes a set of gene lineages that can be found at a given node of the species tree. We introduce a lattice structure on ancestral configurations, studying the directed graphs that provide graphical representations of lattices of ancestral configurations. For a matching gene tree topology and species tree topology G=S, we present a method for defining the digraph of ancestral configurations from the tree topology by using iterated cartesian products of graphs. We show that a specific set of paths on the digraph of ancestral configurations is in bijection with the set of labeled histories - a well-known phylogenetic object that enumerates possible temporal orderings of the coalescences of a tree. For each of a series of tree families, we obtain closed-form expressions for the number of labeled histories by using this bijection to count paths on associated digraphs. Finally, we prove that our lattice construction extends to nonmatching tree pairs, and we use it to characterize pairs (G,S) having the maximal number of ancestral configurations for a fixed G. We discuss how the construction provides new methods for performing enumerations of combinatorial aspects of gene and species trees.
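
To make the notion of labeled histories concrete, here is a minimal sketch that counts them for a rooted binary tree. It uses the classical product formula (the factorial of the number of internal nodes, divided by the product over internal nodes of the number of internal nodes in each subtree) rather than the digraph-path bijection described in the abstract. The nested-tuple tree encoding and the function name are assumptions for illustration.

```python
from math import factorial


def count_labeled_histories(tree):
    """Count temporal orderings of the coalescences of a rooted binary tree.

    `tree` is a nested 2-tuple whose leaves are labels (strings),
    for example (("a", "b"), ("c", "d")).
    """
    def walk(node):
        # Returns (internal nodes in this subtree,
        #          product of subtree sizes over its internal nodes).
        if not isinstance(node, tuple):
            return 0, 1
        left_size, left_prod = walk(node[0])
        right_size, right_prod = walk(node[1])
        size = left_size + right_size + 1
        return size, left_prod * right_prod * size

    n_internal, product = walk(tree)
    return factorial(n_internal) // product


caterpillar = ((("a", "b"), "c"), "d")   # one possible ordering only
balanced = (("a", "b"), ("c", "d"))      # either cherry may coalesce first
```

For the four-leaf caterpillar the count is 1, and for the balanced four-leaf tree it is 2, since the two cherries can coalesce in either order.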

4.
Am J Bot ; 110(8): e16219, 2023 08.
Article in English | MEDLINE | ID: mdl-37561649
5.
J Eukaryot Microbiol ; 70(5): e12987, 2023.
Article in English | MEDLINE | ID: mdl-37282792

ABSTRACT

Most Parabasalia are symbionts in the hindgut of "lower" (non-Termitidae) termites, where they widely vary in morphology and degree of morphological complexity. Large and complex cells in the class Cristamonadea evolved by replicating a fundamental unit, the karyomastigont, in various ways. We describe here four new species of Calonymphidae (Cristamonadea) from Rugitermes hosts, assigned to the genus Snyderella based on diagnostic features (including the karyomastigont pattern) and molecular phylogeny. We also report a new genus of Calonymphidae, Daimonympha, from Rugitermes laticollis. Daimonympha's morphology does not match that of any known Parabasalia, and its SSU rRNA gene sequence corroborates this distinction. Daimonympha does however share a puzzling feature with a few previously described, but distantly related, Cristamonadea: a rapid, smooth, and continuous rotation of the anterior end of the cell, including the many karyomastigont nuclei. The function of this rotatory movement, the cellular mechanisms enabling it, and the way the cell deals with the consequent cell membrane shear, are all unknown. "Rotating wheel" structures are famously rare in biology, with prokaryotic flagella being the main exception; these mysterious spinning cells found only among Parabasalia are another, far less understood, example.


Subjects
Isoptera, Parabasalidea, Animals, Phylogeny, South America
6.
J Comput Biol ; 30(2): 161-175, 2023 02.
Article in English | MEDLINE | ID: mdl-36251762

ABSTRACT

Estimating species trees from multiple genes is complicated and challenging due to gene tree-species tree discordance. One of the basic approaches to understanding differences between gene trees and species trees is gene duplication and loss events. Minimize Gene Duplication and Loss (MGDL) is a popular technique for inferring species trees from gene trees when the gene trees are discordant due to gene duplications and losses. Previously, exact algorithms for estimating species trees from rooted, binary trees under MGDL were proposed. However, gene trees are usually estimated using time-reversible mutation models, which result in unrooted trees. In this article, we propose a dynamic programming (DP) algorithm that can be used for an exact but exponential time solution for the case when gene trees are not rooted. We also show that a constrained version of this problem can be solved by this DP algorithm in time that is polynomial in the number of gene trees and taxa. We have proved important structural properties that allow us to extend the algorithms for rooted gene trees to unrooted gene trees. We propose a linear time algorithm for finding the optimal rooted version of an unrooted gene tree given a rooted species tree so that the duplication and loss cost is minimized. Moreover, we prove that the optimal rooting under MGDL is also optimal under the MDC (minimize deep coalescence) criterion. The proposed methods can be applied to both orthologous genes and gene families that by definition include both paralogs and orthologs. Therefore, we hope that these techniques will be useful for estimating species trees from genes sampled throughout the whole genome.
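
The rooted case that this abstract builds on can be sketched with the standard LCA mapping: each gene-tree node is mapped to the lowest common ancestor, in the species tree, of the species below it, and a node counts as a duplication when it maps to the same species-tree node as one of its children. The sketch below counts duplications only (losses are omitted) for a rooted binary gene tree; it is not the dynamic programming algorithm for unrooted trees proposed in the paper. The nested-tuple encoding and all function names are assumptions.

```python
def build_parent_map(species_tree):
    """Record the parent and depth of every node of the species tree."""
    parent, depth = {}, {}

    def walk(node, par, d):
        parent[node] = par
        depth[node] = d
        if isinstance(node, tuple):
            for child in node:
                walk(child, node, d + 1)

    walk(species_tree, None, 0)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of two species-tree nodes."""
    while depth[a] > depth[b]:
        a = parent[a]
    while depth[b] > depth[a]:
        b = parent[b]
    while a != b:
        a, b = parent[a], parent[b]
    return a


def count_duplications(gene_tree, species_tree):
    """Count duplication nodes of a rooted gene tree under the LCA mapping.

    Gene-tree leaves are labeled with the name of the species they come from.
    """
    parent, depth = build_parent_map(species_tree)
    duplications = 0

    def map_node(g):
        nonlocal duplications
        if not isinstance(g, tuple):
            return g  # a gene leaf maps to its species leaf
        left = map_node(g[0])
        right = map_node(g[1])
        mapped = lca(left, right, parent, depth)
        if mapped == left or mapped == right:
            duplications += 1
        return mapped

    map_node(gene_tree)
    return duplications


species = (("A", "B"), "C")
```

With this species tree, the matching gene tree `(("A", "B"), "C")` needs no duplication, while the discordant gene tree `(("A", "C"), "B")` needs one at its root.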


Subjects
Gene Duplication, Genetic Models, Phylogeny, Algorithms
7.
Methods Mol Biol ; 2443: 101-131, 2022.
Article in English | MEDLINE | ID: mdl-35037202

ABSTRACT

Gramene is an integrated bioinformatics resource for accessing, visualizing, and comparing plant genomes and biological pathways. Originally targeting grasses, Gramene has grown to host annotations for over 90 plant genomes including agronomically important cereals (e.g., maize, sorghum, wheat, teff), fruits and vegetables (e.g., apple, watermelon, clementine, tomato, cassava), specialty crops (e.g., coffee, olive tree, pistachio, almond), and plants of special or emerging interest (e.g., cotton, tobacco, cannabis, or hemp). For some species, the resource includes multiple varieties of the same species, which has paved the way for the creation of species-specific pan-genome browsers. The resource also features plant research models, including Arabidopsis and C4 warm-season grasses and brassicas, as well as other species that fill phylogenetic gaps for plant evolution studies. Its strength derives from the application of a phylogenetic framework for genome comparison and the use of ontologies to integrate structural and functional annotation data. This chapter outlines system requirements for end-users and database hosting, data types and basic navigation within Gramene, and provides examples of how to (1) explore Gramene's search results, (2) explore gene-centric comparative genomics data visualizations in Gramene, and (3) explore genetic variation associated with a gene locus. This is the first publication describing in detail Gramene's integrated search interface, intended to provide a simplified entry portal for the resource's main data categories (genomic location, phylogeny, gene expression, pathways, and external references) to the most complete and up-to-date set of plant genome and pathway annotations.


Subjects
Genetic Databases, Plant Genome, Agricultural Crops/genetics, Genomics/methods, Phylogeny
8.
Mol Phylogenet Evol ; 161: 107162, 2021 08.
Article in English | MEDLINE | ID: mdl-33831548

ABSTRACT

Species trees that can generate a nonmatching gene tree topology that is more probable than the topology matching the species tree are said to be in an anomaly zone. We introduce some heuristic approaches to infer whether species trees are in anomaly zones when it is difficult or impossible to compute the entire distribution of gene tree topologies. Here, probabilities of unrooted, unranked, and ranked gene tree topologies under the multispecies coalescent are used. A ranked tree can be viewed as an unranked tree with a temporal ordering of its internal nodes. Overall, considering probabilities of unrooted or unranked gene tree topologies within one nearest neighbor interchange from the species tree topology is a reasonable heuristic to infer the existence of anomalous unrooted or unranked gene trees, respectively. We investigated a test proposed by Linkem et al. (2016) which classifies a species tree as being in an unranked anomaly zone if there is a subset of four taxa in an unranked anomaly zone. We find this test to have high true positive rates, but it can also have high false positive rates. For ranked trees, because at least one of the most probable ranked gene tree topologies must have the same unranked topology as the species tree, we propose to use only those ranked gene trees that have topologies that match the unranked species tree topology. We find that the probability that the species tree is in unrooted and unranked anomaly zones tends to increase with the speciation rate, and the probability of all three types of anomaly zones increases rapidly with the number of taxa. We find that probabilities that species trees are in an anomaly zone can be quite high for moderately high speciation rates.


Subjects
Genetic Speciation, Heuristics, Genetic Models, Phylogeny, Cluster Analysis, Probability
9.
BMC Genomics ; 21(1): 497, 2020 Jul 20.
Article in English | MEDLINE | ID: mdl-32689946

ABSTRACT

BACKGROUND: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. RESULTS: We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. CONCLUSIONS: This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances .


Subjects
Biological Evolution, Genome, Algorithms, Base Sequence, Machine Learning, Phylogeny
10.
BMC Plant Biol ; 20(1): 340, 2020 Jul 17.
Article in English | MEDLINE | ID: mdl-32680458

ABSTRACT

BACKGROUND: Plastome-scale data have been prevalent in reconstructing the plant Tree of Life. However, phylogenomic studies currently based on plastomes rely primarily on maximum likelihood inference of concatenated alignments of plastid genes, and thus phylogenetic discordance produced by individual plastid genes has generally been ignored. Moreover, structural and functional characteristics of plastomes indicate that plastid genes may not evolve as a single locus and are experiencing different evolutionary forces, yet the genetic characteristics of plastid genes within a lineage remain poorly studied. RESULTS: We sequenced and annotated 10 plastome sequences of Gentianeae. Phylogenomic analyses yielded robust relationships among genera within Gentianeae. We detected great variation of gene tree topologies and revealed that more than half of the genes, including one (atpB) of the three widely used plastid markers (rbcL, atpB and matK) in phylogenetic inference of Gentianeae, are likely contributing to phylogenetic ambiguity of Gentianeae. Estimation of nucleotide substitution rates showed extensive rate heterogeneity among different plastid genes and among different functional groups of genes. Comparative analysis suggested that the ribosomal protein (RPL and RPS) genes and the RNA polymerase (RPO) genes have higher substitution rates and genetic variations among plastid genes in Gentianeae. Our study revealed that just one (matK) of the three (matK, ndhB and rbcL) widely used markers show high phylogenetic informativeness (PI) value. Due to the high PI and lowest gene-tree discordance, rpoC2 is advocated as a promising plastid DNA barcode for taxonomic studies of Gentianeae. Furthermore, our analyses revealed a positive correlation of evolutionary rates with genetic variation of plastid genes, but a negative correlation with gene-tree discordance under purifying selection. 
CONCLUSIONS: Overall, our results demonstrate the heterogeneity of nucleotide substitution rates and genetic characteristics among plastid genes providing new insights into plastome evolution, while highlighting the necessity of considering gene-tree discordance into phylogenomic studies based on plastome-scale data.


Subjects
Genetic Heterogeneity, Plastid Genomes/genetics, Gentianaceae/genetics, Plastids/genetics, Taxonomic DNA Barcoding, Molecular Evolution, Genetic Markers/genetics, Nucleotides/genetics, Phylogeny
11.
Mol Ecol Resour ; 20(3)2020 May.
Article in English | MEDLINE | ID: mdl-32073732

ABSTRACT

Multilocus genomic data sets can be used to infer a rich set of information about the evolutionary history of a lineage, including gene trees, species trees, and phylogenetic networks. However, user-friendly tools to run such integrated analyses are lacking, and workflows often require tedious reformatting and handling time to shepherd data through a series of individual programs. Here, we present a tool written in Python, TREEasy, that performs automated sequence alignment (with MAFFT), gene tree inference (with IQ-TREE), species tree inference from concatenated data (with IQ-TREE and RAxML-NG), species tree inference from gene trees (with ASTRAL, MP-EST, and STELLS2), and phylogenetic network inference (with SNaQ and PhyloNet). The tool only requires FASTA files and nine parameters as inputs. The tool can be run from the command line or through a Graphical User Interface (GUI). As examples, we reproduced a recent analysis of staghorn coral evolution, and performed a new analysis on the evolution of the "WGD clade" of yeast. The latter revealed novel patterns that were not identified by previous analyses. TREEasy represents a reliable and simple tool to accelerate research in systematic biology (https://github.com/MaoYafei/TREEasy).


Subjects
Laboratory Automation/methods, Computational Biology/methods, Genomics/methods, Sequence Alignment/methods, Algorithms, Phylogeny, Software, Workflow
12.
New Phytol ; 226(5): 1492-1505, 2020 06.
Article in English | MEDLINE | ID: mdl-31990988

ABSTRACT

● Cells are continuously exposed to chemical signals that they must discriminate between and respond to appropriately. In embryophytes, the leucine-rich repeat receptor-like kinases (LRR-RLKs) are signal receptors critical in development and defense. LRR-RLKs have diversified to hundreds of genes in many plant genomes. Although intensively studied, a well-resolved LRR-RLK gene tree has remained elusive.
● To resolve the LRR-RLK gene tree, we developed an improved gene discovery method based on iterative hidden Markov model searching and phylogenetic inference. We used this method to infer complete gene trees for each of the LRR-RLK subclades and reconstructed the deepest nodes of the full gene family.
● We discovered that the LRR-RLK gene family is even larger than previously thought, and that protein domain gains and losses are prevalent. These structural modifications, some of which likely predate embryophyte diversification, led to misclassification of some LRR-RLK variants as members of other gene families. Our work corrects this misclassification.
● Our results reveal ongoing structural evolution generating novel LRR-RLK genes. These new genes are raw material for the diversification of signaling in development and defense. Our methods also enable phylogenetic reconstruction in any large gene family.


Subjects
Molecular Evolution, Plant Genome, Phylogeny, Protein Domains
13.
BMC Evol Biol ; 19(1): 203, 2019 11 06.
Article in English | MEDLINE | ID: mdl-31694538

ABSTRACT

BACKGROUND: The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. It is widely appreciated that automated alignment protocols can yield inaccuracies, but the relative impact of various sources of error on phylogenomic analysis is not yet known. This study employs an updated mammal data set of 5162 coding loci sampled from 90 species to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation. Additionally, a novel coalescent likelihood ratio test is introduced for comparing competing species trees against a given set of gene trees. RESULTS: The aligned DNA sequences of 5162 loci from 90 species were trimmed and filtered using trimAl and two filtering protocols. The final dataset contains 4 sets of alignments: before trimming, after trimming, filtered by a recently proposed pipeline, and further filtered by comparing ML gene trees for each locus with the concatenation tree. Our analyses suggest that the average discordance among the coalescent trees is significantly smaller than that among the concatenation trees estimated from the 4 sets of alignments or with different substitution models. There is no significant difference among the divergence times estimated with different substitution models. However, the divergence dates estimated from the alignments after trimming are more recent than those estimated from the alignments before trimming. CONCLUSIONS: Our results highlight that alignment uncertainty of the updated mammal data set and the choice of substitution models have little impact on tree topologies yielded by coalescent methods for species tree estimation, whereas they are more influential on the trees made by concatenation.
Given the choice of calibration scheme and clock models, divergence time estimates are robust to the choice of substitution models, but removing alignments deemed problematic by trimming algorithms can lead to more recent dates. Although the fossil prior is important in divergence time estimation, Bayesian estimates of divergence times in this data set are driven primarily by the sequence data.


Subjects
Mammals/classification, Mammals/genetics, Phylogeny, Algorithms, Animals, Bayes Theorem, Computer Simulation, Fossils, Genome, Genetic Models, Uncertainty
14.
Am J Bot ; 106(9): 1219-1228, 2019 09.
Article in English | MEDLINE | ID: mdl-31535720

ABSTRACT

PREMISE: Although hybridization has played an important role in the evolution of many plant species, phylogenetic reconstructions that include hybridizing lineages have been historically constrained by the available models and data. Restriction-site-associated DNA sequencing (RADseq) has been a popular sequencing technique for the reconstruction of hybridization in the next-generation sequencing era. However, the utility of RADseq for the reconstruction of complex evolutionary networks has not been thoroughly investigated. Conflicting phylogenetic relationships in the genus Medicago have been mainly attributed to hybridization, but the specific hybrid origins of taxa have not yet been clarified. METHODS: We obtained new molecular data from diploid species of Medicago section Medicago using single-digest RADseq to reconstruct evolutionary networks from gene trees, an approach that is computationally tractable with data sets that include several species and complex hybridization patterns. RESULTS: Our analyses revealed that assembly filters to exclusively select a small set of loci with high phylogenetic information led to the most-divergent network topologies. Conversely, alternative clustering thresholds or filters on the number of samples per locus had a lower impact on networks. A strong hybridization signal was detected for M. carstiensis and M. cretacea, while signals were less clear for M. rugosa, M. rhodopea, M. suffruticosa, M. marina, M. scutellata, and M. sativa. CONCLUSIONS: Complex network reconstructions from RADseq gene trees were not robust under variations of the assembly parameters and filters. However, when the most-divergent networks were discarded, all remaining analyses consistently supported a hybrid origin for M. carstiensis and M. cretacea.


Subjects
High-Throughput Nucleotide Sequencing, Medicago, Base Sequence, Phylogeny, DNA Sequence Analysis
15.
Am J Bot ; 106(5): 679-689, 2019 05.
Article in English | MEDLINE | ID: mdl-31081928

ABSTRACT

PREMISE: Parasitic plants with large geographic ranges, and different hosts in parts of their range, may acquire horizontally transferred genes (HGTs), which might sometimes leave a footprint of gradual host and range expansion. Cynomorium coccineum, the only member of the Saxifragales family Cynomoriaceae, is a root holoparasite that occurs in water-stressed habitats from western China to the Canary Islands. It parasitizes at least 10 angiosperm families from different orders, some of them only in parts of its range. This parasite therefore offers an opportunity to trace HGTs as long as parasite-host pairs can be obtained and sequenced. METHODS: By sequencing mitochondrial, plastid, and nuclear loci from parasite-host pairs from throughout the parasite's range and with prior information from completely assembled mitochondrial and plastid genomes, we detected 10 HGTs of five mitochondrial genes. RESULTS: The 10 HGTs appear to have occurred sequentially as C. coccineum expanded from East to West. Molecular-clock models yield Cynomorium stem ages between 66 and 156 Myr, with relaxed clocks converging on 66-67 Myr. Chinese Sapindales, probably Nitraria, were the first source of transferred genes, followed by Iranian and Mediterranean Caryophyllales. The most recently acquired gene appears to come from a Tamarix host in the Iberian Peninsula. CONCLUSIONS: Data on HGTs that have accumulated over the past 15 years, along with this discovery of multiple HGTs within a single widespread species, underline the need for more whole-genome data from parasite-host pairs to investigate whether and how transferred copies coexist with, or replace, native functional genes.


Subjects
Cynomorium/genetics, Horizontal Gene Transfer, Plant Genes, Mitochondrial Genome, Plastid Genomes, Plant Dispersal/genetics, Mitochondrial Genes, Italy
16.
Mol Phylogenet Evol ; 138: 219-232, 2019 09.
Article in English | MEDLINE | ID: mdl-31146023

ABSTRACT

The current classification of angiosperms is based primarily on concatenated plastid markers and maximum likelihood (ML) inference. This approach has been justified by the assumption that plastid DNA (ptDNA) is inherited as a single locus and that its individual genes produce congruent trees. However, structural and functional characteristics of ptDNA suggest that plastid genes may not evolve as a single locus and are experiencing different evolutionary forces. To examine this idea, we produced new complete plastid genome (plastome) sequences of 27 species and combined these data with publicly available sequences to produce a final dataset that includes 78 plastid genes for 89 species of rosids and five outgroups. We used four data matrices (i.e., gene, exon, codon-aligned, and amino acid) to infer species and gene trees using ML and multispecies coalescent (MSC) methods. Rosids include about one third of all angiosperms and their two major clades, fabids and malvids, were recovered in almost all analyses. However, we detected incongruence between species trees inferred with different matrices and methods and previously published plastid and nuclear phylogenies. We visualized and tested the significance of incongruence between gene trees and species trees. We then measured the distribution of phylogenetic signal across sites and genes supporting alternative placements of five controversial nodes at different taxonomic levels. Gene trees inferred with plastid data often disagree with species trees inferred using both ML (with unpartitioned or partitioned data) and MSC. Species trees inferred with both methods produced alternative topologies for a few taxa. Our results show that, in a phylogenetic context, plastid protein-coding genes may not be fully linked and behaving as a single locus. Furthermore, concatenated matrices may produce highly supported phylogenies that are discordant with individual gene trees. 
We also show that phylogenies inferred with MSC are accurate. We therefore emphasize the importance of considering variation in phylogenetic signal across plastid genes and the exploration of plastome data to increase accuracy of estimating relationships. We also support the use of MSC with plastome matrices in future phylogenomic investigations.


Subjects
Plant Genes, Phylogeny, Plastids/genetics, Base Sequence, Consensus Sequence/genetics, Plastid Genomes, Likelihood Functions, Magnoliopsida/genetics, Principal Component Analysis, Species Specificity
17.
Algorithms Mol Biol ; 14: 7, 2019.
Article in English | MEDLINE | ID: mdl-30930955

ABSTRACT

Reconciling gene trees with a species tree is a fundamental problem to understand the evolution of gene families. Many existing approaches reconcile each gene tree independently. However, it is well-known that the evolution of gene families is interconnected. In this paper, we extend a previous approach to reconcile a set of gene trees with a species tree based on segmental macro-evolutionary events, where segmental duplication events and losses are associated with costs δ and λ, respectively. We show that the problem is polynomial-time solvable when δ ≤ λ (via LCA-mapping), while if δ > λ the problem is NP-hard, even when λ = 0 and a single gene tree is given, solving a long-standing open problem on the complexity of multi-gene reconciliation. On the positive side, we give a fixed-parameter algorithm for the problem, where the parameters are δ/λ and the number d of segmental duplications, of time complexity $O(\lceil \delta/\lambda \rceil^{d} \cdot n \cdot \delta/\lambda)$. Finally, we demonstrate the usefulness of this algorithm on two previously studied real datasets: we first show that our method can be used to confirm or raise doubt on hypothetical segmental duplications on a set of 16 eukaryotes, then show how we can detect whole genome duplications in yeast genomes.

18.
Theor Popul Biol ; 129: 133-147, 2019 10.
Article in English | MEDLINE | ID: mdl-29729946

ABSTRACT

Reciprocal monophyly, a feature of a genealogy in which multiple groups of descendant lineages each consist of all of the descendants of their respective most recent common ancestors, has been an important concept in studies of species delimitation, phylogeography, population history reconstruction, systematics, and conservation. Computations involving the probability that reciprocal monophyly is observed in a genealogy have played a key role in criteria for defining taxonomic groups and inferring divergence times. The probability of reciprocal monophyly under a coalescent model of population divergence has been studied in detail for groups of gene lineages for pairs of species. Here, we extend this computation to generate corresponding probabilities for sets of gene lineages from three and four species. We study the effects of model parameters on the probability of reciprocal monophyly, finding that it is driven primarily by species tree height, with lesser but still substantial influences of internal branch lengths and sample sizes. We also provide an example application of our results to data from maize and teosinte.
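
The definition of reciprocal monophyly given in this abstract can be checked directly on a single genealogy: every group must coincide with the full set of leaves below some node. The sketch below does only that check; it does not compute the coalescent probabilities studied in the paper. The nested-tuple encoding, the lineage labels, and the function names are assumptions for illustration.

```python
def clades(tree):
    """Return the leaf sets (as frozensets) of all subtrees, leaves included."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def reciprocally_monophyletic(tree, groups):
    """True if every group of lineages forms a complete clade of the tree."""
    all_clades = clades(tree)
    return all(frozenset(group) in all_clades for group in groups)


# Hypothetical genealogies with two sampled lineages per species.
sorted_genealogy = ((("a1", "a2"), ("b1", "b2")), ("c1", "c2"))
mixed_genealogy = (("a1", "b1"), ("a2", "b2"))
```

In `sorted_genealogy` the three species groups are reciprocally monophyletic, whereas in `mixed_genealogy` neither the `a` lineages nor the `b` lineages form a clade.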


Subjects
Genetic Models, Phylogeny, Trees/genetics, Probability
19.
Bull Math Biol ; 81(2): 384-407, 2019 02.
Article in English | MEDLINE | ID: mdl-28913585

ABSTRACT

An ancestral configuration is one of the combinatorially distinct sets of gene lineages that, for a given gene tree, can reach a given node of a specified species tree. Ancestral configurations have appeared in recursive algebraic computations of the conditional probability that a gene tree topology is produced under the multispecies coalescent model for a given species tree. For matching gene trees and species trees, we study the number of ancestral configurations, considered up to an equivalence relation introduced by Wu (Evolution 66:763-775, 2012) to reduce the complexity of the recursive probability computation. We examine the largest number of non-equivalent ancestral configurations possible for a given tree size n. Whereas the smallest number of non-equivalent ancestral configurations increases polynomially with n, we show that the largest number increases with [Formula: see text], where k is a constant that satisfies [Formula: see text]. Under a uniform distribution on the set of binary labeled trees with a given size n, the mean number of non-equivalent ancestral configurations grows exponentially with n. The results refine an earlier analysis of the number of ancestral configurations considered without applying the equivalence relation, showing that use of the equivalence relation does not alter the exponential nature of the increase with tree size.


Subjects
Genetic Models, Phylogeny, Algorithms, Computational Biology, Molecular Evolution, Genetic Speciation, Mathematical Concepts, Statistical Models, Probability
20.
J Math Biol ; 78(1-2): 155-188, 2019 01.
Article in English | MEDLINE | ID: mdl-30116881

ABSTRACT

Compact coalescent histories are combinatorial structures that describe for a given gene tree G and species tree S possibilities for the numbers of coalescences of G that take place on the various branches of S. They have been introduced as a data structure for evaluating probabilities of gene tree topologies conditioning on species trees, reducing computation time compared to standard coalescent histories. When gene trees and species trees have a matching labeled topology [Formula: see text], the compact coalescent histories of t are encoded by particular integer labelings of the branches of t, each integer specifying the number of coalescent events of G present in a branch of S. For matching gene trees and species trees, we investigate enumerative properties of compact coalescent histories. We report a recursion for the number of compact coalescent histories for matching gene trees and species trees, using it to study the numbers of compact coalescent histories for small trees. We show that the number of compact coalescent histories equals the number of coalescent histories if and only if the labeled topology is a caterpillar or a bicaterpillar. The number of compact coalescent histories is seen to increase with tree imbalance: we prove that as the number of taxa n increases, the exponential growth of the number of compact coalescent histories follows [Formula: see text] in the case of caterpillar or bicaterpillar labeled topologies and approximately [Formula: see text] and [Formula: see text] for lodgepole and balanced topologies, respectively. We prove that the mean number of compact coalescent histories of a labeled topology of size n selected uniformly at random grows with [Formula: see text]. Our results contribute to the analysis of the computational complexity of algorithms for computing gene tree probabilities, and to the combinatorial study of gene trees and species trees more generally.
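
For comparison with the compact histories discussed in this abstract, the sketch below counts standard (non-compact) coalescent histories for a matching gene tree and species tree, using a well-known recursion: if a node's coalescence may happen on any of m branches at or above it, the count is the sum over k = 1..m of the product of the counts for its two children with k + 1 available branches. This is not the recursion for compact coalescent histories reported in the paper, and the nested-tuple encoding and function name are assumptions.

```python
from functools import lru_cache


def count_coalescent_histories(tree):
    """Count coalescent histories when gene tree and species tree both equal `tree`.

    `tree` is a nested 2-tuple with string leaf labels.
    """
    @lru_cache(maxsize=None)
    def histories(node, m):
        # m: number of species-tree branches, from the one directly above
        # `node` up to the root branch, on which node's coalescence may occur.
        if not isinstance(node, tuple):
            return 1
        left, right = node
        return sum(
            histories(left, k + 1) * histories(right, k + 1)
            for k in range(1, m + 1)
        )

    # The root coalescence can only take place on the branch above the root.
    return histories(tree, 1)


caterpillar4 = ((("a", "b"), "c"), "d")
balanced4 = (("a", "b"), ("c", "d"))
```

For caterpillars the counts produced this way are the Catalan numbers (2, 5, 14 for 3, 4, 5 leaves), while the balanced four-leaf tree gives 4.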


Subjects
Genetic Speciation, Genetic Models, Phylogeny, Algorithms, Computational Biology, Molecular Evolution, Population Genetics/statistics & numerical data, Mathematical Concepts, Probability