ABSTRACT
Classification of gene trees is an important task both in the analysis of multi-locus phylogenetic data and in the assessment of the convergence of Markov Chain Monte Carlo (MCMC) analyses used in Bayesian phylogenetic tree reconstruction. The logistic regression model is one of the most popular classification models in statistical learning, thanks to its computational speed and interpretability. However, it is not appropriate to apply the standard logistic regression model directly to a set of phylogenetic trees, because the space of phylogenetic trees is non-Euclidean and thus violates the standard assumptions on covariates. It is well known in tropical geometry and phylogenetics that the space of phylogenetic trees is a tropical linear space in terms of the max-plus algebra. Therefore, in this paper, we propose an analogue of the logistic regression model in the setting of tropical geometry. Our proposed method outperforms classical logistic regression in terms of area under the ROC curve in numerical examples, including with data generated by the multi-species coalescent model. Theoretical properties such as statistical consistency have been proved and generalization error rates have been derived. Finally, our classification algorithm is proposed as an MCMC convergence criterion for MrBayes. Unlike the convergence metric used by MrBayes, which depends only on tree topologies, our method is sensitive to branch lengths and therefore provides a more robust metric for convergence. In a test case, tropical logistic regression differentiates between two independently run MCMC chains even when the standard metric cannot.
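As a small illustration of why tropical geometry treats tree vectors differently from Euclidean covariates, the sketch below computes the tropical (generalized Hilbert projective) distance, max(u - v) - min(u - v), between two vectors of pairwise leaf distances. The function name and the example vectors are hypothetical, and this is only the underlying metric, not the authors' tropical logistic regression model.

```python
def tropical_distance(u, v):
    """Tropical distance between two equal-length vectors: max(u - v) - min(u - v)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


# Hypothetical pairwise-distance vectors for two 4-leaf trees
# (entries ordered AB, AC, AD, BC, BD, CD).
u = [2.0, 4.0, 4.0, 4.0, 4.0, 2.0]
v = [4.0, 2.0, 4.0, 4.0, 2.0, 4.0]

d = tropical_distance(u, v)
# Adding the same constant to every coordinate of u leaves the distance unchanged,
# which is the projective invariance that Euclidean distance lacks.
d_shifted = tropical_distance([x + 10.0 for x in u], v)
```

Because the distance ignores a common additive shift, it is defined on the tropical projective torus rather than on ordinary Euclidean space.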
Subject(s)
Algorithms, Bayes Theorem, Markov Chains, Mathematical Concepts, Genetic Models, Monte Carlo Method, Phylogeny, Logistic Models, ROC Curve, Computer Simulation
ABSTRACT
To a given gene tree topology G and species tree topology S with leaves labeled bijectively from a fixed set X, one can associate a set of ancestral configurations, each of which encodes a set of gene lineages that can be found at a given node of the species tree. We introduce a lattice structure on ancestral configurations, studying the directed graphs that provide graphical representations of lattices of ancestral configurations. For a matching gene tree topology and species tree topology G=S, we present a method for defining the digraph of ancestral configurations from the tree topology by using iterated cartesian products of graphs. We show that a specific set of paths on the digraph of ancestral configurations is in bijection with the set of labeled histories - a well-known phylogenetic object that enumerates possible temporal orderings of the coalescences of a tree. For each of a series of tree families, we obtain closed-form expressions for the number of labeled histories by using this bijection to count paths on associated digraphs. Finally, we prove that our lattice construction extends to nonmatching tree pairs, and we use it to characterize pairs (G,S) having the maximal number of ancestral configurations for a fixed G. We discuss how the construction provides new methods for performing enumerations of combinatorial aspects of gene and species trees.
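To make the counted object concrete, the sketch below counts labeled histories directly with the standard closed formula for a rooted binary tree: the factorial of the number of internal nodes divided by the product, over internal nodes, of the number of internal nodes in each node's subtree. It does not use the lattice or digraph construction of the paper. The nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
from math import factorial


def _internal_count_and_product(tree):
    """Return (number of internal nodes, product of internal-subtree sizes)."""
    if not isinstance(tree, tuple):  # a leaf is any non-tuple label
        return 0, 1
    left, right = tree
    n_left, p_left = _internal_count_and_product(left)
    n_right, p_right = _internal_count_and_product(right)
    n_here = n_left + n_right + 1
    return n_here, p_left * p_right * n_here


def labeled_histories(tree):
    """Number of temporal orderings of the coalescences of a rooted binary tree."""
    n_internal, product = _internal_count_and_product(tree)
    return factorial(n_internal) // product


caterpillar = ((("A", "B"), "C"), "D")
balanced = (("A", "B"), ("C", "D"))
five_leaf = (("A", "B"), (("C", "D"), "E"))
```

A caterpillar has a single labeled history because its coalescences are totally ordered, whereas the balanced four-leaf tree has two, since either cherry can coalesce first.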
ABSTRACT
Most Parabasalia are symbionts in the hindgut of "lower" (non-Termitidae) termites, where they widely vary in morphology and degree of morphological complexity. Large and complex cells in the class Cristamonadea evolved by replicating a fundamental unit, the karyomastigont, in various ways. We describe here four new species of Calonymphidae (Cristamonadea) from Rugitermes hosts, assigned to the genus Snyderella based on diagnostic features (including the karyomastigont pattern) and molecular phylogeny. We also report a new genus of Calonymphidae, Daimonympha, from Rugitermes laticollis. Daimonympha's morphology does not match that of any known Parabasalia, and its SSU rRNA gene sequence corroborates this distinction. Daimonympha does however share a puzzling feature with a few previously described, but distantly related, Cristamonadea: a rapid, smooth, and continuous rotation of the anterior end of the cell, including the many karyomastigont nuclei. The function of this rotatory movement, the cellular mechanisms enabling it, and the way the cell deals with the consequent cell membrane shear, are all unknown. "Rotating wheel" structures are famously rare in biology, with prokaryotic flagella being the main exception; these mysterious spinning cells found only among Parabasalia are another, far less understood, example.
Subject(s)
Isoptera, Parabasalidea, Animals, Phylogeny, South America
ABSTRACT
Species trees that can generate a nonmatching gene tree topology that is more probable than the topology matching the species tree are said to be in an anomaly zone. We introduce some heuristic approaches to infer whether species trees are in anomaly zones when it is difficult or impossible to compute the entire distribution of gene tree topologies. Here, probabilities of unrooted, unranked, and ranked gene tree topologies under the multispecies coalescent are used. A ranked tree can be viewed as an unranked tree with a temporal ordering of its internal nodes. Overall, considering probabilities of unrooted or unranked gene tree topologies within one nearest neighbor interchange from the species tree topology is a reasonable heuristic to infer the existence of anomalous unrooted or unranked gene trees, respectively. We investigated a test proposed by Linkem et al. (2016) which classifies a species tree as being in an unranked anomaly zone if there is a subset of four taxa in an unranked anomaly zone. We find this test to have high true positive rates, but it can also have high false positive rates. For ranked trees, because at least one of the most probable ranked gene tree topologies must have the same unranked topology as the species tree, we propose to use only those ranked gene trees that have topologies that match the unranked species tree topology. We find that the probability that the species tree is in unrooted and unranked anomaly zones tends to increase with the speciation rate, and the probability of all three types of anomaly zones increases rapidly with the number of taxa. We find that probabilities that species trees are in an anomaly zone can be quite high for moderately high speciation rates.
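The four-taxon test discussed above can be sketched as follows. The bound a(x) is the four-taxon unranked anomaly-zone limit as given in the earlier anomaly-zone literature (Degnan and Rosenberg, and used by Linkem et al.); it is not stated in this abstract, so treat the formula and the function names as assumptions to verify against those papers. Here x is the deeper internal branch and y the shallower internal branch of an asymmetric four-taxon species tree, both in coalescent units.

```python
import math


def anomaly_zone_bound(x):
    """Assumed four-taxon bound a(x); requires x > 0 (coalescent units)."""
    if x <= 0:
        raise ValueError("x must be positive")
    numerator = 3 * math.exp(2 * x) - 2
    denominator = 18 * (math.exp(3 * x) - math.exp(2 * x))
    return math.log(2 / 3 + numerator / denominator)


def in_four_taxon_anomaly_zone(x, y):
    """True if internal branch lengths (x, y) fall in the unranked anomaly zone."""
    return y < anomaly_zone_bound(x)
```

For example, `in_four_taxon_anomaly_zone(0.1, 0.1)` is true under this bound, while any pair with x of about 0.27 or more is outside the zone because a(x) is then negative.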
Subject(s)
Genetic Speciation, Heuristics, Genetic Models, Phylogeny, Cluster Analysis, Probability
ABSTRACT
BACKGROUND: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. RESULTS: We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. CONCLUSIONS: This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances .
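The general idea of imputing a distance matrix by matrix factorization can be sketched in a few lines. This is a generic low-rank factorization fitted by stochastic gradient descent on the observed entries, not the authors' implementation (which is in the repository linked above). The function name, rank, learning rate, regularization, and example matrix are all hypothetical choices.

```python
import random


def impute_distances(D, rank=2, steps=5000, lr=0.01, reg=0.01, seed=0):
    """Fill None entries of a symmetric distance matrix using D ~ P Q^T.

    Assumes missing pairs are marked None at both (i, j) and (j, i).
    """
    n = len(D)
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    observed = [(i, j) for i in range(n) for j in range(n)
                if i != j and D[i][j] is not None]
    for _ in range(steps):
        i, j = rng.choice(observed)
        pred = sum(P[i][k] * Q[j][k] for k in range(rank))
        err = D[i][j] - pred
        for k in range(rank):
            p, q = P[i][k], Q[j][k]
            # Gradient step on squared error with L2 regularization.
            P[i][k] += lr * (err * q - reg * p)
            Q[j][k] += lr * (err * p - reg * q)
    filled = [row[:] for row in D]
    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] is None:
                a = sum(P[i][k] * Q[j][k] for k in range(rank))
                b = sum(P[j][k] * Q[i][k] for k in range(rank))
                # Average both directions to keep symmetry; clamp at zero.
                filled[i][j] = filled[j][i] = max(0.0, (a + b) / 2)
    return filled


# Hypothetical 4-taxon matrix with one missing pair (taxa 0 and 3).
D = [
    [0.0, 0.2, 0.6, None],
    [0.2, 0.0, 0.6, 0.7],
    [0.6, 0.6, 0.0, 0.3],
    [None, 0.7, 0.3, 0.0],
]
filled = impute_distances(D)
```

The completed matrix can then be passed to a distance-based method such as NJ or UPGMA, which is the use case motivating imputation here.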
Subject(s)
Biological Evolution, Genome, Algorithms, Base Sequence, Machine Learning, Phylogeny
ABSTRACT
BACKGROUND: Plastome-scale data have been prevalent in reconstructing the plant Tree of Life. However, phylogenomic studies currently based on plastomes rely primarily on maximum likelihood inference of concatenated alignments of plastid genes, and thus phylogenetic discordance produced by individual plastid genes has generally been ignored. Moreover, structural and functional characteristics of plastomes indicate that plastid genes may not evolve as a single locus and are experiencing different evolutionary forces, yet the genetic characteristics of plastid genes within a lineage remain poorly studied. RESULTS: We sequenced and annotated 10 plastome sequences of Gentianeae. Phylogenomic analyses yielded robust relationships among genera within Gentianeae. We detected great variation of gene tree topologies and revealed that more than half of the genes, including one (atpB) of the three widely used plastid markers (rbcL, atpB and matK) in phylogenetic inference of Gentianeae, are likely contributing to phylogenetic ambiguity of Gentianeae. Estimation of nucleotide substitution rates showed extensive rate heterogeneity among different plastid genes and among different functional groups of genes. Comparative analysis suggested that the ribosomal protein (RPL and RPS) genes and the RNA polymerase (RPO) genes have higher substitution rates and genetic variations among plastid genes in Gentianeae. Our study revealed that just one (matK) of the three (matK, ndhB and rbcL) widely used markers shows a high phylogenetic informativeness (PI) value. Because of its high PI and lowest gene-tree discordance, rpoC2 is advocated as a promising plastid DNA barcode for taxonomic studies of Gentianeae. Furthermore, our analyses revealed a positive correlation of evolutionary rates with genetic variation of plastid genes, but a negative correlation with gene-tree discordance under purifying selection.
CONCLUSIONS: Overall, our results demonstrate the heterogeneity of nucleotide substitution rates and genetic characteristics among plastid genes, providing new insights into plastome evolution and highlighting the necessity of accounting for gene-tree discordance in phylogenomic studies based on plastome-scale data.
Subject(s)
Genetic Heterogeneity, Plastid Genomes/genetics, Gentianaceae/genetics, Plastids/genetics, Taxonomic DNA Barcoding, Molecular Evolution, Genetic Markers/genetics, Nucleotides/genetics, Phylogeny
ABSTRACT
Cells are continuously exposed to chemical signals that they must discriminate between and respond to appropriately. In embryophytes, the leucine-rich repeat receptor-like kinases (LRR-RLKs) are signal receptors critical in development and defense. LRR-RLKs have diversified to hundreds of genes in many plant genomes. Although intensively studied, a well-resolved LRR-RLK gene tree has remained elusive. To resolve the LRR-RLK gene tree, we developed an improved gene discovery method based on iterative hidden Markov model searching and phylogenetic inference. We used this method to infer complete gene trees for each of the LRR-RLK subclades and reconstructed the deepest nodes of the full gene family. We discovered that the LRR-RLK gene family is even larger than previously thought, and that protein domain gains and losses are prevalent. These structural modifications, some of which likely predate embryophyte diversification, led to misclassification of some LRR-RLK variants as members of other gene families. Our work corrects this misclassification. Our results reveal ongoing structural evolution generating novel LRR-RLK genes. These new genes are raw material for the diversification of signaling in development and defense. Our methods also enable phylogenetic reconstruction in any large gene family.
Subject(s)
Molecular Evolution, Plant Genome, Phylogeny, Protein Domains
ABSTRACT
BACKGROUND: The flood of genomic data to help build and date the tree of life requires automation at several critical junctures, most importantly during sequence assembly and alignment. It is widely appreciated that automated alignment protocols can yield inaccuracies, but the relative impact of various sources of error on phylogenomic analysis is not yet known. This study employs an updated mammal data set of 5162 coding loci sampled from 90 species to evaluate the effects of alignment uncertainty, substitution models, and fossil priors on gene tree, species tree, and divergence time estimation. Additionally, a novel coalescent likelihood ratio test is introduced for comparing competing species trees against a given set of gene trees. RESULTS: The aligned DNA sequences of 5162 loci from 90 species were trimmed and filtered using trimAL and two filtering protocols. The final dataset contains 4 sets of alignments - before trimming, after trimming, filtered by a recently proposed pipeline, and further filtered by comparing ML gene trees for each locus with the concatenation tree. Our analyses suggest that the average discordance among the coalescent trees is significantly smaller than that among the concatenation trees estimated from the 4 sets of alignments or with different substitution models. There is no significant difference among the divergence times estimated with different substitution models. However, the divergence dates estimated from the alignments after trimming are more recent than those estimated from the alignments before trimming. CONCLUSIONS: Our results highlight that alignment uncertainty of the updated mammal data set and the choice of substitution models have little impact on tree topologies yielded by coalescent methods for species tree estimation, whereas they are more influential on the trees made by concatenation.
Given the choice of calibration scheme and clock models, divergence time estimates are robust to the choice of substitution models, but removing alignments deemed problematic by trimming algorithms can lead to more recent dates. Although the fossil prior is important in divergence time estimation, Bayesian estimates of divergence times in this data set are driven primarily by the sequence data.
Subject(s)
Mammals/classification, Mammals/genetics, Phylogeny, Algorithms, Animals, Bayes Theorem, Computer Simulation, Fossils, Genome, Genetic Models, Uncertainty
ABSTRACT
Genome-wide phylogeny reconstruction is becoming increasingly common, and one driving factor behind these phylogenomic studies is the promise that the potential discordance between gene trees and the species tree can be modeled. Incomplete lineage sorting is one cause of discordance that bridges population genetic and phylogenetic processes. ASTRAL is a species tree reconstruction method that seeks to find the tree with minimum quartet distance to an input set of inferred gene trees. However, the published ASTRAL algorithm only works with one sample per species. To account for polymorphisms in present-day species, one can sample multiple individuals per species to create multi-allele datasets. Here, we introduce how ASTRAL can handle multi-allele datasets. We show that the quartet-based optimization problem extends naturally, and we introduce heuristic methods for building the search space specifically for the case of multi-individual datasets. We study the accuracy and scalability of the multi-individual version of ASTRAL-III using extensive simulation studies and compare it to NJst, the only other scalable method that can handle these datasets. We do not find strong evidence that using multiple individuals dramatically improves accuracy. When we study the trade-off between sampling more genes versus more individuals, we find that sampling more genes is more effective than sampling more individuals, even under conditions that we study where trees are shallow (median length: ≈1Ne) and ILS is extremely high.
Subject(s)
Alleles, Genomics/methods, Phylogeny, Algorithms, Computer Simulation, Genetic Databases, Species Specificity
ABSTRACT
The current classification of angiosperms is based primarily on concatenated plastid markers and maximum likelihood (ML) inference. This approach has been justified by the assumption that plastid DNA (ptDNA) is inherited as a single locus and that its individual genes produce congruent trees. However, structural and functional characteristics of ptDNA suggest that plastid genes may not evolve as a single locus and are experiencing different evolutionary forces. To examine this idea, we produced new complete plastid genome (plastome) sequences of 27 species and combined these data with publicly available sequences to produce a final dataset that includes 78 plastid genes for 89 species of rosids and five outgroups. We used four data matrices (i.e., gene, exon, codon-aligned, and amino acid) to infer species and gene trees using ML and multispecies coalescent (MSC) methods. Rosids include about one third of all angiosperms and their two major clades, fabids and malvids, were recovered in almost all analyses. However, we detected incongruence between species trees inferred with different matrices and methods and previously published plastid and nuclear phylogenies. We visualized and tested the significance of incongruence between gene trees and species trees. We then measured the distribution of phylogenetic signal across sites and genes supporting alternative placements of five controversial nodes at different taxonomic levels. Gene trees inferred with plastid data often disagree with species trees inferred using both ML (with unpartitioned or partitioned data) and MSC. Species trees inferred with both methods produced alternative topologies for a few taxa. Our results show that, in a phylogenetic context, plastid protein-coding genes may not be fully linked or behave as a single locus. Furthermore, concatenated matrices may produce highly supported phylogenies that are discordant with individual gene trees.
We also show that phylogenies inferred with MSC are accurate. We therefore emphasize the importance of considering variation in phylogenetic signal across plastid genes and the exploration of plastome data to increase accuracy of estimating relationships. We also support the use of MSC with plastome matrices in future phylogenomic investigations.
Subject(s)
Plant Genes, Phylogeny, Plastids/genetics, Base Sequence, Consensus Sequence/genetics, Plastid Genomes, Likelihood Functions, Magnoliopsida/genetics, Principal Component Analysis, Species Specificity
ABSTRACT
Reciprocal monophyly, a feature of a genealogy in which multiple groups of descendant lineages each consist of all of the descendants of their respective most recent common ancestors, has been an important concept in studies of species delimitation, phylogeography, population history reconstruction, systematics, and conservation. Computations involving the probability that reciprocal monophyly is observed in a genealogy have played a key role in criteria for defining taxonomic groups and inferring divergence times. The probability of reciprocal monophyly under a coalescent model of population divergence has been studied in detail for groups of gene lineages for pairs of species. Here, we extend this computation to generate corresponding probabilities for sets of gene lineages from three and four species. We study the effects of model parameters on the probability of reciprocal monophyly, finding that it is driven primarily by species tree height, with lesser but still substantial influences of internal branch lengths and sample sizes. We also provide an example application of our results to data from maize and teosinte.
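The definition of reciprocal monophyly given at the start of this abstract translates directly into a check on a single genealogy: every group must coincide exactly with the leaf set of some subtree. The sketch below tests that condition for a tree written as nested tuples. It does not compute the coalescent probabilities derived in the paper, and the tree encoding, function names, and lineage labels are assumptions made for illustration.

```python
def leaf_set(tree):
    """Set of leaf labels under a node; a leaf is any non-tuple label."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of all subtrees, including single leaves and the whole tree."""
    result = {leaf_set(tree)}
    if isinstance(tree, tuple):
        for child in tree:
            result |= clades(child)
    return result


def reciprocally_monophyletic(tree, groups):
    """True if every group equals the full set of descendants of its MRCA."""
    found = clades(tree)
    return all(frozenset(group) in found for group in groups)


groups = [{"a1", "a2", "a3"}, {"b1", "b2"}]
sorted_tree = ((("a1", "a2"), "a3"), ("b1", "b2"))
mixed_tree = ((("a1", "b1"), "a2"), ("a3", "b2"))
```

Here `sorted_tree` satisfies reciprocal monophyly for the two groups, while `mixed_tree` does not because lineages from the two groups are interleaved.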
Subject(s)
Genetic Models, Phylogeny, Trees/genetics, Probability
ABSTRACT
PREMISE: Parasitic plants with large geographic ranges, and different hosts in parts of their range, may acquire horizontally transferred genes (HGTs), which might sometimes leave a footprint of gradual host and range expansion. Cynomorium coccineum, the only member of the Saxifragales family Cynomoriaceae, is a root holoparasite that occurs in water-stressed habitats from western China to the Canary Islands. It parasitizes at least 10 angiosperm families from different orders, some of them only in parts of its range. This parasite therefore offers an opportunity to trace HGTs as long as parasite-host pairs can be obtained and sequenced. METHODS: By sequencing mitochondrial, plastid, and nuclear loci from parasite-host pairs from throughout the parasite's range and with prior information from completely assembled mitochondrial and plastid genomes, we detected 10 HGTs of five mitochondrial genes. RESULTS: The 10 HGTs appear to have occurred sequentially as C. coccineum expanded from East to West. Molecular-clock models yield Cynomorium stem ages between 66 and 156 Myr, with relaxed clocks converging on 66-67 Myr. Chinese Sapindales, probably Nitraria, were the first source of transferred genes, followed by Iranian and Mediterranean Caryophyllales. The most recently acquired gene appears to come from a Tamarix host in the Iberian Peninsula. CONCLUSIONS: Data on HGTs that have accumulated over the past 15 years, along with this discovery of multiple HGTs within a single widespread species, underline the need for more whole-genome data from parasite-host pairs to investigate whether and how transferred copies coexist with, or replace, native functional genes.
Subject(s)
Cynomorium/genetics, Horizontal Gene Transfer, Plant Genes, Mitochondrial Genome, Plastid Genomes, Plant Dispersal/genetics, Mitochondrial Genes, Italy
ABSTRACT
PREMISE: Although hybridization has played an important role in the evolution of many plant species, phylogenetic reconstructions that include hybridizing lineages have been historically constrained by the available models and data. Restriction-site-associated DNA sequencing (RADseq) has been a popular sequencing technique for the reconstruction of hybridization in the next-generation sequencing era. However, the utility of RADseq for the reconstruction of complex evolutionary networks has not been thoroughly investigated. Conflicting phylogenetic relationships in the genus Medicago have been mainly attributed to hybridization, but the specific hybrid origins of taxa have not yet been clarified. METHODS: We obtained new molecular data from diploid species of Medicago section Medicago using single-digest RADseq to reconstruct evolutionary networks from gene trees, an approach that is computationally tractable with data sets that include several species and complex hybridization patterns. RESULTS: Our analyses revealed that assembly filters to exclusively select a small set of loci with high phylogenetic information led to the most-divergent network topologies. Conversely, alternative clustering thresholds or filters on the number of samples per locus had a lower impact on networks. A strong hybridization signal was detected for M. carstiensis and M. cretacea, while signals were less clear for M. rugosa, M. rhodopea, M. suffruticosa, M. marina, M. scutellata, and M. sativa. CONCLUSIONS: Complex network reconstructions from RADseq gene trees were not robust under variations of the assembly parameters and filters. However, when the most-divergent networks were discarded, all remaining analyses consistently supported a hybrid origin for M. carstiensis and M. cretacea.
Subject(s)
High-Throughput Nucleotide Sequencing, Medicago, Base Sequence, Phylogeny, DNA Sequence Analysis
ABSTRACT
An ancestral configuration is one of the combinatorially distinct sets of gene lineages that, for a given gene tree, can reach a given node of a specified species tree. Ancestral configurations have appeared in recursive algebraic computations of the conditional probability that a gene tree topology is produced under the multispecies coalescent model for a given species tree. For matching gene trees and species trees, we study the number of ancestral configurations, considered up to an equivalence relation introduced by Wu (Evolution 66:763-775, 2012) to reduce the complexity of the recursive probability computation. We examine the largest number of non-equivalent ancestral configurations possible for a given tree size n. Whereas the smallest number of non-equivalent ancestral configurations increases polynomially with n, we show that the largest number increases with [Formula: see text], where k is a constant that satisfies [Formula: see text]. Under a uniform distribution on the set of binary labeled trees with a given size n, the mean number of non-equivalent ancestral configurations grows exponentially with n. The results refine an earlier analysis of the number of ancestral configurations considered without applying the equivalence relation, showing that use of the equivalence relation does not alter the exponential nature of the increase with tree size.
Subject(s)
Genetic Models, Phylogeny, Algorithms, Computational Biology, Molecular Evolution, Genetic Speciation, Mathematical Concepts, Statistical Models, Probability
ABSTRACT
Compact coalescent histories are combinatorial structures that describe for a given gene tree G and species tree S possibilities for the numbers of coalescences of G that take place on the various branches of S. They have been introduced as a data structure for evaluating probabilities of gene tree topologies conditioning on species trees, reducing computation time compared to standard coalescent histories. When gene trees and species trees have a matching labeled topology [Formula: see text], the compact coalescent histories of t are encoded by particular integer labelings of the branches of t, each integer specifying the number of coalescent events of G present in a branch of S. For matching gene trees and species trees, we investigate enumerative properties of compact coalescent histories. We report a recursion for the number of compact coalescent histories for matching gene trees and species trees, using it to study the numbers of compact coalescent histories for small trees. We show that the number of compact coalescent histories equals the number of coalescent histories if and only if the labeled topology is a caterpillar or a bicaterpillar. The number of compact coalescent histories is seen to increase with tree imbalance: we prove that as the number of taxa n increases, the exponential growth of the number of compact coalescent histories follows [Formula: see text] in the case of caterpillar or bicaterpillar labeled topologies and approximately [Formula: see text] and [Formula: see text] for lodgepole and balanced topologies, respectively. We prove that the mean number of compact coalescent histories of a labeled topology of size n selected uniformly at random grows with [Formula: see text]. Our results contribute to the analysis of the computational complexity of algorithms for computing gene tree probabilities, and to the combinatorial study of gene trees and species trees more generally.
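For orientation, the baseline quantity that compact coalescent histories are compared against, the number of standard coalescent histories for a matching gene tree and species tree, can be computed with a known recursion over m-extended coalescent histories from the earlier literature. The sketch below implements that baseline only. It is not the paper's recursion for compact histories, and the nested-tuple tree encoding and function names are assumptions.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def extended_histories(tree, m):
    """Number of m-extended coalescent histories for a matching labeled topology."""
    if not isinstance(tree, tuple):  # a single leaf has exactly one history
        return 1
    left, right = tree
    return sum(
        extended_histories(left, k + 1) * extended_histories(right, k + 1)
        for k in range(1, m + 1)
    )


def coalescent_histories(tree):
    """Number of coalescent histories when gene tree and species tree match."""
    return extended_histories(tree, 1)


caterpillar4 = ((("A", "B"), "C"), "D")
caterpillar5 = (caterpillar4, "E")
balanced4 = (("A", "B"), ("C", "D"))
```

For caterpillars the counts are Catalan numbers (5 for four taxa, 14 for five), and by the result stated above these are also the numbers of compact coalescent histories for those topologies.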
Subject(s)
Genetic Speciation, Genetic Models, Phylogeny, Algorithms, Computational Biology, Molecular Evolution, Population Genetics/statistics & numerical data, Mathematical Concepts, Probability
ABSTRACT
BACKGROUND: Duplication-Transfer-Loss (DTL) reconciliation is a powerful and increasingly popular technique for studying the evolution of microbial gene families. DTL reconciliation requires the use of rooted gene trees to perform the reconciliation with the species tree, and the standard technique for rooting gene trees is to assign a root that results in the minimum reconciliation cost across all rootings of that gene tree. However, even though it is well understood that many gene trees have multiple optimal roots, only a single optimal root is randomly chosen to create the rooted gene tree and perform the reconciliation. This remains an important overlooked and unaddressed problem in DTL reconciliation, leading to incorrect evolutionary inferences. In this work, we perform an in-depth analysis of the impact of uncertain gene tree rooting on the computed DTL reconciliation and provide the first computational tools to quantify and negate the impact of gene tree rooting uncertainty on DTL reconciliation. RESULTS: Our analysis of a large data set of over 4500 gene families from 100 species shows that a large fraction of gene trees have multiple optimal rootings, that these multiple roots often, but not always, appear closely clustered together in the same region of the gene tree, that many aspects of the reconciliation remain conserved across the multiple rootings, that gene tree error has a profound impact on the prevalence and structure of multiple optimal rootings, and that there are specific interesting patterns in the reconciliation of those gene trees that have multiple optimal roots. CONCLUSIONS: Our results show that unrooted gene trees can be meaningfully reconciled and high-quality evolutionary information can be obtained from them even after accounting for multiple optimal rootings. 
In addition, the techniques and tools introduced in this paper make it possible to systematically avoid incorrect evolutionary inferences caused by incorrect or uncertain gene tree rooting. These tools have been implemented in the phylogenetic reconciliation software package RANGER-DTL 2.0, freely available from http://compbio.engr.uconn.edu/software/RANGER-DTL/ .
Subject(s)
Algorithms, Molecular Evolution, Gene Duplication, Horizontal Gene Transfer, Genomics/methods, Multigene Family, Phylogeny, Software, Uncertainty
ABSTRACT
Species tree reconstruction from genome-wide data is increasingly being attempted, in most cases using a two-step approach of first estimating individual gene trees and then summarizing them to obtain a species tree. The accuracy of this approach, which promises to account for gene tree discordance, depends on the quality of the inferred gene trees. At the same time, phylogenomic and phylotranscriptomic analyses typically use involved bioinformatics pipelines for data preparation. Errors and shortcomings resulting from these preprocessing steps may impact the species tree analyses at the other end of the pipeline. In this article, we first show that the presence of fragmentary data for some species in a gene alignment, as often seen on real data, can result in substantial deterioration of gene trees, and as a result, the species tree. We then investigate a simple filtering strategy where individual fragmentary sequences are removed from individual genes but the rest of the gene is retained. Both in simulations and by reanalyzing a large insect phylotranscriptomic data set, we show the effectiveness of this simple filtering strategy.
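The filtering strategy described here, dropping individual fragmentary sequences from a gene alignment while keeping the rest of the gene, might look like the following minimal sketch. The abstract does not state the criterion used to call a sequence fragmentary, so the `min_fraction` threshold, the gap characters, and the function name are all hypothetical.

```python
def filter_fragments(alignment, min_fraction=0.5, gap_chars="-?"):
    """Drop sequences whose non-gap fraction is below min_fraction.

    alignment: dict mapping taxon name to an aligned sequence string.
    Returns a new dict; the remaining sequences are left untouched.
    """
    kept = {}
    for name, seq in alignment.items():
        non_gap = sum(1 for char in seq if char not in gap_chars)
        if seq and non_gap / len(seq) >= min_fraction:
            kept[name] = seq
    return kept


# Hypothetical gene alignment in which taxon_c is a short fragment.
gene = {
    "taxon_a": "ATGCCGTTAGCA",
    "taxon_b": "ATGCCATTAGCA",
    "taxon_c": "--------AG--",
    "taxon_d": "ATGC--TTAGCA",
}
filtered = filter_fragments(gene)
```

Applying the filter per gene, rather than discarding whole genes or whole species, keeps the retained sequences available for gene tree estimation.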
Subject(s)
Genomics/methods, Phylogeny, Protein Sequence Analysis/methods, Algorithms, Animals, Computer Simulation, Genetic Speciation, Genome, Insects/genetics, Genetic Models, Peptide Fragments/genetics
ABSTRACT
Phylogenomics has ushered in an age of discordance. Analyses often reveal abundant discordances among phylogenies of different parts of genomes, as well as incongruences between species trees obtained using different methods or data partitions. Researchers are often left trying to make sense of such incongruences. Interpretive ways of measuring and visualizing discordance are needed, both among alternative species trees and gene trees, especially for specific focal branches of a tree. Here, we introduce DiscoVista, a publicly available tool that creates a suite of simple but interpretable visualizations. DiscoVista helps quantify the amount of discordance and some of its potential causes.
Subject(s)
Classification/methods, Software, Genome, Genetic Models, Phylogeny
ABSTRACT
BACKGROUND: Hybridization is observed in many eukaryotic lineages and can lead to the formation of polyploid species. The study of hybridization and polyploidization faces challenges both in data generation and in accounting for population-level phenomena such as coalescence processes in phylogenetic analysis. Genus Fragaria is one example of a set of plant taxa in which a range of ploidy levels is observed across species, but phylogenetic origins are unknown. RESULTS: Here, using 20 diploid and polyploid Fragaria species, we combine approaches from NGS data analysis and phylogenetics to infer evolutionary origins of polyploid strawberries, taking into account coalescence processes. We generate haplotype sequences for 257 low-copy nuclear markers assembled from Illumina target capture sequence data. We then identify putative hybridization events by analyzing gene tree topologies, and further test predicted hybridizations in a coalescence framework. This approach confirms the allopolyploid ancestry of F. chiloensis and F. virginiana, and provides new allopolyploid ancestry hypotheses for F. iturupensis, F. moschata, and F. orientalis. Evidence of gene flow between diploids F. bucharica and F. vesca is also detected, suggesting that it might be appropriate to consider these groups as conspecifics. CONCLUSIONS: This study is one of the first in which target capture sequencing followed by computational deconvolution of individual haplotypes is used for tracing origins of polyploid taxa. The study also provides new perspectives on the evolutionary history of Fragaria.