Results 1 - 20 of 60
1.
Nucleic Acids Res ; 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39016185

ABSTRACT

Gene clusters are genomic loci that contain multiple genes that are functionally and genetically linked. Gene clusters collectively encode diverse functions, including small molecule biosynthesis, nutrient assimilation, metabolite degradation, and production of proteins essential for growth and development. Identifying gene clusters is a powerful tool for small molecule discovery and provides insight into the ecology and evolution of organisms. Current detection algorithms focus on canonical 'core' biosynthetic functions many gene clusters encode, while overlooking uncommon or unknown cluster classes. These overlooked clusters are a potential source of novel natural products and comprise an untold portion of overall gene cluster repertoires. Unbiased, function-agnostic detection algorithms therefore provide an opportunity to reveal novel classes of gene clusters and more precisely define genome organization. We present CLOCI (Co-occurrence Locus and Orthologous Cluster Identifier), an algorithm that identifies gene clusters using multiple proxies of selection for coordinated gene evolution. Our approach generalizes gene cluster detection and gene cluster family circumscription, improves detection of multiple known functional classes, and unveils non-canonical gene clusters. CLOCI is suitable for genome-enabled small molecule mining, and presents an easily tunable approach for delineating gene cluster families and homologous loci.

2.
Stat Appl Genet Mol Biol ; 23(1), 2024 Jan 01.
Article in English | MEDLINE | ID: mdl-38366619

ABSTRACT

Methods based on the multi-species coalescent have been widely used in phylogenetic tree estimation using genome-scale DNA sequence data to understand the underlying evolutionary relationship between the sampled species. Evolutionary processes such as hybridization, which creates new species through interbreeding between two different species, necessitate inferring a species network instead of a species tree. A species tree is strictly bifurcating and thus fails to incorporate hybridization events which require an internal node of degree three. Hence, it is crucial to decide whether a tree or network analysis should be performed given a DNA sequence data set, a decision that is based on the presence of hybrid species in the sampled species. Although many methods have been proposed for hybridization detection, it is rare to find a technique that does so globally while considering a data generation mechanism that allows both hybridization and incomplete lineage sorting. In this paper, we consider hybridization and coalescence in a unified framework and propose a new test that can detect whether there are any hybrid species in a set of species of arbitrary size. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.


Subject(s)
Biological Evolution; Hybridization, Genetic; Phylogeny; Models, Genetic
3.
Bioinformatics ; 38(23): 5182-5190, 2022 11 30.
Article in English | MEDLINE | ID: mdl-36227122

ABSTRACT

MOTIVATION: The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large datasets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes. RESULTS: We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site-pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the non-parametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons. AVAILABILITY AND IMPLEMENTATION: The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows and Linux operating systems. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
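
The non-parametric bootstrap mentioned in this abstract can be sketched generically: resample alignment sites with replacement, recompute the estimator on each replicate, and take the sample variance of the replicates. The function names and the toy estimator below (the proportion of differing sites between two sequences) are illustrative assumptions; this is not the MAPCL estimator and not PAUP* code.

```python
import random
from statistics import variance

def p_distance(sites):
    """Toy estimator (assumption): fraction of sites at which two sequences differ."""
    return sum(a != b for a, b in sites) / len(sites)

def bootstrap_variance(sites, estimator, n_reps=500, seed=0):
    """Estimate the variance of `estimator` by resampling sites with replacement."""
    rng = random.Random(seed)
    n = len(sites)
    reps = []
    for _ in range(n_reps):
        # Draw n site indices with replacement to form one bootstrap replicate.
        resample = [sites[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    return variance(reps)

# Two aligned toy sequences, stored as a list of per-site column pairs.
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACCTACGT"
sites = list(zip(seq1, seq2))

print(p_distance(sites))
print(bootstrap_variance(sites, p_distance))
```

Resampling whole sites treats them as independent draws, which matches the site-based composite-likelihood setting only loosely; for linked loci one would typically resample loci or blocks instead.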


Subject(s)
Algorithms; Software; Phylogeny; Computer Simulation; Probability; Models, Genetic; Genetic Speciation
4.
Mol Phylogenet Evol ; 179: 107650, 2023 02.
Article in English | MEDLINE | ID: mdl-36441104

ABSTRACT

The effect of selection acting on regions of the genome on the accuracy of species-level phylogenetic inference using methods that do not explicitly model selection is an open question that is relevant to most, if not all, phylogenomic studies. To address this, we derive a mathematical approximation to the Wright-Fisher model with mutation and selection in the limit as the population size becomes large. In contrast to previous approximations based on diffusion processes, our approximation can be used to study the distribution of coalescent times for an arbitrary number of lineages, allowing calculation of the probability distribution of gene genealogies under the coalescent model. We use these calculations to show that direct selection at strengths typically encountered in practice has only a small effect on the distribution of coalescent times, and hence on the distribution of gene trees. This implies that many coalescent-based methods for estimating the species tree topology will be robust to the presence of selection in a subset of the underlying genes. Selection will, however, bias the estimation of speciation times, causing them to underestimate the true speciation times. Our model captures the effects of selection on the genealogies that generate the observed sequence data, but does not model selective pressures that act only on the subsequent sequences or that negatively impact gene tree estimation.
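
For readers unfamiliar with the Wright-Fisher model with mutation and selection, a minimal forward simulation may help. The sketch assumes a haploid population of fixed size `n`, relative fitness `1 + s` for the focal allele, and symmetric mutation at rate `u`; these choices and all names are illustrative assumptions, and the code does not implement the large-population approximation derived in the paper.

```python
import random

def expected_freq_after_selection(p, s):
    """Deterministic allele frequency after selection (haploid, fitness 1+s vs 1)."""
    return p * (1 + s) / (1 + p * s)

def wright_fisher_step(p, n, s, u, rng):
    """One generation: selection, symmetric mutation at rate u, then binomial drift."""
    p_sel = expected_freq_after_selection(p, s)
    p_mut = p_sel * (1 - u) + (1 - p_sel) * u
    # Binomial sampling of n offspring, written out with the stdlib only.
    count = sum(rng.random() < p_mut for _ in range(n))
    return count / n

def simulate(p0, n, s, u, generations, seed=1):
    """Return the allele frequency trajectory, starting at p0."""
    rng = random.Random(seed)
    traj = [p0]
    for _ in range(generations):
        traj.append(wright_fisher_step(traj[-1], n, s, u, rng))
    return traj

traj = simulate(p0=0.2, n=200, s=0.05, u=1e-4, generations=50)
print(traj[-1])
```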


Subject(s)
Genetic Speciation; Models, Genetic; Phylogeny; Probability; Mutation
5.
PLoS Comput Biol ; 18(12): e1010560, 2022 12.
Article in English | MEDLINE | ID: mdl-36459515

ABSTRACT

Although the role of evolutionary process in cancer progression is widely accepted, increasing attention is being given to the evolutionary mechanisms that can lead to differences in clinical outcome. Recent studies suggest that the temporal order in which somatic mutations accumulate during cancer progression is important. Single-cell sequencing (SCS) provides a unique opportunity to examine the effect that the mutation order has on cancer progression and treatment effect. However, the error rates associated with single-cell sequencing are known to be high, which greatly complicates the task. We propose a novel method for inferring the order in which somatic mutations arise within an individual tumor using noisy data from single-cell sequencing. Our method incorporates models at two levels in that the evolutionary process of somatic mutation within the tumor is modeled along with the technical errors that arise from the single-cell sequencing data collection process. Through analyses of simulations across a wide range of realistic scenarios, we show that our method substantially outperforms existing approaches for identifying mutation order. Most importantly, our method provides a unique means to capture and quantify the uncertainty in the inferred mutation order along a given phylogeny. We illustrate our method by analyzing data from colorectal and prostate cancer patients, in which our method strengthens previously reported mutation orders. Our work is an important step towards producing meaningful prediction of mutation order with high accuracy and measuring the uncertainty of predicted mutation order in cancer patients, with the potential to lead to new insights about the evolutionary trajectories of cancer.


Subject(s)
Neoplasms; Humans; Phylogeny; Neoplasms/genetics; Neoplasms/pathology; Neoplastic Processes; Mutation/genetics; Biological Evolution
6.
Syst Biol ; 70(1): 33-48, 2021 01 01.
Article in English | MEDLINE | ID: mdl-32415974

ABSTRACT

Numerous methods for inferring species-level phylogenies under the coalescent model have been proposed within the last 20 years, and debates continue about the relative strengths and weaknesses of these methods. One desirable property of a phylogenetic estimator is that of statistical consistency, which means intuitively that as more data are collected, the probability that the estimated tree has the same topology as the true tree goes to 1. To date, consistency results for species tree inference under the multispecies coalescent (MSC) have been derived only for summary statistics methods, such as ASTRAL and MP-EST. These methods have been found to be consistent given true gene trees but may be inconsistent when gene trees are estimated from data for loci of finite length. Here, we consider the question of statistical consistency for four taxa for SVDQuartets for general data types, as well as for the maximum likelihood (ML) method in the case in which the data are a collection of sites generated under the MSC model such that the sites are conditionally independent given the species tree (we call these data coalescent independent sites [CIS] data). We show that SVDQuartets is statistically consistent for all data types (i.e., for both CIS data and for multilocus data), and we derive its rate of convergence. We additionally show that ML is consistent for CIS data under the JC69 model and discuss why a proof for the more general multilocus case is difficult. Finally, we compare the performance of ML and SVDQuartets using simulation for both data types. [Consistency; gene tree; maximum likelihood; multilocus data; phylogenetic inference; species tree; SVDQuartets.].


Subject(s)
Genetic Speciation; Models, Genetic; Computer Simulation; Phylogeny; Probability
7.
Syst Biol ; 70(5): 891-907, 2021 08 11.
Article in English | MEDLINE | ID: mdl-33404632

ABSTRACT

Interspecific hybridization is an important evolutionary phenomenon that generates genetic variability in a population and fosters species diversity in nature. The availability of large genome scale data sets has revolutionized hybridization studies to shift from the observation of the presence or absence of hybrids to the investigation of the genomic constitution of hybrids and their genome-specific evolutionary dynamics. Although a handful of methods have been proposed in an attempt to identify hybrids, accurate detection of hybridization from genomic data remains a challenging task. In addition to methods that infer phylogenetic networks or that utilize pairwise divergence, site pattern frequency based and population genetic clustering approaches are popularly used in practice, though the performance of these methods under different hybridization scenarios has not been extensively examined. Here, we use simulated data to comparatively evaluate the performance of four tools that are commonly used to infer hybridization events: the site pattern frequency based methods HyDe and the $D$-statistic (i.e., the ABBA-BABA test) and the population clustering approaches structure and ADMIXTURE. We consider single hybridization scenarios that vary in the time of hybridization and the amount of incomplete lineage sorting (ILS) for different proportions of parental contributions ($\gamma$); introgressive hybridization; multiple hybridization scenarios; and a mixture of ancestral and recent hybridization scenarios. We focus on the statistical power to detect hybridization and the false discovery rate (FDR) for comparisons of the $D$-statistic and HyDe, and the accuracy of the estimates of $\gamma$ as measured by the mean squared error for HyDe, structure, and ADMIXTURE. Both HyDe and the $D$-statistic are powerful for detecting hybridization in all scenarios except those with high ILS, although the $D$-statistic often has an unacceptably high FDR. 
The estimates of $\gamma$ in HyDe are impressively robust and accurate whereas structure and ADMIXTURE sometimes fail to identify hybrids, particularly when the proportional parental contributions are asymmetric (i.e., when $\gamma$ is close to 0). Moreover, the posterior distribution estimated using structure exhibits multimodality in many scenarios, making interpretation difficult. Our results provide guidance in selecting appropriate methods for identifying hybrid populations from genomic data. [ABBA-BABA test; ADMIXTURE; hybridization; HyDe; introgression; Patterson's $D$-statistic; Structure.].
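
The $D$-statistic (ABBA-BABA test) compared in this abstract reduces to a ratio of two site pattern counts. A minimal sketch follows, assuming four aligned sequences in the order P1, P2, P3, outgroup, with the outgroup allele taken as ancestral ("A"); the function names and toy sequences are illustrative and are not taken from any of the evaluated tools.

```python
def abba_baba_counts(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns; the outgroup allele defines 'A'."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # keep biallelic sites only
        if c == o:
            continue  # P3 must carry the derived allele 'B'
        if a == o and b == c:
            abba += 1  # pattern A B B A
        elif b == o and a == c:
            baba += 1  # pattern B A B A
    return abba, baba

def d_statistic(abba, baba):
    """Patterson's D = (ABBA - BABA) / (ABBA + BABA)."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites")
    return (abba - baba) / total

# Toy alignment: three ABBA sites, one BABA site, two uninformative sites.
p1, p2, p3, out = "ACGATA", "TCAGTC", "TCGGCC", "ACAATA"
abba, baba = abba_baba_counts(p1, p2, p3, out)
print(abba, baba, d_statistic(abba, baba))
```

In practice the significance of $D$ is usually assessed with a block jackknife or similar resampling over the genome, which this sketch omits.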


Subject(s)
Genome; Hybridization, Genetic; Genetics, Population; Genomics; Phylogeny
8.
J Math Biol ; 86(1): 13, 2022 12 09.
Article in English | MEDLINE | ID: mdl-36482146

ABSTRACT

Phylogenetic diversity indices such as the Fair Proportion (FP) index are frequently discussed as prioritization criteria in biodiversity conservation. They rank species according to their contribution to overall diversity by taking into account the unique and shared evolutionary history of each species as indicated by its placement in an underlying phylogenetic tree. Traditionally, phylogenetic trees were inferred from single genes and the resulting gene trees were assumed to be a valid estimate for the species tree, i.e., the "true" evolutionary history of the species under consideration. However, nowadays it is common to sequence whole genomes of hundreds or thousands of genes, and it is often the case that conflicting genealogical histories exist in different genes throughout the genome, resulting in discordance between individual gene trees and the species tree. Here, we analyze the effects of gene and species tree discordance on prioritization decisions based on the FP index. In particular, we consider the ranking order of taxa induced by (i) the FP index on a species tree, and (ii) the expected FP index across all gene tree histories associated with the species tree. On the one hand, we show that for particular tree shapes, the two rankings always coincide. On the other hand, we show that for all leaf numbers greater than or equal to five, there exist species trees for which the two rankings differ. Finally, we illustrate the variability in the rankings obtained from the FP index across different gene tree and species tree estimates for an empirical multilocus mammal data set.
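
The FP index itself is simple to compute: each edge length is divided equally among the leaves descending from that edge, and a leaf's index is the sum of its shares along the path from the root. The sketch below assumes a rooted tree stored as a dictionary mapping each internal node to a list of `(child, branch_length)` pairs; that representation and the toy tree are illustrative choices, not taken from the paper.

```python
def fair_proportion(tree, root):
    """Return {leaf: FP index}; each edge is split equally among its descendant leaves."""
    fp = {}

    def leaves_below(node):
        children = tree.get(node, [])
        if not children:
            return [node]
        out = []
        for child, _ in children:
            out.extend(leaves_below(child))
        return out

    def walk(node, acc):
        children = tree.get(node, [])
        if not children:
            fp[node] = acc  # reached a leaf: record the accumulated shares
            return
        for child, length in children:
            share = length / len(leaves_below(child))
            walk(child, acc + share)

    walk(root, 0.0)
    return fp

# Toy species tree ((A:2,B:2):1,C:3).
tree = {
    "root": [("x", 1.0), ("C", 3.0)],
    "x": [("A", 2.0), ("B", 2.0)],
}
fp = fair_proportion(tree, "root")
ranking = sorted(fp, key=fp.get, reverse=True)
print(fp, ranking)
```

The FP values sum to the total branch length of the tree, which gives a quick sanity check on any implementation.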


Subject(s)
Phylogeny
9.
J Math Biol ; 84(6): 47, 2022 05 03.
Article in English | MEDLINE | ID: mdl-35503141

ABSTRACT

The evolutionary relationships among organisms have traditionally been represented using rooted phylogenetic trees. However, due to reticulate processes such as hybridization or lateral gene transfer, evolution cannot always be adequately represented by a phylogenetic tree, and rooted phylogenetic networks that describe such complex processes have been introduced as a generalization of rooted phylogenetic trees. In fact, estimating rooted phylogenetic networks from genomic sequence data and analyzing their structural properties is one of the most important tasks in contemporary phylogenetics. Over the last two decades, several subclasses of rooted phylogenetic networks (characterized by certain structural constraints) have been introduced in the literature, either to model specific biological phenomena or to enable tractable mathematical and computational analyses. In the present manuscript, we provide a thorough review of these network classes, as well as provide a biological interpretation of the structural constraints underlying these networks where possible. In addition, we discuss how imposing structural constraints on the network topology can be used to address the scalability and identifiability challenges faced in the estimation of phylogenetic networks from empirical data.


Subject(s)
Gene Transfer, Horizontal; Hybridization, Genetic; Algorithms; Biological Evolution; Models, Genetic; Phylogeny
10.
Mol Phylogenet Evol ; 161: 107142, 2021 08.
Article in English | MEDLINE | ID: mdl-33713799

ABSTRACT

Despite the recent availability of large-scale genomic data for many individuals, few methods for phylogenetic inference are both computationally efficient and highly accurate for trees with hundreds of taxa. Model-based methods such as those developed in the maximum likelihood and Bayesian frameworks are especially time-consuming, as they involve both computationally intensive calculations on fixed phylogenies and searches through the space of possible phylogenies, and they are known to scale poorly with the addition of taxa. Here, we propose a fast approximation to the maximum likelihood estimator that directly uses continuous trait data, such as allele frequency data. The approximation works by first computing the maximum likelihood estimates of some internal branch lengths, and then inferring the tree topology using these estimates. Our approach is more computationally efficient than existing methods for such data while still achieving comparable accuracy. This method is innovative in its use of the mathematical properties of tree topologies for inference, and thus serves as a useful addition to the collection of methods available for estimating phylogenies from continuous trait data.


Subject(s)
Likelihood Functions; Phylogeny; Bayes Theorem; Gene Frequency; Humans; Phenotype; Reproducibility of Results; Research Design
11.
Bull Math Biol ; 83(9): 93, 2021 07 23.
Article in English | MEDLINE | ID: mdl-34297209

ABSTRACT

Inference of the evolutionary histories of species, commonly represented by a species tree, is complicated by the divergent evolutionary history of different parts of the genome. Different loci on the genome can have different histories from the underlying species tree (and each other) due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. The multispecies coalescent is a commonly used model for performing inference on species and gene trees in the presence of ILS. This paper introduces Lily-T and Lily-Q, two new methods for species tree inference under the multispecies coalescent. We then compare them to two frequently used methods, SVDQuartets and ASTRAL, using simulated and empirical data. Both methods generally showed improvement over SVDQuartets, and Lily-Q was superior to Lily-T for most simulation settings. The comparison to ASTRAL was more mixed: Lily-Q tended to be better than ASTRAL when the length of recombination-free loci was short, when the coalescent population parameter θ was small, or when the internal branch lengths were longer.


Subject(s)
Genetic Speciation; Mathematical Concepts; Bayes Theorem; Computer Simulation; Models, Genetic; Phylogeny
12.
BMC Evol Biol ; 19(1): 112, 2019 05 30.
Article in English | MEDLINE | ID: mdl-31146685

ABSTRACT

BACKGROUND: Coalescent-based species tree inference has become widely used in the analysis of genome-scale multilocus and SNP datasets when the goal is inference of a species-level phylogeny. However, numerous evolutionary processes are known to violate the assumptions of a coalescence-only model and complicate inference of the species tree. One such process is hybrid speciation, in which a species shares its ancestry with two distinct species. Although many methods have been proposed to detect hybrid speciation, only a few have considered both hybridization and coalescence in a unified framework, and these are generally limited to the setting in which putative hybrid species must be identified in advance. RESULTS: Here we propose a method that can examine genome-scale data for a large number of taxa and detect those taxa that may have arisen via hybridization, as well as their potential "parental" taxa. The method is based on a model that considers both coalescence and hybridization together, and uses phylogenetic invariants to construct a test that scales well in terms of computational time for both the number of taxa and the amount of sequence data. We test the method using simulated data for up to 20 taxa and 100,000 bp, and find that the method accurately identifies both recent and ancient hybrid species in less than 30 s. We apply the method to two empirical datasets, one composed of Sistrurus rattlesnakes for which hybrid speciation is not supported by previous work, and one consisting of several species of Heliconius butterflies for which some evidence of hybrid speciation has been previously found. CONCLUSIONS: The proposed method is powerful for detecting hybridization for both recent and ancient hybridization events. The computations required can be carried out rapidly for a large number of sequences using genome-scale data, and the method is appropriate for both SNP and multilocus data.


Subject(s)
Databases, Genetic; Genomics; Hybridization, Genetic; Models, Genetic; Animals; Butterflies/genetics; Computer Simulation; Crotalus/genetics; Genetic Speciation; Phylogeny; Species Specificity
13.
Bioinformatics ; 34(3): 407-415, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29028881

ABSTRACT

Motivation: Genotyping and parameter estimation using high throughput sequencing data are everyday tasks for population geneticists, but methods developed for diploids are typically not applicable to polyploid taxa. This is due to their duplicated chromosomes, as well as the complex patterns of allelic exchange that often accompany whole genome duplication (WGD) events. For WGDs within a single lineage (autopolyploids), inbreeding can result from mixed mating and/or double reduction. For WGDs that involve hybridization (allopolyploids), alleles are typically inherited through independently segregating subgenomes. Results: We present two new models for estimating genotypes and population genetic parameters from genotype likelihoods for auto- and allopolyploids. We then use simulations to compare these models to existing approaches at varying depths of sequencing coverage and ploidy levels. These simulations show that our models typically have lower levels of estimation error for genotype and parameter estimates, especially when sequencing coverage is low. Finally, we also apply these models to two empirical datasets from the literature. Overall, we show that the use of genotype likelihoods to model non-standard inheritance patterns is a promising approach for conducting population genomic inferences in polyploids. Availability and implementation: A C++ program, EBG, is provided to perform inference using the models we describe. It is available under the GNU GPLv3 on GitHub: https://github.com/pblischak/polyploid-genotyping. Contact: blischak.4@osu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
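
A genotype likelihood for a polyploid can be illustrated with a simple binomial read-count model: given a genotype with `g` reference copies out of `ploidy`, each read shows the reference allele with a probability that mixes the true allele fraction with a sequencing error rate. This binomial model, the fixed per-read `error` value, and the function name are assumptions made for illustration; the sketch is not the EBG implementation.

```python
from math import comb

def genotype_likelihoods(ref_reads, total_reads, ploidy, error=0.01):
    """P(read counts | genotype g) for g = 0..ploidy copies of the reference allele."""
    liks = []
    for g in range(ploidy + 1):
        frac = g / ploidy
        # Probability that a single read shows the reference allele.
        p_ref = frac * (1 - error) + (1 - frac) * error
        lik = (comb(total_reads, ref_reads)
               * p_ref ** ref_reads
               * (1 - p_ref) ** (total_reads - ref_reads))
        liks.append(lik)
    return liks

# Tetraploid individual with 2 reference reads out of 8.
liks = genotype_likelihoods(ref_reads=2, total_reads=8, ploidy=4)
best = max(range(len(liks)), key=liks.__getitem__)
print(liks, best)
```

With low coverage the likelihoods of neighbouring genotypes are often close, which is one reason to carry the full likelihood vector into downstream estimation rather than a single called genotype.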


Subject(s)
Genotyping Techniques/methods; Inbreeding; Polymorphism, Single Nucleotide; Polyploidy; Sequence Analysis, DNA/methods; Software; Alleles; Animals; Eukaryota/genetics; Genetics, Population/methods; High-Throughput Nucleotide Sequencing/methods
14.
Syst Biol ; 67(5): 770-785, 2018 09 01.
Article in English | MEDLINE | ID: mdl-29566212

ABSTRACT

Most current methods for inferring species-level phylogenies under the coalescent model assume that no gene flow occurs following speciation. Several studies have examined the impact of gene flow (e.g., Eckert and Carstens 2008; Chung and Ané 2011; Leaché et al. 2014; Solís-Lemus et al. 2016) and of ancestral population structure (DeGiorgio and Rosenberg 2016) on the performance of species-level phylogenetic inference, and analytic results have been proven for network models of gene flow (e.g., Solís-Lemus et al. 2016; Zhu et al. 2016). However, there are few analytic results for a continuous model of gene flow following speciation, despite the development of mathematical tools that could facilitate such study (e.g., Hobolth et al. 2011; Andersen et al. 2014; Tian and Kubatko 2016). In this article, we consider a three-taxon isolation-with-migration model that allows gene flow between sister taxa for a brief period following speciation, as well as variation in the effective population sizes across the species tree. We derive the probabilities of each of the three gene tree topologies under this model, and show that for certain choices of the gene flow and effective population size parameters, anomalous gene trees (i.e., gene trees that are discordant with the species tree but that have higher probability than the gene tree concordant with the species tree) exist. We characterize the region of parameter space producing anomalous trees and show that the probability of the gene tree that is concordant with the species tree can be arbitrarily small. We then show that there is theoretical support for using SVDQuartets with an outgroup to infer the rooted three-taxon species tree in a model of gene flow between sister taxa. We study the performance of SVDQuartets on simulated data and compare it to three other commonly used methods for species tree inference, ASTRAL, MP-EST, and concatenation.
The simulations show that ASTRAL, MP-EST, and concatenation can be statistically inconsistent when gene flow is present, while SVDQuartets performs well, though large sample sizes may be required for certain parameter choices.
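
As a point of comparison for the anomalous gene trees described in this abstract, the standard three-taxon multispecies coalescent without gene flow gives the concordant topology probability 1 − (2/3)e^(−t) and each discordant topology probability (1/3)e^(−t), where t is the internal branch length in coalescent units. The sketch below encodes only this no-gene-flow baseline; it is not the isolation-with-migration model of the paper, and the function name is an illustrative assumption.

```python
from math import exp

def three_taxon_gene_tree_probs(t):
    """Topology probabilities for species tree ((A,B),C) with internal branch t
    (coalescent units), constant population size, and no gene flow."""
    discordant = exp(-t) / 3
    return {
        "((A,B),C)": 1 - 2 * discordant,  # concordant with the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }

for t in (0.0, 0.5, 2.0):
    print(t, three_taxon_gene_tree_probs(t))
```

In this baseline the concordant topology is never less probable than a discordant one, so anomalous three-taxon gene trees cannot occur; their existence in the paper's model is therefore attributable to gene flow and varying population sizes.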


Subject(s)
Gene Flow; Genetic Speciation; Models, Genetic; Phylogeny; Population Density; Probability
15.
Syst Biol ; 67(5): 821-829, 2018 09 01.
Article in English | MEDLINE | ID: mdl-29562307

ABSTRACT

The analysis of hybridization and gene flow among closely related taxa is a common goal for researchers studying speciation and phylogeography. Many methods for hybridization detection use simple site pattern frequencies from observed genomic data and compare them to null models that predict an absence of gene flow. The theory underlying the detection of hybridization using these site pattern probabilities exploits the relationship between the coalescent process for gene trees within population trees and the process of mutation along the branches of the gene trees. For certain models, site patterns are predicted to occur in equal frequency (i.e., their difference is 0), producing a set of functions called phylogenetic invariants. In this article, we introduce HyDe, a software package for detecting hybridization using phylogenetic invariants arising under the coalescent model with hybridization. HyDe is written in Python and can be used interactively or through the command line using pre-packaged scripts. We demonstrate the use of HyDe on simulated data, as well as on two empirical data sets from the literature. We focus in particular on identifying individual hybrids within population samples and on distinguishing between hybrid speciation and gene flow. HyDe is freely available as an open source Python package under the GNU GPL v3 on both GitHub (https://github.com/pblischak/HyDe) and the Python Package Index (PyPI: https://pypi.python.org/pypi/phyde).


Subject(s)
Computational Biology/methods; Gene Flow; Genetic Speciation; Hybridization, Genetic; Software
16.
Stat Appl Genet Mol Biol ; 17(3), 2018 06 06.
Article in English | MEDLINE | ID: mdl-29874197

ABSTRACT

The increasing availability of population-level allele frequency data across one or more related populations necessitates the development of methods that can efficiently estimate population genetics parameters, such as the strength of selection acting on the population(s), from such data. Existing methods for this problem in the setting of the Wright-Fisher diffusion model are primarily likelihood-based, and rely on numerical approximation for likelihood computation and on bootstrapping for assessment of variability in the resulting estimates, requiring extensive computation. Recent work has provided a method for obtaining exact samples from general Wright-Fisher diffusion processes, enabling the development of methods for Bayesian estimation in this setting. We develop and implement a Bayesian method for estimating the strength of selection based on the Wright-Fisher diffusion for data sampled at a single time point. The method utilizes the latest algorithms for exact sampling to devise a Markov chain Monte Carlo procedure to draw samples from the joint posterior distribution of the selection coefficient and the allele frequencies. We demonstrate that when assumptions about the initial allele frequencies are accurate the method performs well for both simulated data and for an empirical data set on hypoxia in flies, where we find evidence for strong positive selection in a region of chromosome 2L previously identified. We discuss possible extensions of our method to the more general settings commonly encountered in practice, highlighting the advantages of Bayesian approaches to inference in this setting.


Subject(s)
Bayes Theorem; Gene Frequency; Genetics, Population; Models, Genetic; Algorithms; Animals; Drosophila melanogaster/genetics; Hypoxia/genetics; Likelihood Functions; Markov Chains; Monte Carlo Method; Polymorphism, Single Nucleotide
17.
Bull Math Biol ; 81(2): 408-430, 2019 02.
Article in English | MEDLINE | ID: mdl-29926380

ABSTRACT

Coalescent models of evolution account for incomplete lineage sorting by specifying a species tree parameter which determines a distribution on gene trees, and consequently, a site pattern probability distribution. It has been shown that the unrooted topology of the species tree parameter of the multispecies coalescent is generically identifiable, and a reconstruction method called SVDQuartets has been developed to infer this topology. In this paper, we describe a modified multispecies coalescent model that allows for varying effective population size and violations of the molecular clock. We show that the unrooted topology of the species tree parameter for these models is generically identifiable and that SVDQuartets can still be used to infer this topology.


Subject(s)
Models, Genetic; Phylogeny; Computational Biology; Computer Simulation; Evolution, Molecular; Genetic Speciation; Mathematical Concepts; Models, Statistical; Probability
18.
Syst Biol ; 66(4): 620-636, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28123114

ABSTRACT

Detecting variation in the evolutionary process along chromosomes is increasingly important as whole-genome data become more widely available. For example, factors such as incomplete lineage sorting, horizontal gene transfer, and chromosomal inversion are expected to result in changes in the underlying gene trees along a chromosome, while changes in selective pressure and mutational rates for different genomic regions may lead to shifts in the underlying mutational process. We propose the split score as a general method for quantifying support for a particular phylogenetic relationship within a genomic data set. Because the split score is based on algebraic properties of a matrix of site pattern frequencies, it can be rapidly computed, even for data sets that are large in the number of taxa and/or in the length of the alignment, providing an advantage over other methods (e.g., maximum likelihood) that are often used to assess such support. Using simulation, we explore the properties of the split score, including its dependence on sequence length, branch length, size of a split and its ability to detect true splits in the underlying tree. Using a sliding window analysis, we show that split scores can be used to detect changes in the underlying evolutionary process for genome-scale data from primates, mosquitoes, and viruses in a computationally efficient manner. Computation of the split score has been implemented in the software package SplitSup.
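
The "matrix of site pattern frequencies" underlying a split-based score can be made concrete with a small sketch. For a split of the taxa into two sides, rows index the joint nucleotide states of one side and columns those of the other, and each alignment column increments one cell. The function name, the dictionary-of-strings alignment format, and the handling of gaps are assumptions for illustration; the sketch builds the count matrix only and does not compute the split score or reproduce SplitSup.

```python
from itertools import product

BASES = "ACGT"

def flattening(seqs, side_a, side_b):
    """Site pattern count matrix for the split side_a | side_b.
    Rows index joint states of taxa in side_a, columns those of side_b."""
    row_index = {s: i for i, s in enumerate(product(BASES, repeat=len(side_a)))}
    col_index = {s: j for j, s in enumerate(product(BASES, repeat=len(side_b)))}
    matrix = [[0] * len(col_index) for _ in row_index]
    n_sites = len(next(iter(seqs.values())))
    for k in range(n_sites):
        row = tuple(seqs[t][k] for t in side_a)
        col = tuple(seqs[t][k] for t in side_b)
        if row in row_index and col in col_index:  # skip gaps and ambiguity codes
            matrix[row_index[row]][col_index[col]] += 1
    return matrix

# Toy four-taxon alignment; the fifth column has a gap and is skipped.
seqs = {"A": "ACGT-A", "B": "ACGTAA", "C": "ACGAAC", "D": "ACGAAC"}
m = flattening(seqs, ["A", "B"], ["C", "D"])
print(len(m), len(m[0]), sum(map(sum, m)))
```

A score based on algebraic properties of this matrix (for example its singular values) would then be computed with a linear algebra library; applying the function to successive windows of a long alignment gives the kind of sliding window analysis the abstract describes.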


Subject(s)
Classification/methods; Phylogeny; Animals; Culicidae/classification; Culicidae/genetics; Evolution, Molecular; Gene Transfer, Horizontal; Genome/genetics; Primates/classification; Primates/genetics; Software; Viruses/classification; Viruses/genetics
20.
BMC Evol Biol ; 17(1): 263, 2017 12 19.
Article in English | MEDLINE | ID: mdl-29258427

ABSTRACT

BACKGROUND: Phylogenetic tree inference is a fundamental tool to estimate ancestor-descendant relationships among different species. In phylogenetic studies, identification of the root - the most recent common ancestor of all sampled organisms - is essential for complete understanding of the evolutionary relationships. Rooted trees benefit most downstream applications of phylogenies such as species classification or study of adaptation. Often, trees can be rooted by using outgroups, which are species that are known to be more distantly related to the sampled organisms than any other species in the phylogeny. However, outgroups are not always available in evolutionary research. METHODS: In this study, we develop a new method for rooting species trees under the coalescent model, by developing a series of hypothesis tests for rooting quartet phylogenies using site pattern probabilities. The power of this method is examined by simulation studies and by application to an empirical North American rattlesnake data set. RESULTS: The method shows high accuracy across the simulation conditions considered, and performs well for the rattlesnake data. Thus, it provides a computationally efficient way to accurately root species-level phylogenies that incorporates the coalescent process. The method is robust to variation in substitution model, but is sensitive to the assumption of a molecular clock. CONCLUSIONS: Our study establishes a computationally practical method for rooting species trees that is more efficient than traditional methods. The method will benefit numerous evolutionary studies that require rooting a phylogenetic tree without having to specify outgroups.


Subject(s)
Biological Evolution; Computer Simulation; Models, Genetic; Phylogeny; Probability