RESUMEN
Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions1-3. Here we address these issues by analysing the genomes of 363 bird species4 (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.
Asunto(s)
Aves , Evolución Molecular , Genoma , Filogenia , Animales , Aves/genética , Aves/clasificación , Aves/anatomía & histología , Encéfalo/anatomía & histología , Extinción Biológica , Genoma/genética , Genómica , Densidad de Población , Masculino , FemeninoRESUMEN
MOTIVATION: Despite the widespread occurrence of polyploids across the Tree of Life, especially in the plant kingdom, very few computational methods have been developed to handle the specific complexities introduced by polyploids in phylogeny estimation. Furthermore, methods that are designed to account for polyploidy often disregard incomplete lineage sorting (ILS), a major source of heterogeneous gene histories, or are computationally very demanding. Therefore, there is a great need for efficient and robust methods to accurately reconstruct polyploid phylogenies. RESULTS: We introduce Polyphest (POLYploid PHylogeny ESTimation), a new method for efficiently and accurately inferring species phylogenies in the presence of both polyploidy and ILS. Polyphest bypasses the need for extensive network space searches by first generating a multilabeled tree based on gene trees, which is then converted into a (uniquely labeled) species phylogeny. We compare the performance of Polyphest to that of two polyploid phylogeny estimation methods, one of which does not account for ILS, namely PADRE, and another that accounts for ILS, namely MPAllopp. Polyphest is more accurate than PADRE and achieves comparable accuracy to MPAllopp, while being significantly faster. We also demonstrate the application of Polyphest to empirical data from the hexaploid bread wheat and confirm the allopolyploid origin of bread wheat along with the closest relatives for each of its subgenomes. AVAILABILITY AND IMPLEMENTATION: Polyphest is available at https://github.com/NakhlehLab/Polyphest.
Asunto(s)
Filogenia , Poliploidía , Programas Informáticos , Algoritmos , Evolución MolecularRESUMEN
Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method has comparable or better accuracy to previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation, it is flexible enough to enable future implementations of various population models.
Asunto(s)
Genética de Población , Modelos Genéticos , Animales , Simulación por Computador , Humanos , Densidad de Población , Recombinación GenéticaRESUMEN
Trait evolution among a set of species-a central theme in evolutionary biology-has long been understood and analyzed with respect to a species tree. However, the field of phylogenomics, which has been propelled by advances in sequencing technologies, has ushered in the era of species/gene tree incongruence and, consequently, a more nuanced understanding of trait evolution. For a trait whose states are incongruent with the branching patterns in the species tree, the same state could have arisen independently in different species (homoplasy) or followed the branching patterns of gene trees, incongruent with the species tree (hemiplasy). Another evolutionary process whose extent and significance are better revealed by phylogenomic studies is gene flow between different species. In this work, we present a phylogenomic method for assessing the role of hybridization and introgression in the evolution of polymorphic or monomorphic binary traits. We apply the method to simulated evolutionary scenarios to demonstrate the interplay between the parameters of the evolutionary history and the role of introgression in a binary trait's evolution (which we call xenoplasy). Very importantly, we demonstrate, including on a biological data set, that inferring a species tree and using it for trait evolution analysis in the presence of gene flow could lead to misleading hypotheses about trait evolution.
Asunto(s)
Biología Computacional/métodos , Introgresión Genética/genética , Sitios de Carácter Cuantitativo , Evolución Molecular , Especiación Genética , Modelos Genéticos , Fenotipo , FilogeniaRESUMEN
SUMMARY: We report on a new single-cell DNA sequence simulator, SimSCSnTree, which generates an evolutionary tree of cells and evolves single nucleotide variants (SNVs) and copy number aberrations (CNAs) along its branches. Data generated by the simulator can be used to benchmark tools for single-cell genomic analyses, particularly in cancer where SNVs and CNAs are ubiquitous. AVAILABILITY AND IMPLEMENTATION: SimSCSnTree is now on BioConda and also is freely available for download at https://github.com/compbiofan/SimSCSnTree.git with detailed documentation.
Asunto(s)
Genoma , Genómica , Secuencia de Bases , Variaciones en el Número de Copia de ADN , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADN , Análisis de la Célula Individual , Programas InformáticosRESUMEN
MOTIVATION: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. RESULTS: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that, Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer data consisting of 32 cells and 3375 loci as well as an scWGS data of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by bulk sequencing dataset and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects some of which were associated with neurodegenerative diseases. AVAILABILITY AND IMPLEMENTATION: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.
Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Nucleótidos , Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Humanos , Filogenia , Análisis de Secuencia de ADNRESUMEN
Accurate inference of population parameters plays a pivotal role in unravelling evolutionary histories. While recombination has been universally accepted as a fundamental process in the evolution of sexually reproducing organisms, it remains challenging to model it exactly. Thus, existing coalescent-based approaches make different assumptions or approximations to facilitate phylogenetic inference, which can potentially bring about biases in estimates of evolutionary parameters when recombination is present. In this article, we evaluate the performance of population parameter estimation using three methods-StarBEAST2, SNAPP, and diCal2-that represent three different types of inference. We performed whole-genome simulations in which recombination rates, mutation rates, and levels of incomplete lineage sorting were varied. We show that StarBEAST2 using short or medium-sized loci is robust to realistic rates of recombination, which is in agreement with previous studies. SNAPP, as expected, is generally unaffected by recombination events. Most surprisingly, diCal2, a method that is designed to explicitly account for recombination, performs considerably worse than other methods under comparison.
Asunto(s)
Genoma , Tasa de Mutación , Filogenia , Recombinación Genética , Modelos Genéticos , Simulación por ComputadorRESUMEN
Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this article, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package. [Incomplete lineage sorting; minimizing deep coalescences; multilabeled trees; multispecies network coalescent; phylogenetic networks; polyploidy.].
Asunto(s)
Hibridación Genética , Poliploidía , Evolución Biológica , Humanos , FilogeniaRESUMEN
Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.[Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.].
Asunto(s)
Duplicación de Gen , Modelos Genéticos , Simulación por Computador , Genoma , FilogeniaRESUMEN
Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and their overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers causes incorrect inferences as to the overall relationship between bacterial taxa.
Asunto(s)
Genoma Bacteriano , Células Procariotas , Transferencia de Gen Horizontal , Genoma Bacteriano/genética , Filogenia , Alineación de SecuenciaRESUMEN
Accumulation and selection of somatic mutations in a Darwinian framework result in intra-tumor heterogeneity (ITH) that poses significant challenges to the diagnosis and clinical therapy of cancer. Identification of the tumor cell populations (clones) and reconstruction of their evolutionary relationship can elucidate this heterogeneity. Recently developed single-cell DNA sequencing (SCS) technologies promise to resolve ITH to a single-cell level. However, technical errors in SCS data sets, including false-positives (FP) and false-negatives (FN) due to allelic dropout, and cell doublets, significantly complicate these tasks. Here, we propose a nonparametric Bayesian method that reconstructs the clonal populations as clusters of single cells, genotypes of each clone, and the evolutionary relationship between the clones. It employs a tree-structured Chinese restaurant process as the prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-site model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and cell doublets are modeled by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental data sets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-site model of evolution leads to more accurate inferences. Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support of the inferences it makes.
Asunto(s)
Células Clonales , Genotipo , Neoplasias/genética , Análisis de la Célula Individual/métodos , Teorema de Bayes , Humanos , Filogenia , Mutación PuntualRESUMEN
Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.
Asunto(s)
Modelos Genéticos , Filogenia , Teorema de Bayes , Cadenas de Markov , Método de MontecarloRESUMEN
Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods-Ginkgo, HMMcopy, and CopyNumber-on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.
Asunto(s)
Variaciones en el Número de Copia de ADN , Genoma Humano , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual/métodos , Algoritmos , Aberraciones Cromosómicas , Biología Computacional , Simulación por Computador , Dosificación de Gen , Humanos , Mutación , Neoplasias/genética , Ploidias , Distribución de Poisson , Curva ROC , Reproducibilidad de los Resultados , Programas InformáticosRESUMEN
BACKGROUND: Multi-locus species phylogeny inference is based on models of sequence evolution on gene trees as well as models of gene tree evolution within the branches of species phylogenies. Almost all statistical methods for this inference task assume a common mechanism across all loci as captured by a single value of each branch length of the species phylogeny. RESULTS: In this paper, we pursue a "no common mechanism" (NCM) model, where every gene tree evolves according to its own parameters of the species phylogeny. Based on this model, we derive an analytically integrated likelihood of both species trees and networks given the gene trees of multiple loci under an NCM model. We demonstrate the performance of inference under this integrated likelihood on both simulated and biological data. CONCLUSIONS: The model presented here will afford opportunities for exploring connections among various criteria for estimating species phylogenies from multiple, independent loci. Furthermore, further development of this model could potentially result in more efficient methods for searching the space of species phylogenies by focusing solely on the topology of the phylogeny.
Asunto(s)
Evolución Molecular , Genómica/métodos , Animales , Simulación por Computador , Culicidae/genética , Especiación Genética , Funciones de Verosimilitud , Modelos Genéticos , Redes Neurales de la Computación , Filogenia , Probabilidad , Testamentos/estadística & datos numéricosRESUMEN
Achieving complete, accurate, and cost-effective assembly of human genomes is of great importance for realizing the promise of precision medicine. The abundance of repeats and genetic variations in human genomes and the limitations of existing sequencing technologies call for the development of novel assembly methods that can leverage the complementary strengths of multiple technologies. We propose a Hybrid Structural variant Assembly (HySA) approach that integrates sequencing reads from next-generation sequencing and single-molecule sequencing technologies to accurately assemble and detect structural variants (SVs) in human genomes. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, our approach turns a whole genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance the assembly of structurally altered regions in human genomes. We used data generated from a haploid hydatidiform mole genome (CHM1) and a diploid human genome (NA12878) to test our approach. The result showed that, compared with existing methods, our approach had a low false discovery rate and substantially improved the detection of many types of SVs, particularly novel large insertions, small indels (10-50 bp), and short tandem repeat expansions and contractions. Our work highlights the strengths and limitations of current approaches and provides an effective solution for extending the power of existing sequencing technologies for SV discovery.
Asunto(s)
Mapeo Contig/métodos , Genoma Humano , Variación Estructural del Genoma , Genómica/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Animales , Mapeo Contig/normas , Diploidia , Genómica/normas , Haploidia , Humanos , Ratones , Análisis de Secuencia de ADN/normas , Secuencias Repetidas en TándemRESUMEN
MOTIVATION: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. RESULTS: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. AVAILABILITY AND IMPLEMENTATION: We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Algoritmos , Filogenia , Evolución Molecular , Genoma , Alineación de Secuencia , Programas InformáticosRESUMEN
[This corrects the article DOI: 10.1371/journal.pgen.1006006.].
RESUMEN
Current variant callers are not suitable for single-cell DNA sequencing, as they do not account for allelic dropout, false-positive errors and coverage nonuniformity. We developed Monovar (https://bitbucket.org/hamimzafar/monovar), a statistical method for detecting and genotyping single-nucleotide variants in single-cell data. Monovar exhibited superior performance over standard algorithms on benchmarks and in identifying driver mutations and delineating clonal substructure in three different human tumor data sets.
Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Polimorfismo de Nucleótido Simple , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual/métodos , Algoritmos , Benchmarking , Línea Celular , Exoma/genética , Humanos , Sensibilidad y Especificidad , Análisis de Secuencia de ADN/estadística & datos numéricos , Análisis de la Célula Individual/estadística & datos numéricosRESUMEN
Motivation: Species and gene trees represent how species and individual loci within their genomes evolve from their most recent common ancestors. These trees are central to addressing several questions in biology relating to, among other issues, species conservation, trait evolution and gene function. Consequently, their accurate inference from genomic data is a major endeavor. One approach to their inference is to co-estimate species and gene trees from genome-wide data. Indeed, Bayesian methods based on this approach already exist. However, these methods are very slow, limiting their applicability to datasets with small numbers of taxa. The more commonly used approach is to first infer gene trees individually, and then use gene tree estimates to infer the species tree. Methods in this category rely significantly on the accuracy of the gene trees which is often not high when the dataset includes closely related species. Results: In this work, we introduce a simple, yet effective, iterative method for co-estimating gene and species trees from sequence data of multiple, unlinked loci. In every iteration, the method estimates a species tree, uses it as a generative process to simulate a collection of gene trees, and then selects gene trees for the individual loci from among the simulated gene trees by making use of the sequence data. We demonstrate the accuracy and efficiency of our method on simulated as well as biological data, and compare them to those of existing competing methods. Availability and implementation: The method has been implemented in PhyloNet, which is publicly available at http://bioinfocs.rice.edu/phylonet.
Asunto(s)
Simulación por Computador , Aprendizaje Automático , Teorema de Bayes , Genómica/métodos , Modelos GenéticosRESUMEN
Motivation: Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g. single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method's applicability. Results: In this article, we first demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger datasets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss. Availability and implementation: The methods have been implemented in PhyloNet (http://bioinfocs.rice.edu/phylonet).