Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 87
Filtrar
Más filtros

Bases de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Nature ; 629(8013): 851-860, 2024 May.
Artículo en Inglés | MEDLINE | ID: mdl-38560995

RESUMEN

Despite tremendous efforts in the past decades, relationships among main avian lineages remain heavily debated without a clear resolution. Discrepancies have been attributed to diversity of species sampled, phylogenetic method and the choice of genomic regions1-3. Here we address these issues by analysing the genomes of 363 bird species4 (218 taxonomic families, 92% of total). Using intergenic regions and coalescent methods, we present a well-supported tree but also a marked degree of discordance. The tree confirms that Neoaves experienced rapid radiation at or near the Cretaceous-Palaeogene boundary. Sufficient loci rather than extensive taxon sampling were more effective in resolving difficult nodes. Remaining recalcitrant nodes involve species that are a challenge to model due to either extreme DNA composition, variable substitution rates, incomplete lineage sorting or complex evolutionary events such as ancient hybridization. Assessment of the effects of different genomic partitions showed high heterogeneity across the genome. We discovered sharp increases in effective population size, substitution rates and relative brain size following the Cretaceous-Palaeogene extinction event, supporting the hypothesis that emerging ecological opportunities catalysed the diversification of modern birds. The resulting phylogenetic estimate offers fresh insights into the rapid radiation of modern birds and provides a taxon-rich backbone tree for future comparative studies.


Asunto(s)
Aves , Evolución Molecular , Genoma , Filogenia , Animales , Aves/genética , Aves/clasificación , Aves/anatomía & histología , Encéfalo/anatomía & histología , Extinción Biológica , Genoma/genética , Genómica , Densidad de Población , Masculino , Femenino
2.
Bioinformatics ; 40(Suppl 2): ii20-ii28, 2024 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-39230710

RESUMEN

MOTIVATION: Despite the widespread occurrence of polyploids across the Tree of Life, especially in the plant kingdom, very few computational methods have been developed to handle the specific complexities introduced by polyploids in phylogeny estimation. Furthermore, methods that are designed to account for polyploidy often disregard incomplete lineage sorting (ILS), a major source of heterogeneous gene histories, or are computationally very demanding. Therefore, there is a great need for efficient and robust methods to accurately reconstruct polyploid phylogenies. RESULTS: We introduce Polyphest (POLYploid PHylogeny ESTimation), a new method for efficiently and accurately inferring species phylogenies in the presence of both polyploidy and ILS. Polyphest bypasses the need for extensive network space searches by first generating a multilabeled tree based on gene trees, which is then converted into a (uniquely labeled) species phylogeny. We compare the performance of Polyphest to that of two polyploid phylogeny estimation methods, one of which does not account for ILS, namely PADRE, and another that accounts for ILS, namely MPAllopp. Polyphest is more accurate than PADRE and achieves comparable accuracy to MPAllopp, while being significantly faster. We also demonstrate the application of Polyphest to empirical data from the hexaploid bread wheat and confirm the allopolyploid origin of bread wheat along with the closest relatives for each of its subgenomes. AVAILABILITY AND IMPLEMENTATION: Polyphest is available at https://github.com/NakhlehLab/Polyphest.


Asunto(s)
Filogenia , Poliploidía , Programas Informáticos , Algoritmos , Evolución Molecular
3.
Genome Res ; 31(11): 2107-2119, 2021 11.
Artículo en Inglés | MEDLINE | ID: mdl-34426513

RESUMEN

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method has comparable or better accuracy to previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation, it is flexible enough to enable future implementations of various population models.


Asunto(s)
Genética de Población , Modelos Genéticos , Animales , Simulación por Computador , Humanos , Densidad de Población , Recombinación Genética
4.
PLoS Genet ; 17(8): e1009701, 2021 08.
Artículo en Inglés | MEDLINE | ID: mdl-34407067

RESUMEN

Trait evolution among a set of species-a central theme in evolutionary biology-has long been understood and analyzed with respect to a species tree. However, the field of phylogenomics, which has been propelled by advances in sequencing technologies, has ushered in the era of species/gene tree incongruence and, consequently, a more nuanced understanding of trait evolution. For a trait whose states are incongruent with the branching patterns in the species tree, the same state could have arisen independently in different species (homoplasy) or followed the branching patterns of gene trees, incongruent with the species tree (hemiplasy). Another evolutionary process whose extent and significance are better revealed by phylogenomic studies is gene flow between different species. In this work, we present a phylogenomic method for assessing the role of hybridization and introgression in the evolution of polymorphic or monomorphic binary traits. We apply the method to simulated evolutionary scenarios to demonstrate the interplay between the parameters of the evolutionary history and the role of introgression in a binary trait's evolution (which we call xenoplasy). Very importantly, we demonstrate, including on a biological data set, that inferring a species tree and using it for trait evolution analysis in the presence of gene flow could lead to misleading hypotheses about trait evolution.


Asunto(s)
Biología Computacional/métodos , Introgresión Genética/genética , Sitios de Carácter Cuantitativo , Evolución Molecular , Especiación Genética , Modelos Genéticos , Fenotipo , Filogenia
5.
Bioinformatics ; 38(10): 2912-2914, 2022 05 13.
Artículo en Inglés | MEDLINE | ID: mdl-35561189

RESUMEN

SUMMARY: We report on a new single-cell DNA sequence simulator, SimSCSnTree, which generates an evolutionary tree of cells and evolves single nucleotide variants (SNVs) and copy number aberrations (CNAs) along its branches. Data generated by the simulator can be used to benchmark tools for single-cell genomic analyses, particularly in cancer where SNVs and CNAs are ubiquitous. AVAILABILITY AND IMPLEMENTATION: SimSCSnTree is now on BioConda and also is freely available for download at https://github.com/compbiofan/SimSCSnTree.git with detailed documentation.


Asunto(s)
Genoma , Genómica , Secuencia de Bases , Variaciones en el Número de Copia de ADN , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADN , Análisis de la Célula Individual , Programas Informáticos
6.
Bioinformatics ; 38(Suppl 1): i195-i202, 2022 06 24.
Artículo en Inglés | MEDLINE | ID: mdl-35758771

RESUMEN

MOTIVATION: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. RESULTS: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that, Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer data consisting of 32 cells and 3375 loci as well as an scWGS data of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by bulk sequencing dataset and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects some of which were associated with neurodegenerative diseases. AVAILABILITY AND IMPLEMENTATION: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.


Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento , Nucleótidos , Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Humanos , Filogenia , Análisis de Secuencia de ADN
7.
Mol Phylogenet Evol ; 181: 107724, 2023 04.
Artículo en Inglés | MEDLINE | ID: mdl-36720421

RESUMEN

Accurate inference of population parameters plays a pivotal role in unravelling evolutionary histories. While recombination has been universally accepted as a fundamental process in the evolution of sexually reproducing organisms, it remains challenging to model it exactly. Thus, existing coalescent-based approaches make different assumptions or approximations to facilitate phylogenetic inference, which can potentially bring about biases in estimates of evolutionary parameters when recombination is present. In this article, we evaluate the performance of population parameter estimation using three methods-StarBEAST2, SNAPP, and diCal2-that represent three different types of inference. We performed whole-genome simulations in which recombination rates, mutation rates, and levels of incomplete lineage sorting were varied. We show that StarBEAST2 using short or medium-sized loci is robust to realistic rates of recombination, which is in agreement with previous studies. SNAPP, as expected, is generally unaffected by recombination events. Most surprisingly, diCal2, a method that is designed to explicitly account for recombination, performs considerably worse than other methods under comparison.


Asunto(s)
Genoma , Tasa de Mutación , Filogenia , Recombinación Genética , Modelos Genéticos , Simulación por Computador
8.
Syst Biol ; 71(3): 706-720, 2022 04 19.
Artículo en Inglés | MEDLINE | ID: mdl-34605924

RESUMEN

Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this article, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package. [Incomplete lineage sorting; minimizing deep coalescences; multilabeled trees; multispecies network coalescent; phylogenetic networks; polyploidy.].


Asunto(s)
Hibridación Genética , Poliploidía , Evolución Biológica , Humanos , Filogenia
9.
Syst Biol ; 71(2): 367-381, 2022 02 10.
Artículo en Inglés | MEDLINE | ID: mdl-34245291

RESUMEN

Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.[Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.].


Asunto(s)
Duplicación de Gen , Modelos Genéticos , Simulación por Computador , Genoma , Filogenia
10.
PLoS Comput Biol ; 18(6): e1010216, 2022 06.
Artículo en Inglés | MEDLINE | ID: mdl-35675326

RESUMEN

Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and their overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers causes incorrect inferences as to the overall relationship between bacterial taxa.


Asunto(s)
Genoma Bacteriano , Células Procariotas , Transferencia de Gen Horizontal , Genoma Bacteriano/genética , Filogenia , Alineación de Secuencia
11.
Genome Res ; 29(11): 1847-1859, 2019 11.
Artículo en Inglés | MEDLINE | ID: mdl-31628257

RESUMEN

Accumulation and selection of somatic mutations in a Darwinian framework result in intra-tumor heterogeneity (ITH) that poses significant challenges to the diagnosis and clinical therapy of cancer. Identification of the tumor cell populations (clones) and reconstruction of their evolutionary relationship can elucidate this heterogeneity. Recently developed single-cell DNA sequencing (SCS) technologies promise to resolve ITH to a single-cell level. However, technical errors in SCS data sets, including false-positives (FP) and false-negatives (FN) due to allelic dropout, and cell doublets, significantly complicate these tasks. Here, we propose a nonparametric Bayesian method that reconstructs the clonal populations as clusters of single cells, genotypes of each clone, and the evolutionary relationship between the clones. It employs a tree-structured Chinese restaurant process as the prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-site model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and cell doublets are modeled by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental data sets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-site model of evolution leads to more accurate inferences. Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support of the inferences it makes.


Asunto(s)
Células Clonales , Genotipo , Neoplasias/genética , Análisis de la Célula Individual/métodos , Teorema de Bayes , Humanos , Filogenia , Mutación Puntual
12.
Mol Biol Evol ; 37(6): 1809-1818, 2020 06 01.
Artículo en Inglés | MEDLINE | ID: mdl-32077947

RESUMEN

Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data.


Asunto(s)
Modelos Genéticos , Filogenia , Teorema de Bayes , Cadenas de Markov , Método de Montecarlo
13.
PLoS Comput Biol ; 16(7): e1008012, 2020 07.
Artículo en Inglés | MEDLINE | ID: mdl-32658894

RESUMEN

Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods-Ginkgo, HMMcopy, and CopyNumber-on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.


Asunto(s)
Variaciones en el Número de Copia de ADN , Genoma Humano , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual/métodos , Algoritmos , Aberraciones Cromosómicas , Biología Computacional , Simulación por Computador , Dosificación de Gen , Humanos , Mutación , Neoplasias/genética , Ploidias , Distribución de Poisson , Curva ROC , Reproducibilidad de los Resultados , Programas Informáticos
14.
BMC Genomics ; 21(Suppl 2): 219, 2020 Apr 16.
Artículo en Inglés | MEDLINE | ID: mdl-32299348

RESUMEN

BACKGROUND: Multi-locus species phylogeny inference is based on models of sequence evolution on gene trees as well as models of gene tree evolution within the branches of species phylogenies. Almost all statistical methods for this inference task assume a common mechanism across all loci as captured by a single value of each branch length of the species phylogeny. RESULTS: In this paper, we pursue a "no common mechanism" (NCM) model, where every gene tree evolves according to its own parameters of the species phylogeny. Based on this model, we derive an analytically integrated likelihood of both species trees and networks given the gene trees of multiple loci under an NCM model. We demonstrate the performance of inference under this integrated likelihood on both simulated and biological data. CONCLUSIONS: The model presented here will afford opportunities for exploring connections among various criteria for estimating species phylogenies from multiple, independent loci. Furthermore, further development of this model could potentially result in more efficient methods for searching the space of species phylogenies by focusing solely on the topology of the phylogeny.


Asunto(s)
Evolución Molecular , Genómica/métodos , Animales , Simulación por Computador , Culicidae/genética , Especiación Genética , Funciones de Verosimilitud , Modelos Genéticos , Redes Neurales de la Computación , Filogenia , Probabilidad , Testamentos/estadística & datos numéricos
15.
Genome Res ; 27(5): 793-800, 2017 05.
Artículo en Inglés | MEDLINE | ID: mdl-28104618

RESUMEN

Achieving complete, accurate, and cost-effective assembly of human genomes is of great importance for realizing the promise of precision medicine. The abundance of repeats and genetic variations in human genomes and the limitations of existing sequencing technologies call for the development of novel assembly methods that can leverage the complementary strengths of multiple technologies. We propose a Hybrid Structural variant Assembly (HySA) approach that integrates sequencing reads from next-generation sequencing and single-molecule sequencing technologies to accurately assemble and detect structural variants (SVs) in human genomes. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, our approach turns a whole genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance the assembly of structurally altered regions in human genomes. We used data generated from a haploid hydatidiform mole genome (CHM1) and a diploid human genome (NA12878) to test our approach. The result showed that, compared with existing methods, our approach had a low false discovery rate and substantially improved the detection of many types of SVs, particularly novel large insertions, small indels (10-50 bp), and short tandem repeat expansions and contractions. Our work highlights the strengths and limitations of current approaches and provides an effective solution for extending the power of existing sequencing technologies for SV discovery.


Asunto(s)
Mapeo Contig/métodos , Genoma Humano , Variación Estructural del Genoma , Genómica/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Animales , Mapeo Contig/normas , Diploidia , Genómica/normas , Haploidia , Humanos , Ratones , Análisis de Secuencia de ADN/normas , Secuencias Repetidas en Tándem
16.
Bioinformatics ; 35(14): i370-i378, 2019 07 15.
Artículo en Inglés | MEDLINE | ID: mdl-31510688

RESUMEN

MOTIVATION: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. RESULTS: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. AVAILABILITY AND IMPLEMENTATION: We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Filogenia , Evolución Molecular , Genoma , Alineación de Secuencia , Programas Informáticos
17.
PLoS Genet ; 13(2): e1006598, 2017 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-28178269

RESUMEN

[This corrects the article DOI: 10.1371/journal.pgen.1006006.].

18.
Nat Methods ; 13(6): 505-7, 2016 06.
Artículo en Inglés | MEDLINE | ID: mdl-27088313

RESUMEN

Current variant callers are not suitable for single-cell DNA sequencing, as they do not account for allelic dropout, false-positive errors and coverage nonuniformity. We developed Monovar (https://bitbucket.org/hamimzafar/monovar), a statistical method for detecting and genotyping single-nucleotide variants in single-cell data. Monovar exhibited superior performance over standard algorithms on benchmarks and in identifying driver mutations and delineating clonal substructure in three different human tumor data sets.


Asunto(s)
Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Polimorfismo de Nucleótido Simple , Análisis de Secuencia de ADN/métodos , Análisis de la Célula Individual/métodos , Algoritmos , Benchmarking , Línea Celular , Exoma/genética , Humanos , Sensibilidad y Especificidad , Análisis de Secuencia de ADN/estadística & datos numéricos , Análisis de la Célula Individual/estadística & datos numéricos
19.
Bioinformatics ; 34(17): i697-i705, 2018 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-30423064

RESUMEN

Motivation: Species and gene trees represent how species and individual loci within their genomes evolve from their most recent common ancestors. These trees are central to addressing several questions in biology relating to, among other issues, species conservation, trait evolution and gene function. Consequently, their accurate inference from genomic data is a major endeavor. One approach to their inference is to co-estimate species and gene trees from genome-wide data. Indeed, Bayesian methods based on this approach already exist. However, these methods are very slow, limiting their applicability to datasets with small numbers of taxa. The more commonly used approach is to first infer gene trees individually, and then use gene tree estimates to infer the species tree. Methods in this category rely significantly on the accuracy of the gene trees which is often not high when the dataset includes closely related species. Results: In this work, we introduce a simple, yet effective, iterative method for co-estimating gene and species trees from sequence data of multiple, unlinked loci. In every iteration, the method estimates a species tree, uses it as a generative process to simulate a collection of gene trees, and then selects gene trees for the individual loci from among the simulated gene trees by making use of the sequence data. We demonstrate the accuracy and efficiency of our method on simulated as well as biological data, and compare them to those of existing competing methods. Availability and implementation: The method has been implemented in PhyloNet, which is publicly available at http://bioinfocs.rice.edu/phylonet.


Asunto(s)
Simulación por Computador , Aprendizaje Automático , Teorema de Bayes , Genómica/métodos , Modelos Genéticos
20.
Bioinformatics ; 34(13): i376-i385, 2018 07 01.
Artículo en Inglés | MEDLINE | ID: mdl-29950004

RESUMEN

Motivation: Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g. single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method's applicability. Results: In this article, we first demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger datasets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss. Availability and implementation: The methods have been implemented in PhyloNet (http://bioinfocs.rice.edu/phylonet).


Asunto(s)
Alelos , Biología Computacional/métodos , Modelos Genéticos , Filogenia , Programas Informáticos , Evolución Molecular , Probabilidad
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA