Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 54
Filtrar
1.
Bioinformatics ; 39(39 Suppl 1): i177-i184, 2023 06 30.
Artigo em Inglês | MEDLINE | ID: mdl-37387175

RESUMO

The classic quantitative measure of phylogenetic diversity (PD) has been used to address problems in conservation biology, microbial ecology, and evolutionary biology. PD is the minimum total length of the branches in a phylogeny required to cover a specified set of taxa on the phylogeny. A general goal in the application of PD has been identifying a set of taxa of size k that maximize PD on a given phylogeny; this has been mirrored in active research to develop efficient algorithms for the problem. Other descriptive statistics, such as the minimum PD, average PD, and standard deviation of PD, can provide invaluable insight into the distribution of PD across a phylogeny (relative to a fixed value of k). However, there has been limited or no research on computing these statistics, especially when required for each clade in a phylogeny, enabling direct comparisons of PD between clades. We introduce efficient algorithms for computing PD and the associated descriptive statistics for a given phylogeny and each of its clades. In simulation studies, we demonstrate the ability of our algorithms to analyze large-scale phylogenies with applications in ecology and evolutionary biology. The software is available at https://github.com/flu-crew/PD_stats.


Assuntos
Algoritmos , Evolução Biológica , Filogenia , Simulação por Computador , Software
2.
Syst Biol ; 72(5): 1052-1063, 2023 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-37208300

RESUMO

The use of next-generation sequencing technology has enabled phylogenetic studies with hundreds of thousands of taxa. Such large-scale phylogenies have become a critical component in genomic epidemiology in pathogens such as SARS-CoV-2 and influenza A virus. However, detailed phenotypic characterization of pathogens or generating a computationally tractable dataset for detailed phylogenetic analyses requires objective subsampling of taxa. To address this need, we propose parnas, an objective and flexible algorithm to sample and select taxa that best represent observed diversity by solving a generalized k-medoids problem on a phylogenetic tree. parnas solves this problem efficiently and exactly by novel optimizations and adapting algorithms from operations research. For more nuanced selections, taxa can be weighted with metadata or genetic sequence parameters, and the pool of potential representatives can be user-constrained. Motivated by influenza A virus genomic surveillance and vaccine design, parnas can be applied to identify representative taxa that optimally cover the diversity in a phylogeny within a specified distance radius. We demonstrated that parnas is more efficient and flexible than existing approaches. To demonstrate its utility, we applied parnas to 1) quantify SARS-CoV-2 genetic diversity over time, 2) select representative influenza A virus in swine genes derived from over 5 years of genomic surveillance data, and 3) identify gaps in H3N2 human influenza A virus vaccine coverage. We suggest that our method, through the objective selection of representatives in a phylogeny, provides criteria for quantifying genetic diversity that has application in the the rational design of multivalent vaccines and genomic epidemiology. PARNAS is available at https://github.com/flu-crew/parnas.


Assuntos
Vírus da Influenza A Subtipo H3N2 , Vacinas , Animais , Humanos , Suínos , Filogenia , Vírus da Influenza A Subtipo H3N2/genética , Genômica
3.
Bioinformatics ; 38(8): 2144-2152, 2022 04 12.
Artigo em Inglês | MEDLINE | ID: mdl-35150239

RESUMO

MOTIVATION: A phylogenetic network is a powerful model to represent entangled evolutionary histories with both divergent (speciation) and convergent (e.g. hybridization, reassortment, recombination) evolution. The standard approach to inference of hybridization networks is to (i) reconstruct rooted gene trees and (ii) leverage gene tree discordance for network inference. Recently, we introduced a method called RF-Net for accurate inference of virus reassortment and hybridization networks from input gene trees in the presence of errors commonly found in phylogenetic trees. While RF-Net demonstrated the ability to accurately infer networks with up to four reticulations from erroneous input gene trees, its application was limited by the number of reticulations it could handle in a reasonable amount of time. This limitation is particularly restrictive in the inference of the evolutionary history of segmented RNA viruses such as influenza A virus (IAV), where reassortment is one of the major mechanisms shaping the evolution of these pathogens. RESULTS: Here, we expand the functionality of RF-Net that makes it significantly more applicable in practice. Crucially, we introduce a fast extension to RF-Net, called Fast-RF-Net, that can handle large numbers of reticulations without sacrificing accuracy. In addition, we develop automatic stopping criteria to select the appropriate number of reticulations heuristically and implement a feature for RF-Net to output error-corrected input gene trees. We then conduct a comprehensive study of the original method and its novel extensions and confirm their efficacy in practice using extensive simulation and empirical IAV evolutionary analyses. AVAILABILITY AND IMPLEMENTATION: RF-Net 2 is available at https://github.com/flu-crew/rf-net-2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Vírus da Influenza A , Filogenia , Simulação por Computador , Vírus da Influenza A/genética , Evolução Molecular , Modelos Genéticos
4.
Bioinformatics ; 37(22): 4064-4074, 2021 11 18.
Artigo em Inglês | MEDLINE | ID: mdl-34048529

RESUMO

MOTIVATION: The classic multispecies coalescent (MSC) model provides the means for theoretical justification of incomplete lineage sorting-aware species tree inference methods. This has motivated an extensive body of work on phylogenetic methods that are statistically consistent under MSC. One such particularly popular method is ASTRAL, a quartet-based species tree inference method. Novel studies suggest that ASTRAL also performs well when given multi-locus gene trees in simulation studies. Further, Legried et al. recently demonstrated that ASTRAL is statistically consistent under the gene duplication and loss model (GDL). GDL is prevalent in evolutionary histories and is the first core process in the powerful duplication-loss-coalescence evolutionary model (DLCoal) by Rasmussen and Kellis. RESULTS: In this work, we prove that ASTRAL is statistically consistent under the general DLCoal model. Therefore, our result supports the empirical evidence from the simulation-based studies. More broadly, we prove that the quartet-based inference approach is statistically consistent under DLCoal. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Duplicação Gênica , Filogenia , Simulação por Computador
5.
Bioinformatics ; 36(Suppl_2): i668-i674, 2020 12 30.
Artigo em Inglês | MEDLINE | ID: mdl-33381825

RESUMO

MOTIVATION: The evolution of complexity is one of the most fascinating and challenging problems in modern biology, and tracing the evolution of complex traits is an open problem. In bacteria, operons and gene blocks provide a model of tractable evolutionary complexity at the genomic level. Gene blocks are structures of co-located genes with related functions, and operons are gene blocks whose genes are co-transcribed on a single mRNA molecule. The genes in operons and gene blocks typically work together in the same system or molecular complex. Previously, we proposed a method that explains the evolution of orthologous gene blocks (orthoblocks) as a combination of a small set of events that take place in vertical evolution from common ancestors. A heuristic method was proposed to solve this problem. However, no study was done to identify the complexity of the problem. RESULTS: Here, we establish that finding the homologous gene block problem is NP-hard and APX-hard. We have developed a greedy algorithm that runs in polynomial time and guarantees an O(ln⁡n) approximation. In addition, we formalize our problem as an integer linear program problem and solve it using the PuLP package and the standard CPLEX algorithm. Our exploration of several candidate operons reveals that our new method provides more optimal results than the results from the heuristic approach, and is significantly faster. AVAILABILITY AND IMPLEMENTATION: The software and data accompanying this paper are available under the GPLv3 and CC0 license respectively on: https://github.com/nguyenngochuy91/Relevant-Operon.


Assuntos
Genômica , Software , Algoritmos , Bactérias , Biologia Computacional , Dureza
6.
BMC Evol Biol ; 20(Suppl 1): 136, 2020 10 28.
Artigo em Inglês | MEDLINE | ID: mdl-33115401

RESUMO

BACKGROUND: Solving median tree problems under tree reconciliation costs is a classic and well-studied approach for inferring species trees from collections of discordant gene trees. These problems are NP-hard, and therefore are, in practice, typically addressed by local search heuristics. So far, however, such heuristics lack any provable correctness or precision. Further, even for small phylogenetic studies, it has been demonstrated that local search heuristics may only provide sub-optimal solutions. Obviating such heuristic uncertainties are exact dynamic programming solutions that allow solving tree reconciliation problems for smaller phylogenetic studies. Despite these promises, such exact solutions are only suitable for credibly rooted input gene trees, which constitute only a tiny fraction of the readily available gene trees. Standard gene tree inference approaches provide only unrooted gene trees and accurately rooting such trees is often difficult, if not impossible. RESULTS: Here, we describe complex dynamic programming solutions that represent the first nonnaïve exact solutions for solving the tree reconciliation problems for unrooted input gene trees. Further, we show that the asymptotic runtime of the proposed solutions does not increase when compared to the most time-efficient dynamic programming solutions for rooted input trees. CONCLUSIONS: In an experimental evaluation, we demonstrate that the described solutions for unrooted gene trees are, like the solutions for rooted input gene trees, suitable for smaller phylogenetic studies. Finally, for the first time, we study the accuracy of classic local search heuristics for unrooted tree reconciliation problems.


Assuntos
Biologia Computacional/métodos , Modelos Genéticos , Filogenia , Algoritmos , Evolução Molecular , Incerteza
7.
Bioinformatics ; 35(17): 2998-3004, 2019 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-30689726

RESUMO

MOTIVATION: Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Examples of complex molecular systems include protein structures such as the F1Fo-ATPase, the ribosome, or the flagellar motor: each one of these structures requires most or all of its components to function properly. Given the ubiquity of complex systems in the biosphere, understanding the evolution of complexity is central to biology. At the molecular level, operons are classic examples of a complex system. An operon's genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule, and the operon's gene products often form molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to explore the evolution of these complex entities, by identifying possible intermediate states of operons. RESULTS: In this work, we developed a maximum parsimony algorithm to reconstruct ancestral operon states, and show a simple vertical evolution model of how operons may evolve from the individual component genes. We describe several ancestral states that are plausible functional intermediate forms leading to the full operon. We also offer Reconstruction of Ancestral Gene blocks Using Events or ROAGUE as a software tool for those interested in exploring gene block and operon evolution. AVAILABILITY AND IMPLEMENTATION: The software accompanying this paper is available under GPLv3 license on: https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Evolução Molecular , Genoma Bacteriano , Óperon , Bactérias , Software
8.
BMC Bioinformatics ; 15 Suppl 13: S3, 2014.
Artigo em Inglês | MEDLINE | ID: mdl-25434729

RESUMO

BACKGROUND: Evolutionary studies are complicated by discordance between gene trees and the species tree in which they evolved. Dealing with discordant trees often relies on comparison costs between gene and species trees, including the well-established Robinson-Foulds, gene duplication, and deep coalescence costs. While these costs have provided credible results for binary rooted gene trees, corresponding cost definitions for non-binary unrooted gene trees, which are frequently occurring in practice, are challenged by biological realism. RESULT: We propose a natural extension of the well-established costs for comparing unrooted and non-binary gene trees with rooted binary species trees using a binary refinement model. For the duplication cost we describe an efficient algorithm that is based on a linear time reduction and also computes an optimal rooted binary refinement of the given gene tree. Finally, we show that similar reductions lead to solutions for computing the deep coalescence and the Robinson-Foulds costs. CONCLUSION: Our binary refinement of Robinson-Foulds, gene duplication, and deep coalescence costs for unrooted and non-binary gene trees together with the linear time reductions provided here for computing these costs significantly extends the range of trees that can be incorporated into approaches dealing with discordance.


Assuntos
Algoritmos , Evolução Biológica , Duplicação Gênica , Modelos Genéticos , Filogenia
9.
Bioinform Adv ; 4(1): vbae089, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-38911822

RESUMO

Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences. Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes. Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.

10.
J Comput Biol ; 31(4): 312-327, 2024 04.
Artigo em Inglês | MEDLINE | ID: mdl-38634854

RESUMO

Phylogenetic inference and reconstruction methods generate hypotheses on evolutionary history. Competing inference methods are frequently used, and the evaluation of the generated hypotheses is achieved using tree comparison costs. The Robinson-Foulds (RF) distance is a widely used cost to compare the topology of two trees, but this cost is sensitive to tree error and can overestimate tree differences. To overcome this limitation, a refined version of the RF distance called the Cluster Affinity (CA) distance was introduced. However, CA distances are symmetric and cannot compare different types of trees. These asymmetric comparisons occur when gene trees are compared with species trees, when disparate datasets are integrated into a supertree, or when tree comparison measures are used to infer a phylogenetic network. In this study, we introduce a relaxation of the original Affinity distance to compare heterogeneous trees called the asymmetric CA cost. We also develop a biologically interpretable cost, the Cluster Support cost that normalizes by cluster size across gene trees. The characteristics of these costs are similar to the symmetric CA cost. We describe efficient algorithms, derive the exact diameters, and use these to standardize the cost to be applicable in practice. These costs provide objective, fine-scale, and biologically interpretable values that can assess differences and similarities between phylogenetic trees.


Assuntos
Algoritmos , Filogenia , Análise por Conglomerados , Modelos Genéticos , Biologia Computacional/métodos , Evolução Molecular
11.
BMC Bioinformatics ; 13 Suppl 10: S14, 2012 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-22759419

RESUMO

BACKGROUND: Evolutionary methods are increasingly challenged by the wealth of fast growing resources of genomic sequence information. Evolutionary events, like gene duplication, loss, and deep coalescence, account more then ever for incongruence between gene trees and the actual species tree. Gene tree reconciliation is addressing this fundamental problem by invoking the minimum number of gene duplication and losses that reconcile a rooted gene tree with a rooted species tree. However, the reconciliation process is highly sensitive to topological error or wrong rooting of the gene tree, a condition that is not met by most gene trees in practice. Thus, despite the promises of gene tree reconciliation, its applicability in practice is severely limited. RESULTS: We introduce the problem of reconciling unrooted and erroneous gene trees by simultaneously rooting and error-correcting them, and describe an efficient algorithm for this problem. Moreover, we introduce an error-corrected version of the gene duplication problem, a standard application of gene tree reconciliation. We introduce an effective heuristic for our error-corrected version of the gene duplication problem, given that the original version of this problem is NP-hard. Our experimental results suggest that our error-correcting approaches for unrooted input trees can significantly improve on the accuracy of gene tree reconciliation, and the species tree inference under the gene duplication problem. Furthermore, the efficiency of our algorithm for error-correcting reconciliation is capable of handling truly large-scale phylogenetic studies. CONCLUSIONS: Our presented error-correction approach is a crucial step towards making gene tree reconciliation more robust, and thus to improve on the accuracy of applications that fundamentally rely on gene tree reconciliation, like the inference of gene-duplication supertrees.


Assuntos
Algoritmos , Evolução Molecular , Duplicação Gênica , Genômica/métodos , Biologia Computacional/métodos , Modelos Genéticos , Filogenia , Software , Leveduras/genética
12.
BMC Bioinformatics ; 13 Suppl 10: S11, 2012 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-22759416

RESUMO

BACKGROUND: Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results. RESULTS: We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplication and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n2 for searching TBR neighborhoods, where n is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses. CONCLUSIONS: The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.


Assuntos
Algoritmos , Evolução Molecular , Duplicação Gênica , Genômica/métodos , Biologia Computacional/métodos , Genoma , Modelos Teóricos , Filogenia , Leveduras/genética
13.
BMC Bioinformatics ; 13 Suppl 10: S12, 2012 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-22759417

RESUMO

BACKGROUND: To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront the biological processes that create incongruence between gene trees and the species phylogeny. Intra-specific gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, which creates incongruence between gene trees and the species tree. One approach to account for deep coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may be computationally prohibitive. RESULTS: We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide and conquer method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and guaranteeing that the estimated species tree will satisfy the Pareto property. CONCLUSIONS: Analyses of both simulated and empirical data sets demonstrate that the divide and conquer method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, while also guaranteeing that the proposed solution fulfills the Pareto property. The divide and conquer method extends the utility of the deep coalescence problem to data sets with enormous numbers of taxa.


Assuntos
Algoritmos , Biologia Computacional/métodos , Genômica/métodos , Filogenia , Análise por Conglomerados , Simulação por Computador
14.
BMC Bioinformatics ; 13 Suppl 10: S16, 2012 Jun 25.
Artigo em Inglês | MEDLINE | ID: mdl-22759421

RESUMO

BACKGROUND: Biological networks provide fundamental insights into the functional characterization of genes and their products, the characterization of DNA-protein interactions, the identification of regulatory mechanisms, and other biological tasks. Due to the experimental and biological complexity, their computational exploitation faces many algorithmic challenges. RESULTS: We introduce novel weighted quasi-biclique problems to identify functional modules in biological networks when represented by bipartite graphs. In difference to previous quasi-biclique problems, we include biological interaction levels by using edge-weighted quasi-bicliques. While we prove that our problems are NP-hard, we also describe IP formulations to compute exact solutions for moderately sized networks. CONCLUSIONS: We verify the effectiveness of our IP solutions using both simulation and empirical data. The simulation shows high quasi-biclique recall rates, and the empirical data corroborate the abilities of our weighted quasi-bicliques in extracting features and recovering missing interactions from biological networks.


Assuntos
Algoritmos , Biologia Computacional/métodos , Biologia de Sistemas/métodos , Simulação por Computador , Mineração de Dados , Redes Reguladoras de Genes , Redes e Vias Metabólicas , Modelos Teóricos
15.
Syst Biol ; 60(2): 117-25, 2011 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-21186249

RESUMO

Phylogenetic analyses using genome-scale data sets must confront incongruence among gene trees, which in plants is exacerbated by frequent gene duplications and losses. Gene tree parsimony (GTP) is a phylogenetic optimization criterion in which a species tree that minimizes the number of gene duplications induced among a set of gene trees is selected. The run time performance of previous implementations has limited its use on large-scale data sets. We used new software that incorporates recent algorithmic advances to examine the performance of GTP on a plant data set consisting of 18,896 gene trees containing 510,922 protein sequences from 136 plant taxa (giving a combined alignment length of >2.9 million characters). The relationships inferred from the GTP analysis were largely consistent with previous large-scale studies of backbone plant phylogeny and resolved some controversial nodes. The placement of taxa that were present in few gene trees generally varied the most among GTP bootstrap replicates. Excluding these taxa either before or after the GTP analysis revealed high levels of phylogenetic support across plants. The analyses supported magnoliids sister to a eudicot + monocot clade and did not support the eurosid I and II clades. This study presents a nuclear genomic perspective on the broad-scale phylogenic relationships among plants, and it demonstrates that nuclear genes with a history of duplication and loss can be phylogenetically informative for resolving the plant tree of life.


Assuntos
Classificação/métodos , Filogenia , Plantas/classificação , Plantas/genética , Algoritmos , Etiquetas de Sequências Expressas , Genômica
16.
BMC Bioinformatics ; 12 Suppl 1: S15, 2011 Feb 15.
Artigo em Inglês | MEDLINE | ID: mdl-21342544

RESUMO

BACKGROUND: The abundance of new genomic data provides the opportunity to map the location of gene duplication and loss events on a species phylogeny. The first methods for mapping gene duplications and losses were based on a parsimony criterion, finding the mapping that minimizes the number of duplication and loss events. Probabilistic modeling of gene duplication and loss is relatively new and has largely focused on birth-death processes. RESULTS: We introduce a new maximum likelihood model that estimates the speciation and gene duplication and loss events in a gene tree within a species tree with branch lengths. We also provide an, in practice, efficient algorithm that computes optimal evolutionary scenarios for this model. We implemented the algorithm in the program DrML and verified its performance with empirical and simulated data. CONCLUSIONS: In test data sets, DrML finds optimal gene duplication and loss scenarios within minutes, even when the gene trees contain sequences from several hundred species. In many cases, these optimal scenarios differ from the lca-mapping that results from a parsimony gene tree reconciliation. Thus, DrML provides a new, practical statistical framework on which to study gene duplication.


Assuntos
Algoritmos , Evolução Molecular , Duplicação Gênica , Funções Verossimilhança , Simulação por Computador , Especiação Genética , Genoma , Genômica/métodos , Filogenia , Software
17.
BMC Bioinformatics ; 12 Suppl 1: S14, 2011 Feb 15.
Artigo em Inglês | MEDLINE | ID: mdl-21342543

RESUMO

BACKGROUND: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee. RESULTS: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics. CONCLUSIONS: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.


Assuntos
Duplicação Gênica , Filogenia , Plantas/genética , Programação Linear , Algoritmos , Simulação por Computador , Genoma de Planta , Genômica/métodos
18.
J Comput Biol ; 28(8): 758-773, 2021 08.
Artigo em Inglês | MEDLINE | ID: mdl-34125600

RESUMO

The duplication-loss-coalescence (DLC) parsimony model is invaluable for analyzing the complex scenarios of concurrent duplication loss and deep coalescence events in the evolution of gene families. However, inferring such scenarios for already moderately sized families is prohibitive owing to the computational complexity involved. To overcome this stringent limitation, we make the first step by describing a flexible integer linear programming (ILP) formulation for inferring DLC evolutionary scenarios. Then, to make the DLC model more scalable, we introduce four sensibly constrained versions of the model and describe modified versions of our ILP formulation reflecting these constraints. Our simulation studies showcase that our constrained ILP formulations compute evolutionary scenarios that are substantially larger than scenarios computable under our original ILP formulation and the original dynamic programming algorithm by Wu et al. Furthermore, scenarios computed under our constrained DLC models are remarkably accurate compared with corresponding scenarios under the original DLC model, which we also confirm in an empirical study with thousands of gene families.


Assuntos
Biologia Computacional/métodos , Família Multigênica , Algoritmos , Evolução Molecular , Duplicação Gênica , Modelos Genéticos , Filogenia , Programação Linear
19.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2125-2135, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-31150345

RESUMO

Tree reconciliation costs are a popular choice to account for the discordance between the evolutionary history of a gene family (i.e., a gene tree), and the species tree through which this family has evolved. This discordance is accounted for by the minimum number of postulated evolutionary events necessary for reconciling the two trees. Such events include gene duplication, loss, and deep coalescence, and are used to define different types of tree reconciliation costs. For example, the duplication-loss cost for a gene tree and species tree accounts for the minimum number of gene duplications and losses necessary to reconcile these trees. Fundamental to the understanding of how gene trees and species trees relate to each other are the diameters of tree reconciliation costs. While such diameters have been well-researched, still absent from these studies are the unconstrained diameters for two of the classic tree reconciliation costs, namely the duplication-loss cost and the loss cost. Here, we show the essential mathematical properties of these diameters and provide efficient solutions for computing them. Finally, we analyze the distributions of these diameters using simulated datasets.


Assuntos
Biologia Computacional/métodos , Duplicação Gênica/genética , Modelos Genéticos , Evolução Molecular , Filogenia
20.
BMC Bioinformatics ; 11 Suppl 1: S42, 2010 Jan 18.
Artigo em Inglês | MEDLINE | ID: mdl-20122216

RESUMO

BACKGROUND: Genomic data provide a wealth of new information for phylogenetic analysis. Yet making use of this data requires phylogenetic methods that can efficiently analyze extremely large data sets and account for processes of gene evolution, such as gene duplication and loss, incomplete lineage sorting (deep coalescence), or horizontal gene transfer, that cause incongruence among gene trees. One such approach is gene tree parsimony, which, given a set of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, the only existing algorithms for gene tree parsimony under the duplication-loss or deep coalescence reconciliation cost are prohibitively slow for large datasets. RESULTS: We describe novel algorithms for SPR and TBR based local search heuristics under the duplication-loss cost, and we show how they can be adapted for the deep coalescence cost. These algorithms improve upon the best existing algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. We implemented our new SPR based local search algorithm for the duplication-loss cost and demonstrate the tremendous improvement in runtime and scalability it provides compared to existing implementations. We also evaluate the performance of our algorithm on three large-scale genomic data sets. CONCLUSION: Our new algorithms enable, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication-loss and deep coalescence reconciliation costs. Thus, this work expands both the size of data sets and the range of evolutionary models that can be incorporated into genome-scale phylogenetic analyses.


Assuntos
Duplicação Gênica , Genoma , Filogenia , Algoritmos , Evolução Molecular
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA