*IEEE/ACM Trans Comput Biol Bioinform ; 2019 Oct 14.*

##### RESUMO

Solving median tree problems is a classic approach for inferring species trees from a collection of discordant gene trees. Median tree problems are typically NP-hard and dealt with by local search heuristics. Unfortunately, such heuristics generally lack provable correctness and precision. Algorithmic advances addressing this uncertainty have led to exact dynamic programming formulations suitable to solve a well-studied group of median tree problems for smaller phylogenetic analyses. However, these formulations allow computing only very few optimal species trees out of possibly many such trees, and phylogenetic studies often require the analysis of all optimal solutions through their consensus tree. Here, we describe a significant algorithmic modification of the dynamic programming formulations that compute the cluster counts of all optimal species trees from which various types of consensus trees can be efficiently computed. Through experimental studies, we demonstrate that our parallel implementation of the modified dynamic programming formulation is more efficient than a previous implementation of the original formulation. Finally, we show that the parallel implementation can rapidly identify novel reassorted influenza A viruses potentially facilitating pandemic preparedness efforts.

*IEEE/ACM Trans Comput Biol Bioinform ; 2019 May 28.*

##### RESUMO

Tree reconciliation costs are a popular choice to account for the discordance between the evolutionary history of a gene family (i.e., a gene tree), and the species tree through which this family has evolved. This discordance is accounted for by the minimum number of postulated evolutionary events necessary for reconciling the two trees. Such events include gene duplication, loss, and deep coalescence, and are used to define different types of tree reconciliation costs. For example, the duplication-loss cost for a gene tree and species tree accounts for the minimum number of gene duplications and losses necessary to reconcile these trees. Fundamental to the understanding of how gene trees and species trees relate to each other are the diameters of tree reconciliation costs. While such diameters have been well-researched, still absent from these studies are the unconstrained diameters for two of the classic tree reconciliation costs, namely the duplication-loss cost and the loss cost. Here, we show the essential mathematical properties of these diameters and provide efficient solutions for computing them. Finally, we analyze the distributions of these diameters using simulated datasets.

*Bioinformatics ; 35(17): 2998-3004, 2019 Sep 01.*

##### RESUMO

MOTIVATION: Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Examples of complex molecular systems include protein structures such as the F1Fo-ATPase, the ribosome, or the flagellar motor: each one of these structures requires most or all of its components to function properly. Given the ubiquity of complex systems in the biosphere, understanding the evolution of complexity is central to biology. At the molecular level, operons are classic examples of a complex system. An operon's genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule, and the operon's gene products often form molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to explore the evolution of these complex entities, by identifying possible intermediate states of operons. RESULTS: In this work, we developed a maximum parsimony algorithm to reconstruct ancestral operon states, and show a simple vertical evolution model of how operons may evolve from the individual component genes. We describe several ancestral states that are plausible functional intermediate forms leading to the full operon. We also offer Reconstruction of Ancestral Gene blocks Using Events or ROAGUE as a software tool for those interested in exploring gene block and operon evolution. AVAILABILITY AND IMPLEMENTATION: The software accompanying this paper is available under GPLv3 license on: https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

*IEEE/ACM Trans Comput Biol Bioinform ; 16(5): 1459-1470, 2019.*

##### RESUMO

Median tree inference under path-difference metrics has shown great promise for large-scale phylogeny estimation. Similar to these metrics is the family of cophenetic metrics that originates from a classic dendrogram comparison method introduced more than 50 years ago. Despite the appeal of this family of metrics, the problem of computing median trees under cophenetic metrics has not been analyzed. Like other standard median tree problems relevant in practice, as we show here, this problem is also NP-hard. NP-hard median tree problems have been successfully addressed by local search heuristics that are solving thousands of instances of a corresponding (local neighborhood) search problem. For the local neighborhood search problem under a cophenetic metric, the best known (naïve) algorithm has a time complexity that is typically prohibitive for effective heuristic searches. Building on the pioneering work on path-difference median trees, we develop efficient algorithms for Manhattan and Euclidean cophenetic search problems that improve on the naïve solution by a linear and a quadratic factor, respectively. We demonstrate the performance and effectiveness of the resulting heuristic methods in a comparative study using benchmark empirical datasets.

##### Assuntos

Biologia Computacional/métodos , Modelos Genéticos , Filogenia , Algoritmos , Bases de Dados Genéticas , Técnicas Genéticas , Heurística*IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1063-1076, 2019.*

##### RESUMO

Median tree problems are powerful tools for inferring large-scale phylogenetic trees that hold enormous promise for society at large. Such problems seek a median tree for a given collection of input trees under some problem-specific distance. Here, we introduce a median tree problem under the classic Manhattan path-difference distance. We show that this problem is NP-hard, devise an ILP formulation, and provide an effective local search heuristic that is based on solving a local search problem exactly. Our algorithm for the local search problem improves asymptotically by a factor of $n$n on the best-known (naïve) solution, where $n$n is the overall number of taxa in the input trees. Finally, comparative phylogenetic studies using considerably large empirical data and an accuracy analysis for smaller phylogenetic trees reveal the ability of our novel heuristic.

##### Assuntos

Biologia Computacional/métodos , Filogenia , Software , Algoritmos , Animais , Classificação , Análise por Conglomerados , Evolução Molecular , Marsupiais , Modelos Genéticos , Linguagens de Programação*IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1374-1385, 2019.*

##### RESUMO

Synthesizing large-scale phylogenetic trees is a fundamental problem in evolutionary biology. Median tree problems have evolved as a powerful tool to reconstruct such trees. Such problems seek a median tree for a given collection of input trees under some problem-specific tree distance. There has been an increased interest in the median tree problem for the classical path-difference distance between trees. While this problem is NP-hard, standard local search heuristics have been described that are based on solving a local search problem exactly. For a more effective heuristic we devise a time efficient algorithm for the local search problem that improves on the best-known solution by a factor of $n$n, where $n$n is the size of the input trees. Furthermore, we introduce a novel hybrid version of the standard local search that is exploiting our new algorithm for a more refined heuristic search. Finally, we demonstrate the performance of our hybrid heuristic in a comparative study with other commonly used methods that synthesize species trees using published empirical data sets.

##### Assuntos

Biologia Computacional/métodos , Filogenia , Algoritmos , Animais , Simulação por Computador , Evolução Molecular , Internet , Marsupiais , Modelos Genéticos , Software , Especificidade da Espécie*J Bioinform Comput Biol ; 16(5): 1802004, 2018 10.*

##### Assuntos

Biologia Computacional , Genômica/métodos , Algoritmos , Congressos como Assunto , Genoma Humano , Humanos , Mutação , Filogenia , Proteínas/genética*IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1723-1727, 2018.*

##### RESUMO

Synthesizing median trees from a collection of gene trees under the biologically motivated gene tree parsimony (GTP) costs has provided credible species tree estimates. GTP costs are defined for each of the classic evolutionary processes. These costs count the minimum number of events necessary to reconcile the gene tree with the species tree where the leaf-genes are mapped to the leaf-species through a function called labeling. To better understand the synthesis of median trees under these costs, there is an increased interest in analyzing their diameters. The diameters of a GTP cost between a gene tree and a species tree are the maximum values of this cost of one or both topologies of the trees involved. We are concerned about the diameters of the GTP costs under bijective labelings. While these diameters are linear time computable for the gene duplication and deep coalescence costs, this has been unknown for the classic gene duplication and loss, and for the loss cost. For the first time, we show how to compute these diameters and proof that this can be achieved in linear time, and thus, completing the computational time analysis for all of the bijective diameters under the GTP costs.

##### Assuntos

Biologia Computacional/métodos , Evolução Molecular , Modelos Genéticos , Filogenia , Duplicação Gênica*J Bioinform Comput Biol ; 15(6): 1702004, 2017 12.*

##### Assuntos

Biologia Computacional/métodos , Metagenoma , Filogenia , Algoritmos , Congressos como Assunto , RNA/química*J Bioinform Comput Biol ; 15(3): 1740002, 2017 Jun.*

##### RESUMO

Supertree problems are a standard tool for synthesizing large-scale species trees from a given collection of gene trees under some problem-specific objective. Unfortunately, these problems are typically NP-hard, and often remain so when their instances are restricted to rooted gene trees sampled from the same species. While a class of restricted supertree problems has been effectively addressed by the parameterized strict consensus approach, in practice, most gene trees are unrooted and sampled from different species. Here, we overcome this stringent limitation by describing efficient algorithms that are adopting the strict consensus approach to also handle unrestricted supertree problems. Finally, we demonstrate the performance of our algorithms in a comparative study with classic supertree heuristics using simulated and empirical data sets.

##### Assuntos

Algoritmos , Filogenia , Biologia Computacional/métodos , Simulação por Computador , Duplicação Gênica*IEEE/ACM Trans Comput Biol Bioinform ; 14(5): 1002-1012, 2017.*

##### RESUMO

The minimizing-deep-coalescence (MDC) approach infers a median (species) tree for a given set of gene trees under the deep coalescence cost. This cost accounts for the minimum number of deep coalescences needed to reconcile a gene tree with a species tree where the leaf-genes are mapped to the leaf-species through a function called leaf labeling. In order to better understand the MDC approach we investigate here the diameter of a gene tree, which is an important property of the deep coalescence cost. This diameter is the maximal deep coalescence costs for a given gene tree under all leaf labelings for each possible species tree topology. While we prove that this diameter is generally infinite, this result relies on the diameter's unrealistic assumption that species trees can be of infinite size. Providing a more practical definition, we introduce a natural extension of the gene tree diameter that constrains the species tree size by a given constant. For this new diameter, we describe an exact formula, present a complete classification of the trees yielding this diameter, derive formulas for its mean and variance, and demonstrate its ability using comparative studies.

##### Assuntos

Biologia Computacional/métodos , Evolução Molecular , Genes/genética , Especiação Genética , Modelos Genéticos , Algoritmos , Filogenia*J Bioinform Comput Biol ; 14(3): 1642005, 2016 06.*

##### RESUMO

Solving the gene duplication problem is a classical approach for species tree inference from gene trees that are confounded by gene duplications. This problem takes a collection of gene trees and seeks a species tree that implies the minimum number of gene duplications. Wilkinson et al. posed the conjecture that the gene duplication problem satisfies the desirable Pareto property for clusters. That is, for every instance of the problem, all clusters that are commonly present in the input gene trees of this instance, called strict consensus, will also be found in every solution to this instance. We prove that this conjecture does not generally hold. Despite this negative result we show that the gene duplication problem satisfies a weaker version of the Pareto property where the strict consensus is found in at least one solution (rather than all solutions). This weaker property contributes to our design of an efficient scalable algorithm for the gene duplication problem. We demonstrate the performance of our algorithm in analyzing large-scale empirical datasets. Finally, we utilize the algorithm to evaluate the accuracy of standard heuristics for the gene duplication problem using simulated datasets.

##### Assuntos

Algoritmos , Duplicação Gênica , Biologia Computacional/métodos , Modelos Genéticos*IEEE/ACM Trans Comput Biol Bioinform ; 12(1): 155-65, 2015.*

##### RESUMO

The deep coalescence cost accounts for discord caused by deep coalescence between a gene tree and a species tree. It is a major concern that the diameter of a gene tree (the tree's maximum deep coalescence cost across all species trees) depends on its topology, which can largely obfuscate phylogenetic studies. While this bias can be compensated by normalizing the deep coalescence cost using diameters, obtaining them efficiently has been posed as an open problem by Than and Rosenberg. Here, we resolve this problem by describing a linear time algorithm to compute the diameter of a gene tree. In addition, we provide a complete classification of the species trees yielding this diameter to guide phylogenetic analyses.

##### Assuntos

Biologia Computacional/métodos , Modelos Genéticos , Filogenia , Evolução Molecular*BMC Bioinformatics ; 15 Suppl 13: S1, 2014.*

*BMC Bioinformatics ; 15 Suppl 13: S3, 2014.*

##### RESUMO

BACKGROUND: Evolutionary studies are complicated by discordance between gene trees and the species tree in which they evolved. Dealing with discordant trees often relies on comparison costs between gene and species trees, including the well-established Robinson-Foulds, gene duplication, and deep coalescence costs. While these costs have provided credible results for binary rooted gene trees, corresponding cost definitions for non-binary unrooted gene trees, which are frequently occurring in practice, are challenged by biological realism. RESULT: We propose a natural extension of the well-established costs for comparing unrooted and non-binary gene trees with rooted binary species trees using a binary refinement model. For the duplication cost we describe an efficient algorithm that is based on a linear time reduction and also computes an optimal rooted binary refinement of the given gene tree. Finally, we show that similar reductions lead to solutions for computing the deep coalescence and the Robinson-Foulds costs. CONCLUSION: Our binary refinement of Robinson-Foulds, gene duplication, and deep coalescence costs for unrooted and non-binary gene trees together with the linear time reductions provided here for computing these costs significantly extends the range of trees that can be incorporated into approaches dealing with discordance.

##### Assuntos

Algoritmos , Evolução Biológica , Duplicação Gênica , Modelos Genéticos , Filogenia*IEEE/ACM Trans Comput Biol Bioinform ; 11(1): 231-42, 2014.*

##### RESUMO

The minimizing deep coalescence (MDC) problem seeks a species tree that reconciles the given gene trees with the minimum number of deep coalescence events, called deep coalescence (DC) cost. To better assess MDC species trees we investigate into a basic mathematical property of the DC cost, called the diameter. Given a gene tree, a species tree, and a leaf labeling function that assigns leaf-genes of the gene tree to a leaf-species in the species tree from which they were sampled, the DC cost describes the discordance between the trees caused by deep coalescence events. The diameter of a gene tree and a species tree is the maximum DC cost across all leaf labelings for these trees. We prove fundamental mathematical properties describing precisely these diameters for bijective and general leaf labelings, and present efficient algorithms to compute the diameters and their corresponding leaf labelings. In particular, we describe an optimal, i.e., linear time, algorithm for the bijective case. Finally, in an experimental study we demonstrate that the average diameters between a gene tree and a species tree grow significantly slower than their naive upper bounds, suggesting that our exact bounds can significantly improve on assessing DC costs when using diameters.

##### Assuntos

Biologia Computacional/métodos , Evolução Molecular , Genes/genética , Modelos Genéticos , Análise de Sequência de DNA/métodos*IEEE/ACM Trans Comput Biol Bioinform ; 11(3): 453-4, 2014.*

*J Comput Biol ; 21(1): 89-98, 2014 Jan.*

##### RESUMO

DrML is a software program for inferring evolutionary scenarios from a gene tree and a species tree with speciation time estimates that is based on a general maximum likelihood model. The program implements novel algorithms that efficiently infer most likely scenarios of gene duplication and loss events. Our comparative studies suggest that the general maximum likelihood model provides more credible estimates than standard parsimony reconciliation, especially when speciation times differ significantly. DrML is an open source project written in Python, and along with an on-line manual and sample data sets publicly available.

##### Assuntos

Duplicação Gênica , Modelos Genéticos , Software , Animais , Biologia Computacional , Evolução Molecular , Humanos , Funções Verossimilhança , Modelos Estatísticos , Filogenia*IEEE/ACM Trans Comput Biol Bioinform ; 10(4): 939-56, 2013.*

##### RESUMO

The use of genomic data sets for phylogenetics is complicated by the fact that evolutionary processes such as gene duplication and loss, or incomplete lineage sorting (deep coalescence) cause incongruence among gene trees. One well-known approach that deals with this complication is gene tree parsimony, which, given a collection of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, a lack of efficient algorithms has limited the use of this approach. Here, we present efficient algorithms for SPR and TBR-based local search heuristics for gene tree parsimony under the 1) duplication, 2) loss, 3) duplication-loss, and 4) deep coalescence reconciliation costs. These novel algorithms improve upon the time complexities of previous algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. Our algorithms provide a substantial improvement in runtime and scalability compared to previous implementations and enable large-scale gene tree parsimony analyses using any of the four reconciliation costs. Our algorithms have been implemented in the software packages DupTree and iGTP, and have already been used to perform several compelling phylogenetic studies.