ABSTRACT
MOTIVATION: Simulation is an essential technique for generating biomolecular data with a 'known' history for use in validating phylogenetic inference and other evolutionary methods. On longer time scales, simulation supports investigations of equilibrium behavior and provides a formal framework for testing competing evolutionary hypotheses. Twenty years of molecular evolution research have produced a rich repertoire of simulation methods. However, current models do not capture the stringent constraints acting on the domain insertions, duplications, and deletions by which multidomain architectures evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations are observed in nature. Modeling these stringent constraints on domain order and co-occurrence is a fundamental challenge in domain architecture simulation that does not arise with sequence and gene family simulation. RESULTS: Here, we introduce a stochastic model of domain architecture evolution to simulate evolutionary trajectories that reflect the constraints on domain order and co-occurrence observed in nature. This framework is implemented in a novel domain architecture simulator, DomArchov, using the Metropolis-Hastings algorithm with data-driven transition probabilities. The use of a data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Using empirical evaluation with metazoan datasets, we demonstrate that domain architectures simulated by DomArchov recapitulate properties of genuine domain architectures that reflect the constraints on domain order and adjacency seen in nature. This work expands the realm of evolutionary processes that are amenable to simulation. AVAILABILITY AND IMPLEMENTATION: DomArchov is written in Python 3 and is available at http://www.cs.cmu.edu/~durand/DomArchov. 
The data underlying this article are available via the same link. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
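As a rough illustration of the Metropolis-Hastings approach named in this abstract, the sketch below samples toy domain architectures through single-domain insertions and deletions. It is not DomArchov's code or model: the domain alphabet, the adjacency weights in `PAIR_WEIGHT`, the length cap `MAX_LEN`, and the proposal scheme are all assumptions made for the example. A real simulator would presumably derive its weights from observed architectures.

```python
import random

# Hypothetical domain alphabet and adjacency weights (illustrative values only).
DOMAINS = ("PDZ", "SH3", "GK")
MAX_LEN = 4
PAIR_WEIGHT = {("PDZ", "PDZ"): 3.0, ("PDZ", "SH3"): 5.0, ("SH3", "GK"): 8.0}
DEFAULT_WEIGHT = 0.1  # assumed weight for adjacent pairs not listed above


def weight(arch):
    """Unnormalized target weight: product of weights of adjacent domain pairs."""
    w = 1.0
    for left, right in zip(arch, arch[1:]):
        w *= PAIR_WEIGHT.get((left, right), DEFAULT_WEIGHT)
    return w


def proposal_prob(x, y):
    """Probability q(x -> y) of proposing y from x, for x != y."""
    n = len(x)
    if len(y) == n + 1 and n < MAX_LEN:
        # Count the insertion positions that turn x into y.
        ways = sum(1 for i in range(n + 1) if y[:i] + y[i + 1:] == x)
        return 0.5 * ways / ((n + 1) * len(DOMAINS))
    if len(y) == n - 1 and n > 1:
        # Count the deletion positions that turn x into y.
        ways = sum(1 for i in range(n) if x[:i] + x[i + 1:] == y)
        return 0.5 * ways / n
    return 0.0


def propose(x, rng):
    """Propose an insertion or a deletion with equal probability."""
    n = len(x)
    if rng.random() < 0.5:
        if n == MAX_LEN:
            return x  # insertion not allowed: propose staying put
        i = rng.randrange(n + 1)
        return x[:i] + (rng.choice(DOMAINS),) + x[i:]
    if n == 1:
        return x  # deletion not allowed: propose staying put
    i = rng.randrange(n)
    return x[:i] + x[i + 1:]


def acceptance(x, y):
    """Metropolis-Hastings acceptance probability for the move x -> y."""
    if x == y:
        return 1.0
    ratio = (weight(y) * proposal_prob(y, x)) / (weight(x) * proposal_prob(x, y))
    return min(1.0, ratio)


def simulate(start, steps, seed=0):
    """Return the trajectory of architectures visited by the chain."""
    rng = random.Random(seed)
    x = start
    trajectory = [x]
    for _ in range(steps):
        y = propose(x, rng)
        if rng.random() < acceptance(x, y):
            x = y
        trajectory.append(x)
    return trajectory


trajectory = simulate(("PDZ",), steps=1000)
print(trajectory[-1])
```

Because inserting a domain next to an identical one can produce the same architecture in more than one way, `proposal_prob` counts those ways explicitly so that the Hastings ratio stays exact for this asymmetric proposal.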
Subject(s)
Molecular Evolution, Proteins, Algorithms, Animals, Computer Simulation, Phylogeny, Proteins/genetics
ABSTRACT
Motivation: Orthology analysis is a fundamental tool in comparative genomics. Sophisticated methods have been developed to distinguish between orthologs and paralogs and to classify paralogs into subtypes depending on the duplication mechanism and timing, relative to speciation. However, no comparable framework exists for xenologs: gene pairs whose history, since their divergence, includes a horizontal transfer. Further, the diversity of gene pairs that meet this broad definition calls for classification of xenologs with similar properties into subtypes. Results: We present a xenolog classification that uses phylogenetic reconciliation to assign each pair of genes to a class based on the event responsible for their divergence and the historical association between genes and species. Our classes distinguish between genes related through transfer alone and genes related through duplication and transfer. Further, they separate closely-related genes in distantly-related species from distantly-related genes in closely-related species. We present formal rules that assign gene pairs to specific xenolog classes, given a reconciled gene tree with an arbitrary number of duplications and transfers. These xenology classification rules have been implemented in software and tested on a collection of ~13 000 prokaryotic gene families. In addition, we present a case study demonstrating the connection between xenolog classification and gene function prediction. Availability and Implementation: The xenolog classification rules have been implemented in Notung 2.9, a freely available phylogenetic reconciliation software package, available at http://www.cs.cmu.edu/~durand/Notung. Gene trees are available at http://dx.doi.org/10.7488/ds/1503. Contact: durand@cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Bacterial Genes, Genomics/methods, Phylogeny, Software, Algorithms, Bacteria/genetics, Molecular Evolution, Nucleic Acid Sequence Homology
ABSTRACT
Gene functions, interactions, disease associations, and ecological distributions are all correlated with gene age. However, it is challenging to estimate the intricate series of evolutionary events leading to a modern-day gene and then to reduce this history to a single age estimate. Focusing on eukaryotic gene families, we introduce a framework that can be used to compare current strategies for quantifying gene age, discuss key differences between these methods, and highlight several common problems. We argue that genes with complex evolutionary histories do not have a single well-defined age. As a result, care must be taken to articulate the goals and assumptions of any analysis that uses gene age estimates. Recent algorithmic advances offer the promise of gene age estimates that are fast, accurate, and consistent across gene families. This will enable a shift to integrated genome-wide analyses of all events in gene evolutionary histories in the near future.
Subject(s)
Molecular Evolution, Genes/physiology, Genetic Models, Paleontology/methods, Computational Biology, Genetic Databases, Phylogeny
ABSTRACT
BACKGROUND: Reconstructing evolution provides valuable insights into the processes of gene evolution and function. However, while there have been great advances in algorithms and software to reconstruct the history of gene families, these tools do not model the domain shuffling events (domain duplication, insertion, transfer, and deletion) that drive the evolution of multidomain protein families. Protein evolution through domain shuffling events allows for rapid exploration of functions by introducing new combinations of existing folds. This powerful mechanism was key to some significant evolutionary innovations, such as multicellularity and the vertebrate immune system. A method for reconstructing this important evolutionary process is urgently needed. RESULTS: Here, we introduce a novel, event-based framework for studying multidomain evolution by reconciling a domain tree with a gene tree, with additional information provided by the species tree. In the context of this framework, we present the first reconciliation algorithms to infer domain shuffling events, while addressing the challenges inherent in the inference of evolution across three levels of organization. CONCLUSIONS: We apply these methods to the evolution of domains in the Membrane-associated Guanylate Kinase family. These case studies reveal a more vivid and detailed evolutionary history than previously provided. Our algorithms have been implemented in software, freely available at http://www.cs.cmu.edu/~durand/Notung.
Subject(s)
Algorithms, Molecular Evolution, Guanylate Kinases/genetics, Multigene Family, Phylogeny, Software, Animals, Gene Duplication, Protein Tertiary Structure, Vertebrates/genetics
ABSTRACT
BACKGROUND: Phylogenetic birth-death models are opening a new window on the processes of genome evolution in studies of the evolution of gene and protein families, protein-protein interaction networks, microRNAs, and copy number variation. Given a species tree and a set of genomic characters in present-day species, the birth-death approach estimates the most likely rates required to explain the observed data and returns the expected ancestral character states and the history of character state changes. Achieving a balance between model complexity and generalizability is a fundamental challenge in the application of birth-death models. While more parameters promise greater accuracy and more biologically realistic models, increasing model complexity can lead to overfitting and a heavy computational cost. RESULTS: Here we present a systematic, empirical investigation of these tradeoffs, using protein domain families in six metazoan genomes as a case study. We compared models of increasing complexity, implemented in the Count program, with respect to model fit, robustness, and stability. In addition, we used a bootstrapping procedure to assess estimator variability. The results show that the most complex model, which allows for both branch-specific and family-specific rate variation, achieves the best fit, without overfitting. Variance remains low with increasing complexity, except for family-specific loss rates. This variance is reduced when the number of discrete rate categories is increased. CONCLUSIONS: The work presented here evaluates model choice for genomic birth-death models in a systematic way and presents the first use of bootstrapping to assess estimator variance in birth-death models. We find that a model incorporating both lineage and family rate variation yields more accurate estimators without sacrificing generality. 
Our results indicate that model choice can lead to fundamentally different evolutionary conclusions, emphasizing the importance of more biologically realistic and complex models.
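The bootstrapping step mentioned in this abstract can be illustrated with a generic nonparametric bootstrap: resample the families with replacement, recompute the estimator on each resample, and report the spread of the resulting estimates. This is a minimal sketch under stated assumptions, not the procedure used with the Count program. The function name `bootstrap_std`, the toy per-family loss counts, and the use of a simple mean as the estimator are all hypothetical stand-ins for a maximum-likelihood rate estimate.

```python
import random
import statistics


def bootstrap_std(samples, estimator, replicates=1000, seed=0):
    """Estimate the standard deviation of `estimator` by resampling `samples`.

    Each replicate draws len(samples) items with replacement and applies the
    estimator; the spread of the replicate estimates approximates its variability.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(replicates):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


# Hypothetical data: number of inferred losses in each of ten domain families.
losses_per_family = [0, 1, 1, 2, 0, 5, 3, 1, 0, 2]
print(bootstrap_std(losses_per_family, statistics.mean))
```

Resampling whole families treats each family as an independent observation, which is an assumption of this sketch and not a claim about the study's design.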
Subject(s)
Molecular Evolution, Genome, Genomics/methods, Genetic Models, Phylogeny
ABSTRACT
MOTIVATION: Gene duplication (D), transfer (T), loss (L) and incomplete lineage sorting (I) are crucial to the evolution of gene families and the emergence of novel functions. The history of these events can be inferred via comparison of gene and species trees, a process called reconciliation, yet current reconciliation algorithms model only a subset of these evolutionary processes. RESULTS: We present an algorithm to reconcile a binary gene tree with a non-binary species tree under a DTLI parsimony criterion. This is the first reconciliation algorithm to capture all four evolutionary processes driving tree incongruence and the first to reconcile non-binary species trees with a transfer model. Our algorithm infers all optimal solutions and reports complete, temporally feasible event histories, giving the gene and species lineages in which each event occurred. It is fixed-parameter tractable, with polytime complexity when the maximum species outdegree is fixed. Application of our algorithms to prokaryotic and eukaryotic data shows that use of an incomplete event model has substantial impact on the events inferred and resulting biological conclusions. AVAILABILITY: Our algorithms have been implemented in Notung, a freely available phylogenetic reconciliation software package, available at http://www.cs.cmu.edu/~durand/Notung. CONTACT: mstolzer@andrew.cmu.edu.
Subject(s)
Algorithms, Molecular Evolution, Multigene Family, Gene Duplication, Horizontal Gene Transfer, Genetic Models, Phylogeny, Software
ABSTRACT
Homotypic membrane fusion catalyzed by the atlastin (ATL) GTPase sustains the branched endoplasmic reticulum (ER) network in metazoans. Our recent discovery that two of the three human ATL paralogs (ATL1/2) are C-terminally autoinhibited implied that relief of autoinhibition would be integral to the ATL fusion mechanism. An alternative hypothesis is that the third paralog ATL3 promotes constitutive ER fusion with relief of ATL1/2 autoinhibition used conditionally. However, published studies suggest ATL3 is a weak fusogen at best. Contrary to expectations, we demonstrate here that purified human ATL3 catalyzes efficient membrane fusion in vitro and is sufficient to sustain the ER network in triple knockout cells. Strikingly, ATL3 lacks any detectable C-terminal autoinhibition, like the invertebrate Drosophila ATL ortholog. Phylogenetic analysis of ATL C-termini indicates that C-terminal autoinhibition is a recent evolutionary innovation. We suggest that ATL3 is a constitutive ER fusion catalyst and that ATL1/2 autoinhibition likely evolved in vertebrates as a means of upregulating ER fusion activity on demand.
Subject(s)
GTP Phosphohydrolases, Membrane Fusion, Animals, Humans, Drosophila, GTP Phosphohydrolases/genetics, Phylogeny
ABSTRACT
The exon shuffling theory posits that intronic recombination creates new domain combinations, facilitating the evolution of novel protein function. This theory predicts that introns will be preferentially situated near domain boundaries. Many studies have sought evidence for exon shuffling by testing the correspondence between introns and domain boundaries against chance intron positioning. Here, we present an empirical investigation of how the choice of null model influences significance. Although genome-wide studies have used a uniform null model exclusively, more realistic null models have been proposed for single gene studies. We extended these models for genome-wide analyses and applied them to 21 metazoan and fungal genomes. Our results show that compared with the other two models, the uniform model does not recapitulate genuine exon lengths, dramatically underestimates the probability of chance agreement, and overestimates the significance of intron-domain correspondence by as much as 100 orders of magnitude. Model choice had much greater impact on the assessment of exon shuffling in fungal genomes than in metazoa, leading to different evolutionary conclusions in seven of the 16 fungal genomes tested. Genome-wide studies that use this overly permissive null model may exaggerate the importance of exon shuffling as a general mechanism of multidomain evolution.
Subject(s)
Genome-Wide Association Study, Genome, Animals, Molecular Evolution, Exons, Introns, Proteins
ABSTRACT
Streptococcus pneumoniae (pneumococcus) displays broad tissue tropism and infects multiple body sites in the human host. However, infections of the conjunctiva are limited to strains within a distinct phyletic group with multilocus sequence types ST448, ST344, ST1186, ST1270, and ST2315. In this study, we sequenced the genomes of six pneumococcal strains isolated from eye infections. The conjunctivitis isolates are grouped in a distinct phyletic group together with a subset of nasopharyngeal isolates. The keratitis (infection of the cornea) and endophthalmitis (infection of the vitreous body) isolates are grouped with the remainder of pneumococcal strains. Phenotypic characterization is consistent with morphological differences associated with the distinct phyletic group. Specifically, isolates from the distinct phyletic group form aggregates in planktonic cultures and chain-like structures in biofilms grown on abiotic surfaces. To begin to investigate the association between genotype and epidemiology, we focused on a predicted surface-exposed adhesin (SspB) encoded exclusively by this distinct phyletic group. Phylogenetic analysis of the gene encoding SspB in the context of a streptococcal species tree suggests that sspB was acquired by lateral gene transfer from Streptococcus suis. Furthermore, an sspB deletion mutant displays decreased adherence to cultured cells from the ocular epithelium compared to the isogenic wild-type and complemented strains. Together these findings suggest that acquisition of genes from outside the species has contributed to pneumococcal tissue tropism by enhancing the ability of a subset of strains to infect the ocular epithelium causing conjunctivitis. IMPORTANCE Changes in the gene content of pathogens can modify their ability to colonize and/or survive in different body sites in the human host. In this study, we investigate a gene acquisition event and its role in the pathogenesis of Streptococcus pneumoniae (pneumococcus).
Our findings suggest that the gene encoding the predicted surface protein SspB has been transferred from Streptococcus suis (a distantly related streptococcal species) into a distinct set of pneumococcal strains. This group of strains distinguishes itself from the remainder of pneumococcal strains by extensive differences in genomic composition and by the ability to cause conjunctivitis. We find that the presence of sspB increases adherence of pneumococcus to the ocular epithelium. Thus, our data support the hypothesis that a subset of pneumococcal strains has gained genes from neighboring species that enhance their ability to colonize the epithelium of the eye, thus expanding into a new niche.
ABSTRACT
Reconciliation extracts information from the topological incongruence between gene and species trees to infer duplications and losses in the history of a gene family. The inferred duplication-loss histories provide valuable information for a broad range of biological applications, including ortholog identification, estimating gene duplication times, and rooting and correcting gene trees. While reconciliation for binary trees is a tractable and well-studied problem, there are no algorithms for reconciliation with non-binary species trees. Yet a striking proportion of species trees are non-binary. For example, 64% of branch points in the NCBI taxonomy have three or more children. When applied to non-binary species trees, current algorithms overestimate the number of duplications because they cannot distinguish between duplication and incomplete lineage sorting. We present the first algorithms for reconciling binary gene trees with non-binary species trees under a duplication-loss parsimony model. Our algorithms utilize an efficient mapping from gene to species trees to infer the minimum number of duplications in O(|V(G)| × (k(S) + h(S))) time, where |V(G)| is the number of nodes in the gene tree, h(S) is the height of the species tree and k(S) is the size of its largest polytomy. We present a dynamic programming algorithm which also minimizes the total number of losses. Although this algorithm is exponential in the size of the largest polytomy, it performs well in practice for polytomies with outdegree of 12 or less. We also present a heuristic which estimates the minimal number of losses in polynomial time. In empirical tests, this algorithm finds an optimal loss history 99% of the time. Our algorithms have been implemented in NOTUNG, a robust, production quality, tree-fitting program, which provides a graphical user interface for exploratory analysis and also supports automated, high-throughput analysis of large data sets.
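For context, the classical binary-tree baseline that this abstract builds on can be sketched as follows: each gene tree node is mapped to the lowest common ancestor (LCA) of the species nodes its children map to, and a node is called a duplication when it maps to the same species node as one of its children. This is only the standard test for binary species trees, not the non-binary algorithms described above or NOTUNG's implementation. The tree encodings (a child-to-parent dictionary and nested tuples whose leaves are species names) and the example trees are assumptions chosen for brevity.

```python
# Hypothetical binary species tree ((human, mouse)mammal, fly)root,
# stored as a child -> parent dictionary.
SPECIES_PARENT = {
    "human": "mammal",
    "mouse": "mammal",
    "mammal": "root",
    "fly": "root",
}


def ancestors(node, parent):
    """Return the path from a species node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two species nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def count_duplications(gene_tree, parent):
    """Return (mapped species node, number of duplications) for a gene tree.

    The gene tree is a nested 2-tuple; each leaf is the name of the species
    in which that gene was sampled.
    """
    if isinstance(gene_tree, str):
        return gene_tree, 0
    left, right = gene_tree
    m_left, d_left = count_duplications(left, parent)
    m_right, d_right = count_duplications(right, parent)
    mapped = lca(m_left, m_right, parent)
    # Duplication: the node maps to the same species node as one of its children.
    is_duplication = mapped in (m_left, m_right)
    return mapped, d_left + d_right + int(is_duplication)


# A gene tree that disagrees with the species tree.
print(count_duplications((("human", "mouse"), ("human", "fly")), SPECIES_PARENT))
```

Applied unchanged to a species tree with polytomies, this test labels as duplications some nodes that incomplete lineage sorting could also explain, which is the overestimation the abstract describes.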