ABSTRACT
Gene gain-loss-duplication models are commonly based on continuous-time birth-death processes. Employed in a phylogenetic context, such models have been increasingly popular in studies of gene content evolution across multiple genomes. While the applications are becoming more varied and demanding, bioinformatics methods for probabilistic inference on copy numbers (or integer-valued evolutionary characters, in general) are scarce. We describe a flexible probabilistic framework for phylogenetic gain-loss-duplication models. The framework is based on a novel elementary representation by dependent random variables with well-characterized conditional distributions: binomial, Pólya (negative binomial), and Poisson. The corresponding graphical model yields exact numerical procedures for computing the likelihood and the posterior distribution of ancestral copy numbers. The resulting algorithms take quadratic time in the total number of copies. In addition, we show how the likelihood gradient can be computed by a linear-time algorithm.
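As an illustration of the kind of continuous-time process the abstract refers to, the following minimal sketch simulates copy-number change along a single branch. It is a plain event-by-event simulation, not the exact likelihood algorithm described above. The function name and the parameters `gain`, `loss`, and `dup` are assumptions introduced here for illustration.

```python
import random


def simulate_copy_number(n0, duration, gain, loss, dup, rng):
    """Simulate a gain-loss-duplication process along one branch.

    With n copies present, loss occurs at total rate loss*n, duplication
    at total rate dup*n, and gain at the constant rate `gain`.
    Returns the copy number after `duration` time units.
    (All parameter names are hypothetical, chosen for this sketch.)
    """
    n, t = n0, 0.0
    while True:
        total = gain + (loss + dup) * n
        if total == 0.0:
            return n  # no further event is possible
        t += rng.expovariate(total)  # waiting time until the next event
        if t > duration:
            return n
        if rng.random() * total < loss * n:
            n -= 1  # one copy is lost
        else:
            n += 1  # a gain or a duplication adds one copy


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [simulate_copy_number(1, 2.0, 0.5, 1.0, 0.3, rng) for _ in range(10)]
    print(sample)
```

Repeating the simulation many times gives an empirical distribution of copy numbers at the end of the branch, which can serve as a sanity check for an exact computation.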
Subjects
Algorithms; DNA Copy Number Variations; Evolution, Molecular; Gene Duplication; Models, Genetic; Phylogeny
ABSTRACT
Shared genealogies introduce allele dependences in diploid genotypes, as alleles within an individual or between different individuals will likely match when they originate from a recent common ancestor. At a locus shared by a pair of diploid individuals, there are nine combinatorially distinct modes of identity-by-descent (IBD), capturing all possible combinations of coancestry and inbreeding. A distribution over the IBD modes is described by the nine associated probabilities, known as (Jacquard's) identity coefficients. The genetic relatedness between two individuals can be succinctly characterized by the identity coefficients corresponding to a pedigree that contains both individuals. The identity coefficients (together with allele frequencies) determine the distribution of joint genotypes at a locus. At a locus with two possible alleles, identity coefficients are not identifiable because different coefficients can generate the same genotype distribution. We analyze precisely how different IBD modes combine into identical genotype distributions at diallelic loci. In particular, we describe IBD mode mixtures that result in identical genotype distributions at all allele frequencies, implying the non-identifiability of the identity coefficients from independent loci. Our analysis yields an exhaustive characterization of relatedness statistics that are always identifiable. Importantly, we show that identifiable relatedness statistics include the kinship coefficient (probability that a random pair of alleles are identical by descent between individuals) and inbreeding-related measures, which can thus be estimated consistently from genotype distributions at independent loci.
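The relationship between the nine identity coefficients and the identifiable statistics mentioned above can be sketched as follows. The sketch assumes the conventional Jacquard ordering of the condensed coefficients (Delta_1 to Delta_9) and uses the standard textbook formulas for the kinship and inbreeding coefficients. The function name is hypothetical.

```python
from fractions import Fraction


def relatedness_summary(delta):
    """Kinship and inbreeding coefficients from nine identity coefficients.

    `delta` lists Delta_1..Delta_9 in the conventional Jacquard order
    (an assumption of this sketch). Returns the tuple
    (kinship, inbreeding of individual 1, inbreeding of individual 2).
    """
    if len(delta) != 9 or abs(sum(delta) - 1) > 1e-9:
        raise ValueError("expected nine probabilities that sum to 1")
    d1, d2, d3, d4, d5, d6, d7, d8, d9 = delta
    # Probability that one allele drawn at random from each individual is IBD.
    kinship = d1 + (d3 + d5 + d7) / 2 + d8 / 4
    # Probability that the two alleles within an individual are IBD.
    inbreeding_1 = d1 + d2 + d3 + d4
    inbreeding_2 = d1 + d2 + d5 + d6
    return kinship, inbreeding_1, inbreeding_2


if __name__ == "__main__":
    # Outbred full siblings: Delta_7 = 1/4, Delta_8 = 1/2, Delta_9 = 1/4.
    full_sibs = [0, 0, 0, 0, 0, 0, Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
    print(relatedness_summary(full_sibs))
```

Different coefficient vectors can map to the same summary, which is consistent with the point above that the summary statistics, rather than the individual coefficients, are what can be estimated from diallelic loci.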
Subjects
Alleles; Genotype; Humans; Models, Theoretical; Probability
ABSTRACT
The joint sequencing of related genomes has become an important means to discover rare variants. Normal-tumor genome pairs are routinely sequenced together to find somatic mutations and their associations with different cancers. Parental and sibling genomes reveal de novo germline mutations and inheritance patterns related to Mendelian diseases. Acute lymphoblastic leukemia (ALL) is the most common paediatric cancer and the leading cause of cancer-related death among children. With the aim of uncovering the full spectrum of germline and somatic genetic alterations in childhood ALL genomes, we conducted whole-exome re-sequencing on a unique cohort of over 120 exomes of childhood ALL quartets, each comprising a patient's tumor and matched-normal material, and DNA from both parents. We developed a general probabilistic model for such quartet sequencing reads mapped to the reference human genome. The model is used to infer joint genotypes at homologous loci across a normal-tumor genome pair and two parental genomes. We describe the algorithms and data structures for genotype inference and model parameter training. We implemented the methods in an open-source software package (QUADGT) that uses the standard file formats of the 1000 Genomes Project. Our method's utility is illustrated on quartets from the ALL cohort.
Subjects
DNA Mutational Analysis/methods; Genotyping Techniques; Germ-Line Mutation; Mutation; Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics; Algorithms; Child; Exome; Genome, Human; Genotype; Humans; Software
ABSTRACT
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6-7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
Subjects
Eukaryota/genetics; Genome; Introns; Alternative Splicing; Animals; Evolution, Molecular; Humans; Markov Chains
ABSTRACT
SUMMARY: Count is a software package for the analysis of numerical profiles on a phylogeny. It is primarily designed to deal with profiles derived from the phyletic distribution of homologous gene families, but is suited to study any other integer-valued evolutionary characters. Count performs ancestral reconstruction, and infers family- and lineage-specific characteristics along the evolutionary tree. It implements popular methods employed in gene content analysis such as Dollo and Wagner parsimony, propensity for gene loss, as well as probabilistic methods involving a phylogenetic birth-and-death model. AVAILABILITY: Count is available as a stand-alone Java application, as well as an application bundle for MacOS X, at the web site http://www.iro.umontreal.ca/~csuros/gene_content/count.html. It can also be launched using Java Webstart from the same site. The software is distributed under a BSD-style license. Source code is available upon request from the author.
Subjects
Algorithms; Computational Biology/methods; Phylogeny; Software; Probability
ABSTRACT
Homologous genes originate from a common ancestor through vertical inheritance, duplication, or horizontal gene transfer. Entire homolog families spawned by a single ancestral gene can be identified across multiple genomes based on protein sequence similarity. The sequences, however, do not always reveal conclusively the history of large families. To study the evolution of complete gene repertoires, we propose here a mathematical framework that does not rely on resolved gene family histories. We show that so-called phylogenetic profiles, formed by family sizes across multiple genomes, are sufficient to infer principal evolutionary trends. The main novelty in our approach is an efficient algorithm to compute the likelihood of a phylogenetic profile in a model of birth-and-death processes acting on a phylogeny. We examine known gene families in 28 archaeal genomes using a probabilistic model that involves lineage- and family-specific components of gene acquisition, duplication, and loss. The model enables us to consider all possible histories when inferring statistics about archaeal evolution. According to our reconstruction, most lineages are characterized by a net loss of gene families. Major increases in gene repertoire have occurred only a few times. Our reconstruction underlines the importance of persistent streamlining processes in shaping genome composition in Archaea. It also suggests that early archaeal genomes were as complex as typical modern ones, and even show signs, in the case of the methanogenic ancestor, of an extremely large gene repertoire.
Subjects
Archaea/genetics; Genome, Archaeal/genetics; Models, Genetic; Phylogeny; Amino Acid Substitution/genetics; Base Sequence; Computational Biology; Evolution, Molecular; Genes, Archaeal
ABSTRACT
By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in randomly evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law features found across all known genomes. Such a distribution could be the result of completely random evolution by a copying process. Our characterization of the entire frequency distribution of genomic words opens a way to a more accurate reasoning about their over- and underrepresentation in genomic sequences.
Subjects
Genomics; Animals; Evolution, Molecular; Gene Duplication; Genome; Humans; Markov Chains; Oligonucleotides/metabolism
ABSTRACT
Comparison of the exon-intron structures of ancient eukaryotic paralogs reveals the absence of conserved intron positions in these genes. This is in contrast to the conservation of intron positions in orthologous genes from even the most evolutionarily distant eukaryotes and in more recent paralogs. The lack of conserved intron positions in ancient paralogs probably reflects the origination of these genes during the earliest phase of eukaryotic evolution, which was characterized by concomitant invasion of genes by group II self-splicing elements (which were to become introns in the future) and extensive duplication of genes.
Subjects
Evolution, Molecular; Introns; Animals; Gene Duplication; Humans; Models, Genetic
ABSTRACT
Chromalveolates are a large, diverse supergroup of unicellular eukaryotes that includes Apicomplexa, dinoflagellates, ciliates (three lineages that form the alveolate branch), heterokonts, haptophytes, and cryptomonads (three lineages comprising the chromist branch). All sequenced genomes of chromalveolates have relatively low intron density in protein-coding genes, and few intron positions are shared between chromalveolate lineages. In contrast, genes of different chromalveolates share many intron positions with orthologous genes from other eukaryotic supergroups, in particular, the intron-rich orthologs from animals and plants. Reconstruction of the history of intron gain and loss during the evolution of chromalveolates using a general and flexible maximum-likelihood approach indicates that genes of the ancestors of chromalveolates and, particularly, alveolates had unexpectedly high intron densities. It is estimated that the chromalveolate ancestor had approximately two-thirds of the human intron density, whereas the intron density in the genes of the alveolate ancestor is estimated to be slightly greater than the human intron density. Accordingly, it is inferred that the evolution of chromalveolates was dominated by intron loss. The conclusion that ancestral chromalveolate forms had high intron densities is unexpected because all extant unicellular eukaryotes have relatively few introns and are thought to be unable to maintain numerous introns due to intense purifying selection in their, typically, large populations. It is suggested that, at early stages of evolution, chromalveolates went through major population bottlenecks that were accompanied by intron invasion.
Subjects
Eukaryota/genetics; Evolution, Molecular; Introns; Animals; Eukaryotic Cells; Gene Transfer, Horizontal; Genes, Protozoan; Likelihood Functions; Plants/genetics
ABSTRACT
Malin is a software package for the analysis of eukaryotic gene structure evolution. It provides a graphical user interface for various tasks commonly used to infer the evolution of exon-intron structure in protein-coding orthologs. Implemented tasks include the identification of conserved homologous intron sites in protein alignments, as well as the estimation of ancestral intron content, lineage-specific intron losses and gains. Estimates are computed either with parsimony, or with a probabilistic model that incorporates rate variation across lineages and intron sites. AVAILABILITY: Malin is available as a stand-alone Java application, as well as an application bundle for MacOS X, at the website http://www.iro.umontreal.ca/~csuros/introns/malin/. The software is distributed under a BSD-style license.
Subjects
Algorithms; Chromosome Mapping/methods; DNA Mutational Analysis/methods; Evolution, Molecular; Introns/genetics; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Software; Animals; Base Sequence; Humans; Likelihood Functions; Molecular Sequence Data
ABSTRACT
Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon-intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools. We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns, and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after O(n l) preprocessing time, subsequent evaluations take O(n l/log l) time almost surely in the Yule-Harding random model of n-taxon phylogenies, where l is the input sequence length. We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, which is more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now. AVAILABILITY: The Java implementations of the algorithms are publicly available from the corresponding author's site http://www.iro.umontreal.ca/~csuros/introns/. Data are available on request.
Subjects
Chromosome Mapping/methods; DNA Mutational Analysis/methods; Evolution, Molecular; Genetic Variation/genetics; Introns/genetics; Sequence Analysis, DNA/methods; Algorithms; Base Sequence; Molecular Sequence Data
ABSTRACT
We examine exon junctions near apparent amino acid insertions and deletions in alignments of orthologous protein-coding genes. In 1,917 ortholog families across nine oomycete genomes, 10-20% of introns are near an alignment gap, indicating at first sight that splice-site displacements are frequent. We designed a robust algorithmic procedure for the delineation of intron-containing homologous regions, and combined it with a parsimony-based reconstruction of intron loss, gain, and splice-site shift events on a phylogeny. The reconstruction implies that 12% of introns underwent an acceptor-site shift, and 10% underwent a donor-site shift. In order to offset gene annotation problems, we amended the procedure with the reannotation of intron boundaries using alignment evidence. The corresponding reconstruction involves much fewer intron gain and splice-site shift events. The frequency of acceptor- and donor-side shifts drops to 4% and 3%, respectively, which are not much different from what one would expect by random codon insertions and deletions. In other words, gaps near exon junctions are mostly artifacts of gene annotation rather than evidence of sliding intron boundaries. Our study underscores the importance of using well-supported gene structure annotations in comparative studies. When transcription evidence is not available, we propose a robust ancestral reconstruction procedure that corrects misannotated intron boundaries using sequence alignments. The results corroborate the view that boundary shifts and complete intron sliding are only accidental in eukaryotic genome evolution and have a negligible impact on protein diversity.
Subjects
Evolution, Molecular; Oomycetes/genetics; Phylogeny; RNA Splice Sites/genetics; Amino Acid Sequence/genetics; Eukaryota/genetics; Exons/genetics; Genome; INDEL Mutation/genetics; Introns/genetics; Molecular Sequence Annotation; Sequence Alignment; Sequence Homology
ABSTRACT
We present a novel distance-based algorithm for evolutionary tree reconstruction. Our algorithm reconstructs the topology of a tree with n leaves in O(n^2) time using O(n) working space. In the general Markov model of evolution, the algorithm recovers the topology successfully with (1 - o(1)) probability from sequences with polynomial length in n. Moreover, for almost all trees, our algorithm achieves the same success probability on polylogarithmic sample sizes. The theoretical results are supported by simulation experiments involving trees with 500, 1,895, and 3,135 leaves. The topologies of the trees are recovered with high success from 2,000 bp DNA sequences.
Subjects
Algorithms; Evolution, Molecular; Base Sequence; Computational Biology; DNA/genetics; Markov Chains; Models, Genetic; Phylogeny
ABSTRACT
Pooled Genomic Indexing (PGI) is a novel method for physical mapping of clones onto known sequences. PGI is carried out by pooling arrayed clones and generating shotgun sequence reads from the pools. The shotgun sequences are compared to a reference sequence. In the simplest case, clones are placed on an array and are pooled by rows and columns. If a shotgun sequence from a row pool and another shotgun sequence from a column pool match the reference sequence at a close distance, they are both assigned to the clone at the intersection of the two pools. Accordingly, the clone is mapped onto the region of the reference sequence between the two matches. A probabilistic model for PGI is developed, and several pooling designs are described and analyzed, including transversal designs and designs from linear codes. The probabilistic model and the pooling schemes are validated in simulated experiments where 625 rat bacterial artificial chromosome (BAC) clones and 207 mouse BAC clones are mapped onto homologous human sequence.
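The row-and-column assignment rule described above can be sketched in a few lines. This is only an illustration of the simplest case, not the probabilistic model or the more elaborate pooling designs of the paper. The function name, the input layout, and the `max_gap` threshold are assumptions introduced here.

```python
def assign_clones(row_hits, col_hits, max_gap):
    """Map clones on a row/column-pooled array onto a reference sequence.

    row_hits and col_hits are lists of (pool_index, reference_position)
    pairs, one per shotgun read that matched the reference (hypothetical
    input format). A row read and a column read that match within
    `max_gap` positions of each other are assigned to the clone at the
    intersection of the two pools.
    Returns {(row, col): [(start, end), ...]}.
    """
    assignments = {}
    for row, row_pos in row_hits:
        for col, col_pos in col_hits:
            if abs(row_pos - col_pos) <= max_gap:
                # The clone is mapped onto the region between the two matches.
                region = (min(row_pos, col_pos), max(row_pos, col_pos))
                assignments.setdefault((row, col), []).append(region)
    return assignments


if __name__ == "__main__":
    rows = [(0, 1000), (2, 50000)]
    cols = [(1, 1500), (3, 900000)]
    print(assign_clones(rows, cols, max_gap=2000))
```

The pairwise comparison is quadratic in the number of matching reads. Sorting the hits by reference position would reduce the work, but the simple loop keeps the rule itself visible.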
Subjects
Computational Biology; Physical Chromosome Mapping; Research Design; Animals; Chromosomes, Artificial, Bacterial; Data Interpretation, Statistical; Mice; Phylogeny; Probability; Rats
ABSTRACT
We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in postprocessing of sequence alignments. Our key result states a simple recursive relationship between maximum-scoring segment sets. The statement leads to fast algorithms for finding such segment sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
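To make the problem statement concrete, the sketch below computes the maximum total score of at most k disjoint segments with a textbook dynamic program. It is not the recursive relationship or the fast algorithm announced in the abstract, which is not spelled out here. The function name and the "at most k" formulation are assumptions of this sketch.

```python
def max_disjoint_segments(scores, k):
    """Maximum total score of at most k disjoint contiguous segments.

    Standard O(len(scores) * k) dynamic program, shown only to
    illustrate the problem (not the paper's method).
    """
    neg_inf = float("-inf")
    # ending[j]: best total using at most j segments, the last of which
    # ends exactly at the current position.
    ending = [neg_inf] * (k + 1)
    # best[j]: best total using at most j segments in the prefix read so far;
    # choosing no segment at all scores 0.
    best = [0] * (k + 1)
    for x in scores:
        # Go from j = k down to 1 so that best[j - 1] still refers to the
        # prefix that excludes the current position.
        for j in range(k, 0, -1):
            # Either extend the last segment or start a new one at x.
            ending[j] = max(ending[j], best[j - 1]) + x
            best[j] = max(best[j], ending[j])
    return best[k]


if __name__ == "__main__":
    example = [4, -5, 3, -1, 2, -6, 5]
    for k in range(4):
        print(k, max_disjoint_segments(example, k))
```

With k = 1 the computation reduces to the familiar maximum-scoring single segment.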
Subjects
Computational Biology/methods; DNA/chemistry; Sequence Alignment; Algorithms; Animals; Data Interpretation, Statistical; Humans; Likelihood Functions; Methanococcus/genetics; Models, Statistical; Models, Theoretical; Probability; Sulfolobus/genetics
ABSTRACT
This paper studies sequencing and mapping methods that rely solely on pooling and shotgun sequencing of clones. First, we scrutinize and improve the recently proposed Clone-Array Pooled Shotgun Sequencing (CAPSS) method, which delivers a BAC-linked assembly of a whole genome sequence. Secondly, we introduce a novel physical mapping method, called Clone-Array Pooled Shotgun Mapping (CAPS-MAP), which computes the physical ordering of BACs in a random library. Both CAPSS and CAPS-MAP construct subclone libraries from pooled genomic BAC clones.
Subjects
Computational Biology/methods; Animals; Chromosomes, Artificial, Bacterial; Computer Simulation; Drosophila melanogaster/genetics; Genome; Reproducibility of Results; Research Design; Sequence Alignment
ABSTRACT
In eukaryotes, protein-coding sequences are interrupted by non-coding sequences known as introns. During mRNA maturation, introns are excised by the spliceosome and the coding regions, exons, are spliced to form the mature coding region. The intron densities widely differ between eukaryotic lineages, from 6 to 7 introns per kb of coding sequence in vertebrates, some invertebrates and green plants, to only a few introns across the entire genome in many unicellular eukaryotes. Evolutionary reconstructions using maximum likelihood methods suggest intron-rich ancestors for each major group of eukaryotes. For the last common ancestor of animals, the highest intron density of all extant and extinct eukaryotes was inferred, at 120-130% of the human intron density. Furthermore, an intron density within 53-74% of the human values was inferred for the last eukaryotic common ancestor. Accordingly, evolution of eukaryotic genes in all lines of descent involved primarily intron loss, with substantial gain only at the bases of several branches including plants and animals. These conclusions have substantial biological implications indicating that the common ancestor of all modern eukaryotes was a complex organism with a gene architecture resembling those in multicellular organisms. Alternative splicing most likely initially appeared as an inevitable result of splicing errors and only later was employed to generate structural and functional diversification of proteins.
Subjects
Eukaryota/genetics; Exons/genetics; Introns/genetics; Alternative Splicing; Animals; Biological Evolution; Conserved Sequence/genetics; Evolution, Molecular; Genome; Humans; Spliceosomes/genetics
ABSTRACT
Marine bacteria in the Roseobacter and SAR11 lineages successfully exploit the ocean habitat, together accounting for ~40% of bacteria in surface waters, yet have divergent life histories that exemplify patch-adapted versus free-living ecological roles. Here, we use a phylogenetic birth-and-death model to understand how genome content supporting different life history strategies evolved in these related alphaproteobacterial taxa, showing that the streamlined genomes of free-living SAR11 were gradually downsized from a common ancestral genome only slightly larger than the extant members (~2,000 genes), while the larger and variably sized genomes of roseobacters evolved along dynamic pathways from a sizeable common ancestor (~8,000 genes). Genome changes in the SAR11 lineage occurred gradually over ~800 million years, whereas Roseobacter genomes underwent more substantial modifications, including major periods of expansion, over ~260 million years. The timing of the first Roseobacter genome expansion was coincident with the predicted radiation of modern marine eukaryotic phytoplankton of sufficient size to create nutrient-enriched microzones and is consistent with present-day ecological associations between these microbial groups. We suggest that diversification of red-lineage phytoplankton is an important driver of divergent life history strategies among the heterotrophic bacterioplankton taxa that dominate the present-day ocean. IMPORTANCE: One-half of global primary production occurs in the oceans, and more than half of this is processed by heterotrophic bacterioplankton through the marine microbial food web.
The diversity of life history strategies that characterize different bacterioplankton taxa is an important subject, since the locations and mechanisms whereby bacteria interact with seawater organic matter have effects on microbial growth rates, metabolic pathways, and growth efficiencies, and these in turn affect rates of carbon mineralization to the atmosphere and sequestration into the deep sea. Understanding the evolutionary origins of the ecological strategies that underlie biochemical interactions of bacteria with the ocean system, and which scale up to affect globally important biogeochemical processes, will improve understanding of how microbial diversity is maintained and enable useful predictions about microbial response in the future ocean.
Subjects
Alphaproteobacteria/genetics; Aquatic Organisms/genetics; Evolution, Molecular; Seawater/microbiology; Genome, Bacterial
Subjects
Eukaryotic Cells/metabolism; Evolution, Molecular; Introns/genetics; Models, Genetic; Sequence Analysis, DNA/methods; Animals; Computer Simulation; Conserved Sequence/genetics; Humans; Likelihood Functions; Models, Statistical; Sequence Alignment/methods; Spliceosomes/metabolism
ABSTRACT
Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded 'introns first', held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes.
There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence but exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes.