Results 1 - 18 of 18
1.
Article in English | MEDLINE | ID: mdl-27337980

ABSTRACT

The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html


Subject(s)
Databases, Nucleic Acid , Databases, Protein , Internet , Molecular Sequence Annotation/methods , Animals , Humans , Mice
4.
Mol Ecol ; 25(9): 2015-28, 2016 05.
Article in English | MEDLINE | ID: mdl-26928872

ABSTRACT

Relatively little is known about the character of gene expression evolution as species diverge. It is for instance unclear if gene expression generally evolves in a clock-like manner (by stabilizing selection or neutral evolution) or if there are frequent episodes of directional selection. To gain insights into the evolutionary divergence of gene expression, we sequenced and compared the transcriptomes of multiple organs from population samples of collared (Ficedula albicollis) and pied flycatchers (F. hypoleuca), two species which diverged less than one million years ago. Ordination analysis separated samples by organ rather than by species. Organs differed in their degrees of expression variance within species and expression divergence between species. Variance was negatively correlated with expression breadth and protein interactivity, suggesting that pleiotropic constraints reduce gene expression variance within species. Variance was correlated with between-species divergence, consistent with a pattern expected from stabilizing selection and neutral evolution. Using an expression PST approach, we identified genes differentially expressed between species and found 16 genes uniquely expressed in one of the species. For one of these, DPP7, uniquely expressed in collared flycatcher, the absence of expression in pied flycatcher could be associated with a ≈20-kb deletion including 11 of 13 exons. This study of a young vertebrate speciation model system expands our knowledge of how gene expression evolves as natural populations become reproductively isolated.


Subject(s)
Biological Evolution , Genetic Drift , Selection, Genetic , Songbirds/classification , Animals , Female , Gene Expression , Genetic Pleiotropy , Genetics, Population , Male , Models, Genetic , Songbirds/genetics , Species Specificity , Sweden
5.
Nat Genet ; 48(4): 427-37, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26950095

ABSTRACT

To connect human biology to fish biomedical models, we sequenced the genome of spotted gar (Lepisosteus oculatus), whose lineage diverged from teleosts before teleost genome duplication (TGD). The slowly evolving gar genome has conserved in content and size many entire chromosomes from bony vertebrate ancestors. Gar bridges teleosts to tetrapods by illuminating the evolution of immunity, mineralization and development (mediated, for example, by Hox, ParaHox and microRNA genes). Numerous conserved noncoding elements (CNEs; often cis regulatory) undetectable in direct human-teleost comparisons become apparent using gar: functional studies uncovered conserved roles for such cryptic CNEs, facilitating annotation of sequences identified in human genome-wide association studies. Transcriptomic analyses showed that the sums of expression domains and expression levels for duplicated teleost genes often approximate the patterns and levels of expression for gar genes, consistent with subfunctionalization. The gar genome provides a resource for understanding evolution after genome duplication, the origin of vertebrate genomes and the function of human regulatory sequences.


Subject(s)
Fishes/genetics , Animals , Evolution, Molecular , Female , Fishes/metabolism , Genome , Humans , Karyotype , Models, Genetic , Organ Specificity , Sequence Analysis, DNA , Transcriptome
6.
Article in English | MEDLINE | ID: mdl-26896847

ABSTRACT

Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org.


Subject(s)
Computational Biology/methods , Genome , Genomics , Algorithms , Animals , DNA, Complementary/genetics , Databases, Genetic , Evolution, Molecular , Expressed Sequence Tags , Humans , Phylogeny , Quality Control , RNA, Untranslated/genetics , Sequence Alignment , Sequence Analysis, RNA , Software
7.
Nucleic Acids Res ; 43(Database issue): D662-9, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25352552

ABSTRACT

Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics. Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.
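The abstract says the REST server can be queried by programs written in any language. As a minimal sketch of what that looks like from Python, the code below builds a request URL for the `lookup/id` endpoint and fetches the JSON response. The helper names `build_lookup_url` and `fetch_json` are illustrative and not part of any Ensembl client library, and the stable identifier in the usage note is only an example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Server address as given in the abstract.
REST_SERVER = "http://rest.ensembl.org"


def build_lookup_url(stable_id, server=REST_SERVER):
    """Return the URL of a lookup/id request for one stable identifier.

    The helper name is hypothetical; only the endpoint path and the
    content-type query parameter belong to the REST service.
    """
    query = urlencode({"content-type": "application/json"})
    return f"{server}/lookup/id/{quote(stable_id, safe='')}?{query}"


def fetch_json(url, timeout=30):
    """Perform the GET request and decode the JSON body (needs network access)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(build_lookup_url("ENSG00000139618"))` should return a dictionary describing that gene, provided the server is reachable. Keeping URL construction separate from the network call lets the request format be checked offline.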


Subject(s)
Databases, Nucleic Acid , Genomics , Animals , Epigenesis, Genetic , Genetic Variation , Genome, Human , Humans , Internet , Mice , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid , Software
8.
Science ; 344(6188): 1168-1173, 2014 Jun 06.
Article in English | MEDLINE | ID: mdl-24904168

ABSTRACT

Sheep (Ovis aries) are a major source of meat, milk, and fiber in the form of wool and represent a distinct class of animals that have a specialized digestive organ, the rumen, that carries out the initial digestion of plant material. We have developed and analyzed a high-quality reference sheep genome and transcriptomes from 40 different tissues. We identified highly expressed genes encoding keratin cross-linking proteins associated with rumen evolution. We also identified genes involved in lipid metabolism that had been amplified and/or had altered tissue expression patterns. This may be in response to changes in the barrier lipids of the skin, an interaction between lipid metabolism and wool synthesis, and an increased role of volatile fatty acids in ruminants compared with nonruminant animals.


Subject(s)
Lipid Metabolism/physiology , Rumen/physiology , Sheep, Domestic/genetics , Sheep, Domestic/metabolism , Amino Acid Sequence , Animals , Fatty Acids, Volatile/metabolism , Fatty Acids, Volatile/physiology , Gene Expression Regulation , Genome , Keratins, Hair-Specific/genetics , Lipid Metabolism/genetics , Molecular Sequence Data , Phylogeny , Rumen/metabolism , Sheep, Domestic/classification , Transcriptome , Wool/growth & development
9.
Nucleic Acids Res ; 42(Database issue): D865-72, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24217909

ABSTRACT

The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.


Subject(s)
Databases, Genetic , Proteins/genetics , Animals , Exons , Genomics , Humans , Internet , Mice , Molecular Sequence Annotation , Sequence Analysis
10.
Nat Genet ; 45(4): 415-21, 421e1-2, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23435085

ABSTRACT

Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ∼500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (P. marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and the underlying principles of vertebrate biology. Here, we present the first lamprey whole-genome sequence and assembly. We note challenges faced owing to its high content of repetitive elements and GC bases, as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole-genome duplications likely occurred before the divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and the development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and the evolutionary events that have shaped the genomes of extant organisms.


Subject(s)
Chromosome Mapping , Evolution, Molecular , Genome , Petromyzon/genetics , Vertebrates/genetics , Animals , Phylogeny , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA
11.
Genome Res ; 22(10): 2067-78, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22798491

ABSTRACT

Ensembl gene annotation provides a comprehensive catalog of transcripts aligned to the reference sequence. It relies on publicly available species-specific and orthologous transcripts plus their inferred protein sequence. The accuracy of gene models is improved by increasing the species-specific component that can be cost-effectively achieved using RNA-seq. Two zebrafish gene annotations are presented in Ensembl version 62 built on the Zv9 reference sequence. Firstly, RNA-seq data from five tissues and seven developmental stages were assembled into 25,748 gene models. A 3'-end capture and sequencing protocol was developed to predict the 3' ends of transcripts, and 46.1% of the original models were subsequently refined. Secondly, a standard Ensembl genebuild, incorporating carefully filtered elements from the RNA-seq-only build, followed by a merge with the manually curated VEGA database, produced a comprehensive annotation of 26,152 genes represented by 51,569 transcripts. The RNA-seq-only and the Ensembl/VEGA genebuilds contribute contrasting elements to the final genebuild. The RNA-seq genebuild was used to adjust intron/exon boundaries of orthologous defined models, confirm their expression, and improve 3' untranslated regions. Importantly, the inferred protein alignments within the Ensembl genebuild conferred proof of model contiguity for the RNA-seq models. The zebrafish gene annotation has been enhanced by the incorporation of RNA-seq data and the pipeline will be used for other organisms. Organisms with little species-specific cDNA data will generally benefit the most.


Subject(s)
Computational Biology/methods , Databases, Nucleic Acid , Molecular Sequence Annotation , RNA/chemistry , Zebrafish/genetics , 3' Untranslated Regions , Animals , DNA, Complementary , Exons , Genomics/methods , Introns , Male , Models, Genetic , RNA/genetics , Transcription, Genetic
12.
Cell ; 148(4): 780-91, 2012 Feb 17.
Article in English | MEDLINE | ID: mdl-22341448

ABSTRACT

The Tasmanian devil (Sarcophilus harrisii), the largest marsupial carnivore, is endangered due to a transmissible facial cancer spread by direct transfer of living cancer cells through biting. Here we describe the sequencing, assembly, and annotation of the Tasmanian devil genome and whole-genome sequences for two geographically distant subclones of the cancer. Genomic analysis suggests that the cancer first arose from a female Tasmanian devil and that the clone has subsequently genetically diverged during its spread across Tasmania. The devil cancer genome contains more than 17,000 somatic base substitution mutations and bears the imprint of a distinct mutational process. Genotyping of somatic mutations in 104 geographically and temporally distributed Tasmanian devil tumors reveals the pattern of evolution and spread of this parasitic clonal lineage, with evidence of a selective sweep in one geographical area and persistence of parallel lineages in other populations.


Subject(s)
Facial Neoplasms/veterinary , Genomic Instability , Marsupialia/genetics , Mutation , Animals , Clonal Evolution , Endangered Species , Facial Neoplasms/epidemiology , Facial Neoplasms/genetics , Facial Neoplasms/pathology , Female , Genome-Wide Association Study , Male , Molecular Sequence Data , Tasmania/epidemiology
13.
Genome Biol ; 12(8): R81, 2011 Aug 29.
Article in English | MEDLINE | ID: mdl-21854559

ABSTRACT

BACKGROUND: We present the genome sequence of the tammar wallaby, Macropus eugenii, which is a member of the kangaroo family and the first representative of the iconic hopping mammals that symbolize Australia to be sequenced. The tammar has many unusual biological characteristics, including the longest period of embryonic diapause of any mammal, extremely synchronized seasonal breeding and prolonged and sophisticated lactation within a well-defined pouch. Like other marsupials, it gives birth to highly altricial young, and has a small number of very large chromosomes, making it a valuable model for genomics, reproduction and development. RESULTS: The genome has been sequenced to 2× coverage using Sanger sequencing, enhanced with additional next generation sequencing and the integration of extensive physical and linkage maps to build the genome assembly. We also sequenced the tammar transcriptome across many tissues and developmental time points. Our analyses of these data shed light on mammalian reproduction, development and genome evolution: there is innovation in reproductive and lactational genes, rapid evolution of germ cell genes, and incomplete, locus-specific X inactivation. We also observe novel retrotransposons and a highly rearranged major histocompatibility complex, with many class I genes located outside the complex. Novel microRNAs in the tammar HOX clusters uncover new potential mammalian HOX regulatory elements. CONCLUSIONS: Analyses of these resources enhance our understanding of marsupial gene evolution, identify marsupial-specific conserved non-coding elements and critical genes across a range of biological systems, including reproduction, development and immunity, and provide new insight into marsupial and mammalian biology and genome evolution.


Subject(s)
Biological Evolution , Macropodidae/classification , Macropodidae/genetics , Transcriptome/genetics , Animals , Australia , Chromosome Mapping , Chromosomes, Mammalian/genetics , Female , Gene Expression Regulation , Genome , Genomic Imprinting , In Situ Hybridization, Fluorescence , Macropodidae/growth & development , MicroRNAs/genetics , MicroRNAs/metabolism , Molecular Sequence Data , Reproduction/genetics , Sequence Alignment , Sequence Analysis, DNA
14.
Nature ; 447(7141): 167-77, 2007 May 10.
Article in English | MEDLINE | ID: mdl-17495919

ABSTRACT

We report a high-quality draft of the genome sequence of the grey, short-tailed opossum (Monodelphis domestica). As the first metatherian ('marsupial') species to be sequenced, the opossum provides a unique perspective on the organization and evolution of mammalian genomes. Distinctive features of the opossum chromosomes provide support for recent theories about genome evolution and function, including a strong influence of biased gene conversion on nucleotide sequence composition, and a relationship between chromosomal characteristics and X chromosome inactivation. Comparison of opossum and eutherian genomes also reveals a sharp difference in evolutionary innovation between protein-coding and non-coding functional elements. True innovation in protein-coding genes seems to be relatively rare, with lineage-specific differences being largely due to diversification and rapid turnover in gene families involved in environmental interactions. In contrast, about 20% of eutherian conserved non-coding elements (CNEs) are recent inventions that postdate the divergence of Eutheria and Metatheria. A substantial proportion of these eutherian-specific CNEs arose from sequence inserted by transposable elements, pointing to transposons as a major creative force in the evolution of mammalian gene regulation.


Subject(s)
Evolution, Molecular , Genome/genetics , Genomics , Opossums/genetics , Animals , Base Composition , Conserved Sequence/genetics , DNA Transposable Elements/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Protein Biosynthesis , Synteny/genetics , X Chromosome Inactivation/genetics
15.
Nature ; 438(7069): 803-19, 2005 Dec 08.
Article in English | MEDLINE | ID: mdl-16341006

ABSTRACT

Here we report a high-quality draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health.


Subject(s)
Dogs/genetics , Evolution, Molecular , Genome/genetics , Genomics , Haplotypes/genetics , Animals , Conserved Sequence/genetics , Dog Diseases/genetics , Dogs/classification , Female , Humans , Hybridization, Genetic , Male , Mice , Mutagenesis/genetics , Polymorphism, Single Nucleotide/genetics , Rats , Short Interspersed Nucleotide Elements/genetics , Synteny/genetics
16.
Genome Res ; 14(5): 934-41, 2004 May.
Article in English | MEDLINE | ID: mdl-15123589

ABSTRACT

The Ensembl pipeline is an extension to the Ensembl system which allows automated annotation of genomic sequence. The software comprises two parts. First, there is a set of Perl modules ("Runnables" and "RunnableDBs") which are 'wrappers' for a variety of commonly used analysis tools. These retrieve sequence data from a relational database, run the analysis, and write the results back to the database. They inherit from a common interface, which simplifies the writing of new wrapper modules. On top of this sits a job submission system (the "RuleManager") which allows efficient and reliable submission of large numbers of jobs to a compute farm. Here we describe the fundamental software components of the pipeline, and we also highlight some features of the Sanger installation which were necessary to enable the pipeline to scale to whole-genome analysis.
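The wrapper design described above (fetch sequence from a database, run an analysis, write results back, all behind a common interface) can be sketched in a few lines. The pipeline itself is written in Perl; the Python below only illustrates the pattern. The class names, the method names, the in-memory dictionary standing in for the relational database, and the toy GC-content analysis are all assumptions made for this example.

```python
from abc import ABC, abstractmethod


class RunnableDB(ABC):
    """Hypothetical common interface: fetch input, run analysis, write output."""

    def __init__(self, db, input_id):
        self.db = db              # a plain dict stands in for the relational database
        self.input_id = input_id
        self.sequence = None
        self.results = []

    def fetch_input(self):
        # Retrieve the sequence this job should analyse.
        self.sequence = self.db["sequences"][self.input_id]

    @abstractmethod
    def run(self):
        """Run the wrapped analysis and store its output in self.results."""

    def write_output(self):
        # Write the results back to the same database.
        self.db["results"].setdefault(self.input_id, []).extend(self.results)


class GCContent(RunnableDB):
    """Toy wrapper: only run() has to be written for a new analysis."""

    def run(self):
        seq = self.sequence.upper()
        gc = seq.count("G") + seq.count("C")
        self.results = [("gc_fraction", gc / len(seq))]


def run_job(runnable):
    """Execute one job through the three steps of the shared interface."""
    runnable.fetch_input()
    runnable.run()
    runnable.write_output()


db = {"sequences": {"contig1": "ATGCGC", "contig2": "AATT"}, "results": {}}
for input_id in db["sequences"]:
    run_job(GCContent(db, input_id))
```

Because every wrapper shares the same three steps, a job submission layer only needs to know an analysis name and an input identifier. That is the property that makes it straightforward to submit many jobs to a compute farm.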


Subject(s)
Computational Biology/methods , Base Sequence/genetics , DNA/genetics , Databases, Genetic/standards , Programming Languages , Proteins/classification , Software , Software Design
17.
Genome Res ; 14(5): 963-70, 2004 May.
Article in English | MEDLINE | ID: mdl-15123593

ABSTRACT

With the completion of the human genome sequence and genome sequence available for other vertebrate genomes, the task of manual annotation at the large genome scale has become a priority. Possibly even more important is the requirement to curate and improve this annotation in the light of future data. For this to be possible, there is a need for tools to access and manage the annotation. Ensembl provides an excellent means for storing gene structures, genome features, and sequence, but it does not support the extra textual data necessary for manual annotation. We have extended Ensembl to create the Otter manual annotation system. This comprises a relational database schema for storing the manual annotation data, an application-programming interface (API) to access it, an extensible markup language (XML) format to allow transfer of the data, and a server to allow multiuser/multimachine access to the data. We have also written a data-adaptor plugin for the Apollo Browser/Editor to enable it to utilize an Otter server. The Otter database is currently used by the Vertebrate Genome Annotation (VEGA) site (http://vega.sanger.ac.uk), which provides access to manually curated human chromosomes. Support is also being developed for using the AceDB annotation editor, FMap, via a Perl wrapper called Lace. The Human and Vertebrate Annotation (HAVANA) group annotators at the Sanger center are using this to annotate human chromosomes 1 and 20.


Subject(s)
Software , Computational Biology/methods , Databases, Genetic , Genes/physiology , Genome, Human , Humans , Online Systems
18.
BMC Bioinformatics ; 4: 47, 2003 Oct 10.
Article in English | MEDLINE | ID: mdl-14552658

ABSTRACT

BACKGROUND: The alignment of two or more protein sequences provides a powerful guide in the prediction of the protein structure and in identifying key functional residues; however, the utility of any prediction is completely dependent on the accuracy of the alignment. In this paper we describe a suite of reference alignments derived from the comparison of protein three-dimensional structures together with evaluation measures and software that allow automatically generated alignments to be benchmarked. We test the OXBench benchmark suite on alignments generated by the AMPS multiple alignment method, then apply the suite to compare eight different multiple alignment algorithms. The benchmark shows the current state of the art for alignment accuracy and provides a baseline against which new alignment algorithms may be judged. RESULTS: The simple hierarchical multiple alignment algorithm, AMPS, performed as well as or better than more modern methods such as CLUSTALW once the PAM250 pair-score matrix was replaced by a BLOSUM series matrix. AMPS gave an accuracy in Structurally Conserved Regions (SCRs) of 89.9% over a set of 672 alignments. The T-COFFEE method on a data set of families with <8 sequences gave 91.4% accuracy, significantly better than CLUSTALW (88.9%) and all other methods considered here. The complete suite is available from http://www.compbio.dundee.ac.uk. CONCLUSIONS: The OXBench suite of reference alignments, evaluation software and results database provide a convenient method to assess progress in sequence alignment techniques. Evaluation measures that were dependent on comparison to a reference alignment were found to give good discrimination between methods. The STAMP Sc Score which is independent of a reference alignment also gave good discrimination. Application of OXBench in this paper shows that with the exception of T-COFFEE, the majority of the improvement in alignment accuracy seen since 1985 stems from improved pair-score matrices rather than algorithmic refinements. The maximum theoretical alignment accuracy obtained by pooling results over all methods was 94.5% with 52.5% accuracy for alignments in the 0-10 percentage identity range. This suggests that further improvements in accuracy will be possible in the future.
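To illustrate what an evaluation measure "dependent on comparison to a reference alignment" can look like, the sketch below scores a test alignment by the fraction of residue pairs aligned in the reference that it also aligns. This is a generic pairwise measure chosen for illustration and is not claimed to be one of the measures implemented in OXBench. The function names and the toy alignments are assumptions.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, position in the ungapped sequence),
    so the same residue can be recognised in differently gapped alignments.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    counters = [0] * len(alignment)   # next ungapped position for each row
    pairs = set()
    width = len(alignment[0]) if alignment else 0
    for col in range(width):
        present = []
        for row_idx, row in enumerate(alignment):
            if row[col] != gap:
                present.append((row_idx, counters[row_idx]))
                counters[row_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_accuracy(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment contains no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = ["ACGT", "A-GT"]
candidate = ["ACGT", "AG-T"]
score = pair_accuracy(candidate, reference)   # 2 of the 3 reference pairs are recovered
```

Restricting the reference pairs to columns inside structurally conserved regions would give a region-specific variant of the same idea.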


Subject(s)
Benchmarking/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Alignment/standards , Software , Amino Acid Sequence , Benchmarking/statistics & numerical data , Cluster Analysis , Computational Biology/methods , Computational Biology/standards , Computational Biology/statistics & numerical data , Computer Graphics/standards , Computer Graphics/statistics & numerical data , Conserved Sequence , Databases, Protein , Ferredoxins/chemistry , Internet , Molecular Sequence Data , Reproducibility of Results , Sequence Alignment/statistics & numerical data , Sequence Homology, Amino Acid , Software Design , Software Validation , Statistical Distributions