ABSTRACT
The Tasmanian devil (Sarcophilus harrisii), the largest marsupial carnivore, is endangered due to a transmissible facial cancer spread by direct transfer of living cancer cells through biting. Here we describe the sequencing, assembly, and annotation of the Tasmanian devil genome and whole-genome sequences for two geographically distant subclones of the cancer. Genomic analysis suggests that the cancer first arose from a female Tasmanian devil and that the clone has subsequently genetically diverged during its spread across Tasmania. The devil cancer genome contains more than 17,000 somatic base substitution mutations and bears the imprint of a distinct mutational process. Genotyping of somatic mutations in 104 geographically and temporally distributed Tasmanian devil tumors reveals the pattern of evolution and spread of this parasitic clonal lineage, with evidence of a selective sweep in one geographical area and persistence of parallel lineages in other populations.
Subject(s)
Facial Neoplasms/veterinary , Genomic Instability , Marsupialia/genetics , Mutation , Animals , Clonal Evolution , Endangered Species , Facial Neoplasms/epidemiology , Facial Neoplasms/genetics , Facial Neoplasms/pathology , Female , Genome-Wide Association Study , Male , Molecular Sequence Data , Tasmania/epidemiology
ABSTRACT
Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology.
Subject(s)
Molecular Evolution , Genome/genetics , Muridae/genetics , Phylogeny , Animals , Binding Sites , CCCTC-Binding Factor/genetics , Chromosomes/genetics , Karyotyping/methods , Long Interspersed Nucleotide Elements/genetics , Mice , Retroelements/genetics , Species Specificity
ABSTRACT
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
Subject(s)
Genetic Databases , Human Genome/genetics , Genomics , Pseudogenes/genetics , Animals , Computational Biology , Humans , Internet , Mice , Molecular Sequence Annotation , Software
ABSTRACT
Elephantids are the world's most iconic megafaunal family, yet there is no comprehensive genomic assessment of their relationships. We report a total of 14 genomes, including 2 from the American mastodon, which is an extinct elephantid relative, and 12 spanning all three extant and three extinct elephantid species including a ~120,000-y-old straight-tusked elephant, a Columbian mammoth, and woolly mammoths. Earlier genetic studies modeled elephantid evolution via simple bifurcating trees, but here we show that interspecies hybridization has been a recurrent feature of elephantid evolution. We found that the genetic makeup of the straight-tusked elephant, previously placed as a sister group to African forest elephants based on lower coverage data, in fact comprises three major components. Most of the straight-tusked elephant's ancestry derives from a lineage related to the ancestor of African elephants while its remaining ancestry consists of a large contribution from a lineage related to forest elephants and another related to mammoths. Columbian and woolly mammoths also showed evidence of interbreeding, likely following a latitudinal cline across North America. While hybridization events have shaped elephantid history in profound ways, isolation also appears to have played an important role. Our data reveal nearly complete isolation between the ancestors of the African forest and savanna elephants for ~500,000 y, providing compelling justification for the conservation of forest and savanna elephants as separate species.
Subject(s)
Elephants/genetics , Mammoths/genetics , Mastodons/genetics , Animals , Elephants/classification , Molecular Evolution , Biological Extinction , Fossils , Gene Flow , Genome , Genomics/history , Ancient History , Mammoths/classification , Mastodons/classification , Phylogeny
ABSTRACT
Gibbons are small arboreal apes that display an accelerated rate of evolutionary chromosomal rearrangement and occupy a key node in the primate phylogeny between Old World monkeys and great apes. Here we present the assembly and analysis of a northern white-cheeked gibbon (Nomascus leucogenys) genome. We describe the propensity for a gibbon-specific retrotransposon (LAVA) to insert into chromosome segregation genes and alter transcription by providing a premature termination site, suggesting a possible molecular mechanism for the genome plasticity of the gibbon lineage. We further show that the gibbon genera (Nomascus, Hylobates, Hoolock and Symphalangus) experienced a near-instantaneous radiation ~5 million years ago, coincident with major geographical changes in southeast Asia that caused cycles of habitat compression and expansion. Finally, we identify signatures of positive selection in genes important for forelimb development (TBX5) and connective tissues (COL1A1) that may have been involved in the adaptation of gibbons to their arboreal habitat.
Subject(s)
Genome/genetics , Hylobates/classification , Hylobates/genetics , Karyotype , Phylogeny , Animals , Molecular Evolution , Hominidae/classification , Hominidae/genetics , Humans , Molecular Sequence Data , Retroelements/genetics , Genetic Selection , Genetic Transcription Termination
ABSTRACT
Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
Subject(s)
Cichlids/classification , Cichlids/genetics , Molecular Evolution , Genetic Speciation , Genome/genetics , Eastern Africa , Animals , DNA Transposable Elements/genetics , Gene Duplication/genetics , Gene Expression Regulation/genetics , Genomics , Lakes , MicroRNAs/genetics , Phylogeny , Genetic Polymorphism/genetics
ABSTRACT
The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
Subject(s)
Consensus Sequence , Genetic Databases , Open Reading Frames , Animals , Data Curation/methods , Data Curation/standards , Genetic Databases/standards , Guidelines as Topic , Humans , Mice , Molecular Sequence Annotation , National Library of Medicine (U.S.) , United States , User-Computer Interface
ABSTRACT
The discovery of a living coelacanth specimen in 1938 was remarkable, as this lineage of lobe-finned fish was thought to have become extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues show the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
Subject(s)
Biological Evolution , Fishes/classification , Fishes/genetics , Genome/genetics , Animals , Genetically Modified Animals , Chick Embryo , Conserved Sequence/genetics , Genetic Enhancer Elements/genetics , Molecular Evolution , Extremities/anatomy & histology , Extremities/growth & development , Fishes/anatomy & histology , Fishes/physiology , Homeobox Genes/genetics , Genomics , Immunoglobulin M/genetics , Mice , Molecular Sequence Annotation , Molecular Sequence Data , Phylogeny , Sequence Alignment , DNA Sequence Analysis , Vertebrates/anatomy & histology , Vertebrates/genetics , Vertebrates/physiology
ABSTRACT
Ensembl (www.ensembl.org) is a database and genome browser for enabling research on vertebrate genomes. We import, analyse, curate and integrate a diverse collection of large-scale reference data to create a more comprehensive view of genome biology than would be possible from any individual dataset. Our extensive data resources include evidence-based gene and regulatory region annotation, genome variation and gene trees. An accompanying suite of tools, infrastructure and programmatic access methods ensure uniform data analysis and distribution for all supported species. Together, these provide a comprehensive solution for large-scale and targeted genomics applications alike. Among many other developments over the past year, we have improved our resources for gene regulation and comparative genomics, and added CRISPR/Cas9 target sites. We released new browser functionality and tools, including improved filtering and prioritization of genome variation, Manhattan plot visualization for linkage disequilibrium and eQTL data, and an ontology search for phenotypes, traits and disease. We have also enhanced data discovery and access with a track hub registry and a selection of new REST end points. All Ensembl data are freely released to the scientific community and our source code is available via the open source Apache 2.0 license.
Subject(s)
Computational Biology/methods , Genetic Databases , Genomics/methods , Search Engine , Software , Web Browser , Animals , Data Mining , Molecular Evolution , Gene Expression Regulation , Genetic Variation , Human Genome , Humans , Molecular Sequence Annotation , Species Specificity , Vertebrates
ABSTRACT
We describe a genome reference of the African green monkey or vervet (Chlorocebus aethiops). This member of the Old World monkey (OWM) superfamily is uniquely valuable for genetic investigations of simian immunodeficiency virus (SIV), for which it is the most abundant natural host species, and of a wide range of health-related phenotypes assessed in Caribbean vervets (C. a. sabaeus), whose numbers have expanded dramatically since Europeans introduced small numbers of their ancestors from West Africa during the colonial era. We use the reference to characterize the genomic relationship between vervets and other primates, the intra-generic phylogeny of vervet subspecies, and genome-wide structural variations of a pedigreed C. a. sabaeus population. Through comparative analyses with human and rhesus macaque, we characterize at high resolution the unique chromosomal fission events that differentiate the vervets and their close relatives from most other catarrhine primates, in whom karyotype is highly conserved. We also provide a summary of transposable elements and contrast these with the rhesus macaque and human. Analysis of sequenced genomes representing each of the main vervet subspecies supports previously hypothesized relationships between these populations, which range across most of sub-Saharan Africa, while uncovering high levels of genetic diversity within each. Sequence-based analyses of major histocompatibility complex (MHC) polymorphisms reveal extremely low diversity in Caribbean C. a. sabaeus vervets, compared to vervets from putatively ancestral West African regions. In the C. a. sabaeus research population, we discover the first structural variations that are, in some cases, predicted to have a deleterious effect; future studies will determine the phenotypic impact of these variations.
Subject(s)
Chlorocebus aethiops/genetics , Genome , Genomics , Animals , Chlorocebus aethiops/classification , Chromosome Painting , Computational Biology/methods , Molecular Evolution , Gene Rearrangement , Genetic Variation , Genomics/methods , Karyotype , Major Histocompatibility Complex/genetics , Molecular Sequence Annotation , Phylogeny , Phylogeography
ABSTRACT
For 10,000 years pigs and humans have shared a close and complex relationship. From domestication to modern breeding practices, humans have shaped the genomes of domestic pigs. Here we present the assembly and analysis of the genome sequence of a female domestic Duroc pig (Sus scrofa) and a comparison with the genomes of wild and domestic pigs from Europe and Asia. Wild pigs emerged in South East Asia and subsequently spread across Eurasia. Our results reveal a deep phylogenetic split between European and Asian wild boars ~1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.
Subject(s)
Genome/genetics , Phylogeny , Sus scrofa/classification , Sus scrofa/genetics , Animals , Demography , Animal Models , Molecular Sequence Data , Population Dynamics
ABSTRACT
The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.
Subject(s)
Genetic Databases , Genomics , Molecular Sequence Annotation , Animals , Genes , Genetic Variation , Humans , Internet , Mice , Proteins/genetics , Rats , Nucleic Acid Regulatory Sequences , Software
ABSTRACT
Atlantic cod (Gadus morhua) is a large, cold-adapted teleost that sustains long-standing commercial fisheries and incipient aquaculture. Here we present the genome sequence of Atlantic cod, showing evidence for complex thermal adaptations in its haemoglobin gene cluster and an unusual immune architecture compared to other sequenced vertebrates. The genome assembly was obtained exclusively by 454 sequencing of shotgun and paired-end libraries, and automated annotation identified 22,154 genes. The major histocompatibility complex (MHC) II is a conserved feature of the adaptive immune system of jawed vertebrates, but we show that Atlantic cod has lost the genes for MHC II, CD4 and invariant chain (Ii) that are essential for the function of this pathway. Nevertheless, Atlantic cod is not exceptionally susceptible to disease under natural conditions. We find a highly expanded number of MHC I genes and a unique composition of its Toll-like receptor (TLR) families. This indicates how the Atlantic cod immune system has evolved compensatory mechanisms in both adaptive and innate immunity in the absence of MHC II. These observations affect fundamental assumptions about the evolution of the adaptive immune system and its components in vertebrates.
Subject(s)
Gadus morhua/genetics , Gadus morhua/immunology , Genome/genetics , Immune System/immunology , Immunity/genetics , Animals , Molecular Evolution , Genomics , Hemoglobins/genetics , Immunity/immunology , Major Histocompatibility Complex/genetics , Major Histocompatibility Complex/immunology , Male , Genetic Polymorphism/genetics , Synteny/genetics , Toll-Like Receptors/genetics
ABSTRACT
Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics. Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.
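The abstract above notes that the REST server can be queried by programs written in any language. As a minimal sketch of what such a query might look like, the Python below builds a request for a single stable identifier. Several details are assumptions not stated in the abstract: the `lookup/id` endpoint path, the HTTPS form of the server address, the example gene identifier, and the helper names `build_lookup_request` and `lookup`, which are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: HTTPS form of the REST server address cited in the abstract.
ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_request(stable_id, server=ENSEMBL_REST):
    """Build (but do not send) a GET request for one stable identifier.

    Assumption: the server exposes a `lookup/id/<id>` endpoint and
    returns JSON when asked for `application/json`.
    """
    url = f"{server}/lookup/id/{urllib.parse.quote(stable_id)}"
    return urllib.request.Request(url, headers={"Content-Type": "application/json"})


def lookup(stable_id, server=ENSEMBL_REST, timeout=30):
    """Send the request and decode the JSON body (needs network access)."""
    request = build_lookup_request(stable_id, server)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `lookup("ENSG00000157764")` (an illustrative human gene identifier) would be expected to return a dictionary describing that record, provided the server is reachable. Building the request is kept separate from sending it so that the URL and headers can be inspected without touching the network.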
Subject(s)
Nucleic Acid Databases , Genomics , Animals , Genetic Epigenesis , Genetic Variation , Human Genome , Humans , Internet , Mice , Molecular Sequence Annotation , Nucleic Acid Regulatory Sequences , Software
ABSTRACT
Little is known about the genetic changes that distinguish domestic cat populations from their wild progenitors. Here we describe a high-quality domestic cat reference genome assembly and comparative inferences made with other cat breeds, wildcats, and other mammals. Based upon these comparisons, we identified positively selected genes enriched for genes involved in lipid metabolism that underpin adaptations to a hypercarnivorous diet. We also found positive selection signals within genes underlying sensory processes, especially those affecting vision and hearing in the carnivore lineage. We observed an evolutionary tradeoff between functional olfactory and vomeronasal receptor gene repertoires in the cat and dog genomes, with an expansion of the feline chemosensory system for detecting pheromones at the expense of odorant detection. Genomic regions harboring signatures of natural selection that distinguish domestic cats from their wild congeners are enriched in neural crest-related genes associated with behavior and reward in mouse models, as predicted by the domestication syndrome hypothesis. Our description of a previously unidentified allele for the gloving pigmentation pattern found in the Birman breed supports the hypothesis that cat breeds experienced strong selection on specific mutations drawn from random bred populations. Collectively, these findings provide insight into how the process of domestication altered the ancestral wildcat genome and build a resource for future disease mapping and phylogenomic studies across all members of the Felidae.
Subject(s)
Domestic Animals/genetics , Wild Animals/genetics , Cats/genetics , Genome/genetics , Genomics/methods , Physiological Adaptation/genetics , Amino Acid Sequence , Animals , Carnivory , Cats/classification , Chromosome Mapping , DNA Copy Number Variations , Dogs , Female , Gene Deletion , Gene Duplication , Male , Membrane Transport Proteins/classification , Membrane Transport Proteins/genetics , Molecular Sequence Data , Phylogeny , Genetic Selection/genetics , DNA Sequence Analysis , Amino Acid Sequence Homology , Species Specificity
ABSTRACT
Relatively little is known about the character of gene expression evolution as species diverge. It is for instance unclear if gene expression generally evolves in a clock-like manner (by stabilizing selection or neutral evolution) or if there are frequent episodes of directional selection. To gain insights into the evolutionary divergence of gene expression, we sequenced and compared the transcriptomes of multiple organs from population samples of collared (Ficedula albicollis) and pied flycatchers (F. hypoleuca), two species which diverged less than one million years ago. Ordination analysis separated samples by organ rather than by species. Organs differed in their degrees of expression variance within species and expression divergence between species. Variance was negatively correlated with expression breadth and protein interactivity, suggesting that pleiotropic constraints reduce gene expression variance within species. Variance was correlated with between-species divergence, consistent with a pattern expected from stabilizing selection and neutral evolution. Using an expression PST approach, we identified genes differentially expressed between species and found 16 genes uniquely expressed in one of the species. For one of these, DPP7, uniquely expressed in collared flycatcher, the absence of expression in pied flycatcher could be associated with a ≈20-kb deletion including 11 of 13 exons. This study of a young vertebrate speciation model system expands our knowledge of how gene expression evolves as natural populations become reproductively isolated.
Subject(s)
Biological Evolution , Genetic Drift , Genetic Selection , Songbirds/classification , Animals , Female , Gene Expression , Genetic Pleiotropy , Population Genetics , Male , Genetic Models , Songbirds/genetics , Species Specificity , Sweden
ABSTRACT
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.
Subject(s)
Genetic Databases , Genomics , Animals , Chordata/genetics , Genetic Variation , Humans , Internet , Mice , Molecular Sequence Annotation , Phenotype , Rats
ABSTRACT
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
Subject(s)
Genetic Databases , Proteins/genetics , Animals , Exons , Genomics , Humans , Internet , Mice , Molecular Sequence Annotation , Sequence Analysis
ABSTRACT
Within the ENCODE Consortium, GENCODE aimed to accurately annotate all protein-coding genes, pseudogenes, and noncoding transcribed loci in the human genome through manual curation and computational methods. Annotated transcript structures were assessed, and less well-supported loci were systematically validated by experiment. Predicted exon-exon junctions were evaluated by RT-PCR amplification followed by highly multiplexed sequencing readout, a method we called RT-PCR-seq. Seventy-nine percent of all assessed junctions are confirmed by this evaluation procedure, demonstrating the high quality of the GENCODE gene set. RT-PCR-seq was also an efficient way to screen gene models predicted using the Human Body Map (HBM) RNA-seq data. We validated 73% of these predictions, thus confirming 1168 novel genes, mostly noncoding, which will further complement the GENCODE annotation. Our novel experimental validation pipeline is extremely sensitive, far more so than unbiased transcriptome profiling through RNA sequencing, which is becoming the norm. For example, exon-exon junctions unique to GENCODE annotated transcripts are five times more likely to be corroborated with our targeted approach than with extensive large-scale human transcriptome profiling. Data sets such as the HBM and ENCODE RNA-seq data fail to sample transcripts expressed at low levels. Our RT-PCR-seq targeted approach also has the advantage of identifying novel exons of known genes, as we discovered unannotated exons in ~11% of assessed introns. We thus estimate that at least 18% of known loci have yet-unannotated exons. Our work demonstrates that the cataloging of all of the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq.
Subject(s)
Gene Expression Profiling/methods , Human Genome , Transcriptome , Computational Biology/methods , Exons , High-Throughput Nucleotide Sequencing , Humans , Introns , Molecular Sequence Annotation , Open Reading Frames , RNA Isoforms , Messenger RNA/chemistry , Messenger RNA/genetics , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction , Sensitivity and Specificity
ABSTRACT
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.