Results 1 - 20 of 76
1.
Nature ; 622(7981): 41-47, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.


Subject(s)
Genes , Genome, Human , Molecular Sequence Annotation , Protein Isoforms , Humans , Genome, Human/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
2.
EMBO J ; 40(6): e107409, 2021 03 15.
Article in English | MEDLINE | ID: mdl-33565128

ABSTRACT

A new inter-governmental research infrastructure, ELIXIR, aims to unify bioinformatics resources and life science data across Europe, thereby facilitating their mining and (re-)use.


Subject(s)
Biomedical Research , Computational Biology , Information Storage and Retrieval , Biological Science Disciplines , Europe , Humans
3.
Nat Rev Genet ; 17(12): 758-772, 2016 12.
Article in English | MEDLINE | ID: mdl-27773922

ABSTRACT

A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe - or 'annotate' - genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists - from clinicians to evolutionary biologists - need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets. Their creation, however, remains challenging.


Subject(s)
Eukaryota/genetics , Genomics/methods , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Animals , Humans
4.
J Proteome Res ; 20(4): 1821-1825, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33720718

ABSTRACT

The large diversity of experimental methods in proteomics as well as their increasing usage across biological and clinical research has led to the development of hundreds if not thousands of software tools to aid in the analysis and interpretation of the resulting data. Detailed information about these tools needs to be collected, categorized, and validated to guarantee their optimal utilization. A tools registry like bio.tools enables users and developers to identify new tools with more powerful algorithms or to find tools with similar functions for comparison. Here we present the content of the registry, which now comprises more than 1000 proteomics tool entries. Furthermore, we discuss future applications and engagement with other community efforts resulting in a high impact on the bioinformatics landscape.


Subject(s)
Proteomics , Software , Algorithms , Computational Biology
5.
BMC Genomics ; 21(1): 196, 2020 Mar 03.
Article in English | MEDLINE | ID: mdl-32126975

ABSTRACT

BACKGROUND: Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with 874 in human and 1483 loci in mouse (including pseudogenes). The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. RESULTS: Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon. CONCLUSIONS: This work provides the most comprehensive curation effort of the human and mouse OR gene repertoires to date. The complete annotation has been integrated into the GENCODE reference gene set, for immediate availability to the research community.


Subject(s)
Conserved Sequence , Exons/genetics , Quantitative Trait Loci , Receptors, Odorant/genetics , Animals , Data Curation/methods , Databases, Genetic , Genetic Loci , Genome, Human , Humans , Mice , Pseudogenes
8.
Genes Dev ; 25(1): 1-10, 2011 Jan 01.
Article in English | MEDLINE | ID: mdl-21205862

ABSTRACT

The first wave of personal genomes documents how no single individual genome contains the full complement of functional genes. Here, we describe the extent of variation in gene and pseudogene numbers between individuals arising from inactivation events such as premature termination or aberrant splicing due to single-nucleotide polymorphisms. This highlights the inadequacy of the current reference sequence and gene set. We present a proposal to define a reference gene set that will remain stable as more individuals are sequenced. In particular, we recommend that the ancestral allele be used to define the reference sequence from which a core human reference gene annotation set can be derived. In addition, we call for the development of an expanded gene set to include human-specific genes that have arisen recently and are absent from the ancestral set.


Subject(s)
Gene Silencing/physiology , Genetic Privacy , Molecular Sequence Annotation , Genetic Privacy/trends , Genetic Variation , Genome, Human/genetics , Humans , Polymorphism, Single Nucleotide
9.
Nucleic Acids Res ; 44(D1): D710-6, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26687719

ABSTRACT

The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.


Subject(s)
Databases, Genetic , Genomics , Molecular Sequence Annotation , Animals , Genes , Genetic Variation , Humans , Internet , Mice , Proteins/genetics , Rats , Regulatory Sequences, Nucleic Acid , Software
11.
Nature ; 474(7351): 337-42, 2011 Jun 15.
Article in English | MEDLINE | ID: mdl-21677750

ABSTRACT

Gene targeting in embryonic stem cells has become the principal technology for manipulation of the mouse genome, offering unrivalled accuracy in allele design and access to conditional mutagenesis. To bring these advantages to the wider research community, large-scale mouse knockout programmes are producing a permanent resource of targeted mutations in all protein-coding genes. Here we report the establishment of a high-throughput gene-targeting pipeline for the generation of reporter-tagged, conditional alleles. Computational allele design, 96-well modular vector construction and high-efficiency gene-targeting strategies have been combined to mutate genes on an unprecedented scale. So far, more than 12,000 vectors and 9,000 conditional targeted alleles have been produced in highly germline-competent C57BL/6N embryonic stem cells. High-throughput genome engineering highlighted by this study is broadly applicable to rat and human stem cells and provides a foundation for future genome-wide efforts aimed at deciphering the function of all genes encoded by the mammalian genome.


Subject(s)
Gene Deletion , Gene Knockout Techniques/methods , Genes/genetics , Genetic Association Studies/methods , Genome/genetics , Mice, Knockout/genetics , Alleles , Animals , Computational Biology , Embryonic Stem Cells/cytology , Embryonic Stem Cells/metabolism , Genes, Lethal/genetics , Genetic Vectors/genetics , Genomics , Genotype , Humans , Mice , Mice, Inbred C57BL , Mutagenesis, Insertional/methods , Phenotype , Polymerase Chain Reaction , Rats
12.
Nucleic Acids Res ; 43(Database issue): D123-9, 2015 01.
Article in English | MEDLINE | ID: mdl-25352543

ABSTRACT

The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected species.


Subject(s)
Databases, Nucleic Acid , RNA, Untranslated/chemistry , Chromosome Mapping , Humans , Internet , RNA, Untranslated/genetics , Sequence Analysis, RNA
13.
Nucleic Acids Res ; 43(Database issue): D662-9, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25352552

ABSTRACT

Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics. Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.


Subject(s)
Databases, Nucleic Acid , Genomics , Animals , Epigenesis, Genetic , Genetic Variation , Genome, Human , Humans , Internet , Mice , Molecular Sequence Annotation , Regulatory Sequences, Nucleic Acid , Software
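
The abstract above notes that the Ensembl REST server lets programs written in any language query the databases. A minimal Python sketch of such a query follows, using only the standard library. The helper names `build_lookup_url` and `lookup_symbol` are hypothetical, the `https` scheme is an assumption (the abstract gives `http://rest.ensembl.org`), and the `/lookup/symbol/` endpoint with the `content-type` query parameter reflects the public REST API rather than anything stated in the abstract.

```python
import json
import urllib.parse
import urllib.request

# Assumption: server address from the abstract, with https swapped in.
ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_url(symbol, species="homo_sapiens", server=ENSEMBL_REST):
    """Build the URL that looks up a gene by its symbol (hypothetical helper)."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species, safe=""),
        urllib.parse.quote(symbol, safe=""),
    )
    # Ask for JSON through the content-type query parameter.
    return server.rstrip("/") + path + "?content-type=application/json"


def lookup_symbol(symbol, species="homo_sapiens", timeout=30):
    """Fetch the gene record as a dict. This performs a network request."""
    url = build_lookup_url(symbol, species)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `lookup_symbol("BRCA2")` would request the record for that gene symbol in human; the fields returned depend on the server release, so none are assumed here.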
14.
Proc Natl Acad Sci U S A ; 111(37): 13361-6, 2014 Sep 16.
Article in English | MEDLINE | ID: mdl-25157146

ABSTRACT

Pseudogenes are degraded fossil copies of genes. Here, we report a comparison of pseudogenes spanning three phyla, leveraging the completed annotations of the human, worm, and fly genomes, which we make available as an online resource. We find that pseudogenes are lineage specific, much more so than protein-coding genes, reflecting the different remodeling processes marking each organism's genome evolution. The majority of human pseudogenes are processed, resulting from a retrotranspositional burst at the dawn of the primate lineage. This burst can be seen in the largely uniform distribution of pseudogenes across the genome, their preservation in areas with low recombination rates, and their preponderance in highly expressed gene families. In contrast, worm and fly pseudogenes tell a story of numerous duplication events. In worm, these duplications have been preserved through selective sweeps, so we see a large number of pseudogenes associated with highly duplicated families such as chemoreceptors. However, in fly, the large effective population size and high deletion rate resulted in a depletion of the pseudogene complement. Despite large variations between these species, we also find notable similarities. Overall, we identify a broad spectrum of biochemical activity for pseudogenes, with the majority in each organism exhibiting varying degrees of partial activity. In particular, we identify a consistent amount of transcription (∼15%) across all species, suggesting a uniform degradation process. Also, we see a uniform decay of pseudogene promoter activity relative to their coding counterparts and identify a number of pseudogenes with conserved upstream sequences and activity, hinting at potential regulatory roles.


Subject(s)
Caenorhabditis elegans/genetics , Drosophila melanogaster/genetics , Phylogeny , Pseudogenes/genetics , Animals , Evolution, Molecular , Genetic Association Studies , Humans , Molecular Sequence Annotation , Promoter Regions, Genetic/genetics , Sequence Homology, Nucleic Acid
15.
Hum Mol Genet ; 23(22): 5866-78, 2014 Nov 15.
Article in English | MEDLINE | ID: mdl-24939910

ABSTRACT

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.


Subject(s)
Proteins/genetics , Computational Biology , Genome, Human , Humans , Open Reading Frames , Peptides/genetics , Proteins/metabolism , Proteomics
16.
Hum Mol Genet ; 23(16): 4420-32, 2014 Aug 15.
Article in English | MEDLINE | ID: mdl-24688116

ABSTRACT

The genetic contribution to the variation in human lifespan is ∼ 25%. Despite the large number of identified disease-susceptibility loci, it is not known which loci influence population mortality. We performed a genome-wide association meta-analysis of 7729 long-lived individuals of European descent (≥ 85 years) and 16 121 younger controls (<65 years) followed by replication in an additional set of 13 060 long-lived individuals and 61 156 controls. In addition, we performed a subset analysis in cases aged ≥ 90 years. We observed genome-wide significant association with longevity, as reflected by survival to ages beyond 90 years, at a novel locus, rs2149954, on chromosome 5q33.3 (OR = 1.10, P = 1.74 × 10^-8). We also confirmed association of rs4420638 on chromosome 19q13.32 (OR = 0.72, P = 3.40 × 10^-36), representing the TOMM40/APOE/APOC1 locus. In a prospective meta-analysis (n = 34 103), the minor allele of rs2149954 (T) on chromosome 5q33.3 associates with increased survival (HR = 0.95, P = 0.003). This allele has previously been reported to associate with low blood pressure in middle age. Interestingly, the minor allele (T) associates with decreased cardiovascular mortality risk, independent of blood pressure. We report on the first GWAS-identified longevity locus on chromosome 5q33.3 influencing survival in the general European population. The minor allele of this locus associates with low blood pressure in middle age, although the contribution of this allele to survival may be less dependent on blood pressure. Hence, the pleiotropic mechanisms by which this intragenic variation contributes to lifespan regulation have to be elucidated.


Subject(s)
Genetic Loci/physiology , Longevity/genetics , Age Factors , Aged , Aged, 80 and over , Cardiovascular Diseases/genetics , Chromosome Mapping , Chromosomes, Human, Pair 19 , Chromosomes, Human, Pair 5 , Female , Genome-Wide Association Study , Humans , Hypertension/genetics , Male , Phenotype , Prospective Studies , White People
17.
Genome Res ; 23(12): 1961-73, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24172201

ABSTRACT

The last decade has seen tremendous effort committed to the annotation of the human genome sequence, most notably perhaps in the form of the ENCODE project. One of the major findings of ENCODE, and other genome analysis projects, is that the human transcriptome is far larger and more complex than previously thought. This complexity manifests, for example, as alternative splicing within protein-coding genes, as well as in the discovery of thousands of long noncoding RNAs. It is also possible that significant numbers of human transcripts have not yet been described by annotation projects, while existing transcript models are frequently incomplete. The question as to what proportion of this complexity is truly functional remains open, however, and this ambiguity presents a serious challenge to genome scientists. In this article, we will discuss the current state of human transcriptome annotation, drawing on our experience gained in generating the GENCODE gene annotation set. We highlight the gaps in our knowledge of transcript functionality that remain, and consider the potential computational and experimental strategies that can be used to help close them. We propose that an understanding of the true overlap between transcriptional complexity and functionality will not be gained in the short term. However, significant steps toward obtaining this knowledge can now be taken by using an integrated strategy, combining all of the experimental resources at our disposal.


Subject(s)
Genomics/methods , Molecular Sequence Annotation , Proteins/genetics , Transcriptome , Alternative Splicing , Animals , Databases, Genetic , Evolution, Molecular , Genome, Human , Humans , Proteomics , RNA, Long Noncoding , Sequence Alignment
18.
Nat Methods ; 10(12): 1177-84, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24185837

ABSTRACT

We evaluated 25 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression-level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression-level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.


Subject(s)
Computational Biology/methods , RNA Splicing , Sequence Analysis, RNA/methods , Algorithms , Animals , Caenorhabditis elegans , Drosophila melanogaster , Exons , Gene Expression Profiling , Genome , Humans , Introns , RNA Splice Sites , RNA, Messenger/metabolism , Software
19.
Nat Methods ; 10(12): 1185-91, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24185836

ABSTRACT

High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.


Subject(s)
RNA Splicing , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Animals , Chromosome Mapping/methods , Computational Biology/methods , Exons , False Positive Reactions , High-Throughput Nucleotide Sequencing/methods , Humans , K562 Cells , Mice , RNA, Messenger/metabolism , Reproducibility of Results , Software
20.
Nucleic Acids Res ; 42(Database issue): D771-9, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24316575

ABSTRACT

The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).


Subject(s)
Databases, Genetic , Genome , Molecular Sequence Annotation , Animals , Genome, Human , Genomics , Humans , Internet , Mice , Mice, Inbred NOD , Mice, Knockout , Rats , Swine/genetics , Zebrafish/genetics