Results 1 - 20 of 76
1.
Nature ; 622(7981): 41-47, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794265

ABSTRACT

Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.


Subject(s)
Genes; Genome, Human; Molecular Sequence Annotation; Protein Isoforms; Humans; Genome, Human/genetics; Molecular Sequence Annotation/standards; Molecular Sequence Annotation/trends; Protein Isoforms/genetics; Human Genome Project; Pseudogenes; RNA/genetics
2.
ArXiv ; 2023 Mar 24.
Article in English | MEDLINE | ID: mdl-36994150

ABSTRACT

Scientists have been trying to identify all of the genes in the human genome since the initial draft of the genome was published in 2001. Over the intervening years, much progress has been made in identifying protein-coding genes, and the estimated number has shrunk to fewer than 20,000, although the number of distinct protein-coding isoforms has expanded dramatically. The invention of high-throughput RNA sequencing and other technological breakthroughs have led to an explosion in the number of reported non-coding RNA genes, although most of them do not yet have any known function. A combination of recent advances offers a path forward to identifying these functions and towards eventually completing the human gene catalogue. However, much work remains to be done before we have a universal annotation standard that includes all medically significant genes, maintains their relationships with different reference genomes, and describes clinically relevant genetic variants.

3.
Sci Data ; 9(1): 622, 2022 10 14.
Article in English | MEDLINE | ID: mdl-36241754

ABSTRACT

Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities and research software is now being understood as a type of digital object to which FAIR should be applied. This emergence reflects a maturation of the research community to better understand the crucial role of FAIR research software in maximising research value. The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.

7.
J Proteome Res ; 20(4): 1821-1825, 2021 04 02.
Article in English | MEDLINE | ID: mdl-33720718

ABSTRACT

The large diversity of experimental methods in proteomics as well as their increasing usage across biological and clinical research has led to the development of hundreds if not thousands of software tools to aid in the analysis and interpretation of the resulting data. Detailed information about these tools needs to be collected, categorized, and validated to guarantee their optimal utilization. A tools registry like bio.tools enables users and developers to identify new tools with more powerful algorithms or to find tools with similar functions for comparison. Here we present the content of the registry, which now comprises more than 1000 proteomics tool entries. Furthermore, we discuss future applications and engagement with other community efforts resulting in a high impact on the bioinformatics landscape.


Subject(s)
Proteomics; Software; Algorithms; Computational Biology
8.
EMBO J ; 40(6): e107409, 2021 03 15.
Article in English | MEDLINE | ID: mdl-33565128

ABSTRACT

A new inter-governmental research infrastructure, ELIXIR, aims to unify bioinformatics resources and life science data across Europe, thereby facilitating their mining and (re-)use.


Subject(s)
Biomedical Research; Computational Biology; Information Storage and Retrieval; Biological Science Disciplines; Europe; Humans
9.
Nat Commun ; 11(1): 3695, 2020 07 29.
Article in English | MEDLINE | ID: mdl-32728065

ABSTRACT

Pseudogenes are ideal markers of genome remodelling. In turn, the mouse is an ideal platform for studying them, particularly with the recent availability of strain-sequencing and transcriptional data. Here, combining both manual curation and automatic pipelines, we present a genome-wide annotation of the pseudogenes in the mouse reference genome and 18 inbred mouse strains (available via the mouse.pseudogene.org resource). We also annotate 165 unitary pseudogenes in mouse and 303 in human. The overall pseudogene repertoire in mouse is similar to that in human in terms of size, biotype distribution, and family composition (e.g. with GAPDH and ribosomal proteins being the largest families). Notable differences arise in the pseudogene age distribution, with multiple retro-transpositional bursts in mouse evolutionary history and only one in human. Furthermore, in each strain about a fifth of all pseudogenes are unique, reflecting strain-specific evolution. Finally, we find that ~15% of the mouse pseudogenes are transcribed, and that highly transcribed parent genes tend to give rise to many processed pseudogenes.


Subject(s)
Pseudogenes/genetics; Transcription, Genetic; Animals; Conserved Sequence/genetics; Evolution, Molecular; Gene Ontology; Genome; Humans; Mice, Inbred C57BL; Molecular Sequence Annotation; Species Specificity
10.
BMC Genomics ; 21(1): 196, 2020 Mar 03.
Article in English | MEDLINE | ID: mdl-32126975

ABSTRACT

BACKGROUND: Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with 874 loci in human and 1483 in mouse (including pseudogenes). The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. RESULTS: Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon. CONCLUSIONS: This work provides the most comprehensive curation effort of the human and mouse OR gene repertoires to date. The complete annotation has been integrated into the GENCODE reference gene set, for immediate availability to the research community.


Subject(s)
Conserved Sequence; Exons/genetics; Quantitative Trait Loci; Receptors, Odorant/genetics; Animals; Data Curation/methods; Databases, Genetic; Genetic Loci; Genome, Human; Humans; Mice; Pseudogenes
11.
NPJ Genom Med ; 4: 31, 2019.
Article in English | MEDLINE | ID: mdl-31814998

ABSTRACT

The developmental and epileptic encephalopathies (DEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60-65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for a lack of current diagnoses. To this end, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 putative transcript models. Our annotations increase the transcriptional 'footprint' of these genes by over 674 kb. Using SCN1A as a case study, due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome or a similar phenotype with a panel of exon sequences representing eight established genes and identified two de novo SCN1A variants that now, through improved gene annotation, are found to reside within our exons. These two (from 122 screened people, 1.6%) molecular diagnoses carry significant clinical implications. Furthermore, we identified an SCN1A Dravet syndrome-associated variant, previously classified as intronic, that now lies within a deeply conserved exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders.

12.
Nat Genet ; 50(11): 1574-1583, 2018 11.
Article in English | MEDLINE | ID: mdl-30275530

ABSTRACT

We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.


Subject(s)
Chromosome Mapping; Genetic Loci; Genome; Haplotypes; Mice, Inbred Strains/genetics; Animals; Animals, Laboratory; Chromosome Mapping/veterinary; Haplotypes/genetics; Mice; Mice, Inbred BALB C/genetics; Mice, Inbred C3H/genetics; Mice, Inbred C57BL/genetics; Mice, Inbred CBA/genetics; Mice, Inbred DBA/genetics; Mice, Inbred NOD/genetics; Mice, Inbred Strains/classification; Molecular Sequence Annotation; Phylogeny; Polymorphism, Single Nucleotide; Species Specificity
13.
Nat Genet ; 49(12): 1731-1740, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29106417

ABSTRACT

Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.


Subject(s)
Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Molecular Sequence Annotation/methods; RNA, Long Noncoding/genetics; Animals; Gene Expression Profiling/methods; Genomics/methods; Humans; Mice; Open Reading Frames/genetics; Reproducibility of Results
14.
Genome Med ; 9(1): 49, 2017 05 30.
Article in English | MEDLINE | ID: mdl-28558813

ABSTRACT

The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.


Subject(s)
Diagnostic Techniques and Procedures; Molecular Sequence Annotation/methods; Sequence Analysis, DNA/methods; Genetic Variation; Humans; Pseudogenes
15.
Nat Rev Genet ; 17(12): 758-772, 2016 12.
Article in English | MEDLINE | ID: mdl-27773922

ABSTRACT

A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe - or 'annotate' - genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists - from clinicians to evolutionary biologists - need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets. Their creation, however, remains challenging.


Subject(s)
Eukaryota/genetics; Genomics/methods; Molecular Sequence Annotation/methods; Sequence Analysis, DNA/methods; Animals; Humans
16.
Nat Commun ; 7: 12339, 2016 08 17.
Article in English | MEDLINE | ID: mdl-27531712

ABSTRACT

Long non-coding RNAs (lncRNAs) constitute a large, yet mostly uncharacterized fraction of the mammalian transcriptome. Such characterization requires a comprehensive, high-quality annotation of their gene structure and boundaries, which is currently lacking. Here we describe RACE-Seq, an experimental workflow designed to address this based on RACE (rapid amplification of cDNA ends) and long-read RNA sequencing. We apply RACE-Seq to 398 human lncRNA genes in seven tissues, leading to the discovery of 2,556 on-target, novel transcripts. About 60% of the targeted loci are extended in either 5' or 3', often reaching genomic hallmarks of gene boundaries. Analysis of the novel transcripts suggests that lncRNAs are as long, have as many exons and undergo as much alternative splicing as protein-coding genes, contrary to current assumptions. Overall, we show that RACE-Seq is an effective tool to annotate an organism's deep transcriptome, and compares favourably to other targeted sequencing techniques.


Subject(s)
High-Throughput Nucleotide Sequencing/methods; Polymerase Chain Reaction/methods; RNA, Long Noncoding/genetics; Sequence Analysis, RNA/methods; Exons/genetics; Genetic Loci; Humans; Molecular Sequence Annotation; Organ Specificity/genetics; Proof of Concept Study; Protein Isoforms/genetics; Protein Isoforms/metabolism; RNA Splice Sites/genetics; RNA, Long Noncoding/metabolism; RNA, Messenger/genetics; RNA, Messenger/metabolism; Transcriptome/genetics
17.
Nat Commun ; 7: 11778, 2016 06 02.
Article in English | MEDLINE | ID: mdl-27250503

ABSTRACT

Complete annotation of the human genome is indispensable for medical research. The GENCODE consortium strives to provide this, augmenting computational and experimental evidence with manual annotation. The rapidly developing field of proteogenomics provides evidence for the translation of genes into proteins and can be used to discover and refine gene models. However, for both the proteomics and annotation groups, there is a lack of guidelines for integrating this data. Here we report a stringent workflow for the interpretation of proteogenomic data that could be used by the annotation community to interpret novel proteogenomic evidence. Based on reprocessing of three large-scale publicly available human data sets, we show that a conservative approach using stringent filtering is required to generate valid identifications. Evidence has been found supporting 16 novel protein-coding genes being added to GENCODE. Despite this, many peptide identifications in pseudogenes cannot be annotated due to the absence of orthogonal supporting evidence.


Subject(s)
Genome, Human; Molecular Sequence Annotation/methods; Proteins/genetics; Proteogenomics/methods; Pseudogenes; Amino Acid Sequence; Gene Expression Regulation; Gene Ontology; Humans; Molecular Sequence Annotation/statistics & numerical data; Open Reading Frames; Proteins/metabolism
18.
Nucleic Acids Res ; 44(D1): D710-6, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26687719

ABSTRACT

The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third-party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally, we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.


Subject(s)
Databases, Genetic; Genomics; Molecular Sequence Annotation; Animals; Genes; Genetic Variation; Humans; Internet; Mice; Proteins/genetics; Rats; Regulatory Sequences, Nucleic Acid; Software
19.
J Proteome Res ; 14(12): 4945-8, 2015 Dec 04.
Article in English | MEDLINE | ID: mdl-26367542

ABSTRACT

A report on the Wellcome Trust retreat on devising a consensus framework for the validation of novel human protein-coding loci, held in Hinxton, U.K., May 11-13, 2015.


Subject(s)
Genome, Human; Proteins/analysis; Proteomics/methods; Pseudogenes; Genetic Variation; Genomics/methods; Humans; Mass Spectrometry/methods; Open Reading Frames; Proteins/genetics; RNA, Long Noncoding; Reproducibility of Results; Ribosomes/metabolism
20.
Article in English | MEDLINE | ID: mdl-26412852

ABSTRACT

Homeobox genes are a group of genes coding for transcription factors with a DNA-binding helix-turn-helix structure called a homeodomain and which play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters and some of these, most notably the HOX genes, are known to have antisense or opposite strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This includes all supported splice variants, pseudogenes and both antisense and flanking lncRNAs. One of the areas where manual annotation has a significant advantage is the annotation of duplicated gene clusters. After comprehensive annotation of all homeobox genes and their antisense genes in human and in mouse, we found some discrepancies with the current gene set in RefSeq regarding exact gene structures and coding versus pseudogene locus biotype. We also identified previously un-annotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci, some of which are known to play a role in gene or gene cluster regulation, compared to their mouse orthologues. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%), while of the 277 orthologous mouse genes, only 62 protein-coding genes have an antisense locus (22%), based on publicly available transcriptional evidence.


Subject(s)
Databases, Nucleic Acid; Genome, Human; Homeodomain Proteins/genetics; Molecular Sequence Annotation/methods; Multigene Family; Pseudogenes; Animals; Helix-Turn-Helix Motifs; Humans; Mice; RNA, Long Noncoding/genetics