ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) assigns unique symbols and names to human genes. The HGNC database (www.genenames.org) currently contains over 43 000 approved gene symbols, over 19 200 of which are assigned to protein-coding genes, 14 000 to pseudogenes and nearly 9000 to non-coding RNA genes. The public website, www.genenames.org, displays all approved nomenclature within Symbol Reports that contain data curated by HGNC nomenclature advisors and links to related genomic, clinical, and proteomic information. Here, we describe updates to our resource, including improvements to our search facility and new download features.
Subject(s)
Genetic Databases, Humans, Genome, Genomics, Proteomics, Pseudogenes, Terminology as Topic
ABSTRACT
Research on non-coding RNA (ncRNA) is a rapidly expanding field. Providing an official gene symbol and name to ncRNA genes brings order to otherwise potential chaos as it allows unambiguous communication about each gene. The HUGO Gene Nomenclature Committee (HGNC, www.genenames.org) is the only group with the authority to approve symbols for human genes. The HGNC works with specialist advisors for different classes of ncRNA to ensure that ncRNA nomenclature is accurate and informative, where possible. Here, we review each major class of ncRNA that is currently annotated in the human genome and describe how each class is assigned a standardised nomenclature.
Subject(s)
Human Genome/genetics, Untranslated RNA/classification, Terminology as Topic, Humans, Untranslated RNA/genetics
ABSTRACT
The use of approved nomenclature in publications is vital to enable effective scientific communication and is particularly crucial when discussing genes of clinical relevance. Here, we discuss several examples of cases where the failure of researchers to use a HUGO Gene Nomenclature Committee (HGNC)-approved symbol in publications has led to confusion between unrelated human genes in the literature. We also inform authors of the steps they can take to ensure that they use approved nomenclature in their manuscripts and discuss how referencing HGNC IDs can remove ambiguity when referring to genes that have previously been published with confusing alias symbols.
Subject(s)
Genetic Databases/standards, Genes/genetics, Human Genome, Research Personnel/standards, Terminology as Topic, Genomics, Humans
ABSTRACT
Multiple resources currently exist that predict orthologous relationships between genes. These resources differ both in the methodologies used and in the species they make predictions for. The HGNC Comparison of Orthology Predictions (HCOP) search tool integrates and displays data from multiple ortholog prediction resources for a specified human gene or set of genes. An indication of the reliability of a prediction is provided by the number of resources that support it. HCOP was originally designed to show orthology predictions between human and mouse but has been expanded to include data from a current total of 20 selected vertebrate and model organism species. The HCOP pipeline used to fetch and integrate the information from the disparate ortholog and nomenclature data resources has recently been rewritten, both to enable the inclusion of new data and to take advantage of modern web technologies. Data from HCOP are used extensively in our work naming genes as the Vertebrate Gene Nomenclature Committee (https://vertebrate.genenames.org).
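The support-count idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration only, not HCOP's actual pipeline or data format: the gene, ortholog and resource names are hypothetical placeholders, and the `support_counts` and `ranked` helpers are assumptions introduced for the example.

```python
from collections import defaultdict

# Hypothetical (human gene, predicted ortholog, resource) assertions.
# These are placeholders, not real HCOP output.
predictions = [
    ("GENE_A", "ortholog_1", "resource_1"),
    ("GENE_A", "ortholog_1", "resource_2"),
    ("GENE_A", "ortholog_1", "resource_3"),
    ("GENE_A", "ortholog_2", "resource_1"),
]


def support_counts(rows):
    """Count the distinct resources supporting each (human gene, ortholog) pair."""
    supporters = defaultdict(set)
    for human_gene, ortholog, resource in rows:
        supporters[(human_gene, ortholog)].add(resource)
    return {pair: len(resources) for pair, resources in supporters.items()}


def ranked(rows):
    """Order pairs from most to least supported; ties are broken by name."""
    counts = support_counts(rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for pair, n_resources in ranked(predictions):
    print(pair, n_resources)
```

A set is used per pair so that a resource reporting the same prediction twice is counted only once.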
Subject(s)
Computational Biology/methods, Genomics/methods, Sequence Homology, Software, Animals, Genetic Databases, Humans, Vertebrates, Web Browser, Workflow
ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) is the sole group with the authority to approve symbols for human genes, including long non-coding RNA (lncRNA) genes. Use of approved symbols ensures that publications and biomedical databases are easily searchable and reduces the risks of confusion that can be caused by using the same symbol to refer to different genes or using many different symbols for the same gene. Here, we describe how the HGNC names lncRNA genes and review the nomenclature of the seven lncRNA genes most mentioned in the scientific literature.
Subject(s)
Long Noncoding RNA, Humans, Long Noncoding RNA/genetics, Genetic Databases
ABSTRACT
The HUGO Gene Nomenclature Committee assigns unique symbols and names to human genes. The use of approved nomenclature enables effective communication between researchers, and there are multiple examples of how the usage of unapproved alias symbols can lead to confusion. We discuss here a recent nomenclature update (May 2022) for a set of genes that encode proteins with a shared repeating β-groove domain. Some of the proteins encoded by genes in this group have already been shown to function as lipid transporters. By working with researchers in the field, we have been able to introduce a new root symbol (BLTP, which stands for "bridge-like lipid transfer protein") for this domain-based gene group. This new nomenclature not only reflects the shared domain in these proteins, but also takes into consideration the mounting evidence of a shared lipid transport function.
Subject(s)
Lipids, Humans
ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) has been providing standardized symbols and names for human genes since the late 1970s. As funding agencies change their priorities, finding financial support for critical biomedical resources such as the HGNC becomes ever more challenging. In this article, we outline the key roles the HGNC currently plays in aiding communication and the need for these activities to be maintained.
Subject(s)
Genetic Databases, Genomics, Humans
ABSTRACT
Intermediate filament (IntFil) genes arose during early metazoan evolution, to provide mechanical support for plasma membranes contacting/interacting with other cells and the extracellular matrix. Keratin genes comprise the largest subset of IntFil genes. Whereas the first keratin gene appeared in sponge, and three genes in arthropods, more rapid increases in keratin genes occurred in lungfish and amphibian genomes, concomitant with land animal-sea animal divergence (~440 to 410 million years ago). Human, mouse and zebrafish genomes contain 18, 17 and 24 non-keratin IntFil genes, respectively. Human has 27 of 28 type I "acidic" keratin genes clustered at chromosome (Chr) 17q21.2, and all 26 type II "basic" keratin genes clustered at Chr 12q13.13. Mouse has 27 of 28 type I keratin genes clustered on Chr 11, and all 26 type II clustered on Chr 15. Zebrafish has 18 type I keratin genes scattered on five chromosomes, and 3 type II keratin genes on two chromosomes. Types I and II keratin clusters, reflecting evolutionary blooms of keratin genes along one chromosomal segment, are found in all land animal genomes examined, but not fishes; such rapid gene expansions likely reflect sudden requirements for many novel paralogous proteins having divergent functions to enhance species survival following sea-to-land transition. Using data from the Genotype-Tissue Expression (GTEx) project, tissue-specific keratin expression throughout the human body was reconstructed. Clustering of gene expression patterns revealed similarities in tissue-specific expression patterns for previously described "keratin pairs" (i.e., KRT1/KRT10, KRT8/KRT18, KRT5/KRT14, KRT6/KRT16 and KRT6/KRT17 proteins). The ClinVar database currently lists 26 human disease-causing variants within the various domains of keratin proteins.
Subject(s)
Keratins, Zebrafish, Animals, Genome, Keratins/genetics, Type I Keratins/genetics, Mice
ABSTRACT
Following the draft sequence of the first human genome over 20 years ago, we have achieved unprecedented insights into the rules governing its evolution, often with direct translational relevance to specific diseases. However, staggering sequence complexity has also challenged the development of a more comprehensive understanding of human genome biology. In this context, interspecific genomic studies between humans and other animals have played a critical role in our efforts to decode human gene families. In this review, we focus on how the rapid surge of genome sequencing of both model and non-model organisms now provides a broader comparative framework poised to empower novel discoveries. We begin with a general overview of how comparative approaches are essential for understanding gene family evolution in the human genome, followed by a discussion of analyses of gene expression. We show how homology can provide insights into the genes and gene families associated with immune response, cancer biology, vision, chemosensation, and metabolism, by revealing similarity in processes among distant species. We then explain methodological tools that provide critical advances and show the limitations of common approaches. We conclude with a discussion of how these investigations position us to gain fundamental insights into the evolution of gene families among living organisms in general. We hope that our review catalyzes additional excitement and research on the emerging field of comparative genomics, while aiding the placement of the human genome into its existentially evolutionary context.
Subject(s)
Molecular Evolution, Genomics, Animals, Humans, Genome, Base Sequence, Phylogeny
ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) based at EMBL's European Bioinformatics Institute (EMBL-EBI) assigns unique symbols and names to human genes. There are over 42 000 approved gene symbols in our current database, of which over 19 000 are for protein-coding genes. While we still update placeholder and problematic symbols, we are working towards stabilizing symbols where possible; over 2000 symbols for disease-associated genes are now marked as stable in our symbol reports. All of our data is available at the HGNC website https://www.genenames.org. The Vertebrate Gene Nomenclature Committee (VGNC) was established to assign standardized nomenclature in line with human for vertebrate species lacking their own nomenclature committee. In addition to the previous VGNC core species of chimpanzee, cow, horse and dog, we now name genes in cat, macaque and pig. Gene groups have been added to VGNC and currently include two complex families: olfactory receptors (ORs) and cytochrome P450s (CYPs). In collaboration with specialists we have also named CYPs in species beyond our core set. All VGNC data is available at https://vertebrate.genenames.org/. This article provides an overview of our online data and resources, focusing on updates over the last two years.
Subject(s)
Computational Biology/methods, Genetic Databases, Genes/genetics, Genomics/methods, Terminology as Topic, Vertebrates/genetics, Animals, Humans, Internet, Proteins/genetics, Species Specificity, User-Computer Interface, Vertebrates/classification
ABSTRACT
PURPOSE: Several groups and resources provide information that pertains to the validity of gene-disease relationships used in genomic medicine and research; however, universal standards and terminologies to define the evidence base for the role of a gene in disease and a single harmonized resource were lacking. To tackle this issue, the Gene Curation Coalition (GenCC) was formed. METHODS: The GenCC drafted harmonized definitions for differing levels of gene-disease validity on the basis of existing resources, and performed a modified Delphi survey with 3 rounds to narrow the list of terms. The GenCC also developed a unified database to display curated gene-disease validity assertions from its members. RESULTS: On the basis of 241 survey responses from the genetics community, a consensus term set was chosen for grading gene-disease validity and database submissions. As of December 2021, the database contained 15,241 gene-disease assertions on 4569 unique genes from 12 submitters. When comparing submissions to the database from distinct sources, conflicts in assertions of gene-disease validity ranged from 5.3% to 13.4%. CONCLUSION: Terminology standardization, sharing of gene-disease validity classifications, and resolution of curation conflicts will facilitate collaborations across international curation efforts and in turn, improve consistency in genetic testing and variant interpretation.
Subject(s)
Genetic Databases, Genomics, Genetic Testing, Genetic Variation, Humans
ABSTRACT
Paired-box (PAX) genes encode a family of highly conserved transcription factors found in vertebrates and invertebrates. PAX proteins are defined by the presence of a paired domain that is evolutionarily conserved across phylogenies. Inclusion of a homeodomain and/or an octapeptide linker subdivides PAX proteins into four groups. Often termed "master regulators", PAX proteins orchestrate tissue and organ development throughout cell differentiation and lineage determination, and are essential for tissue structure and function through maintenance of cell identity. Mutations in PAX genes are associated with myriad human diseases (e.g., microphthalmia, anophthalmia, coloboma, hypothyroidism, acute lymphoblastic leukemia). Transcriptional regulation by PAX proteins is, in part, modulated by expression of alternatively spliced transcripts. Herein, we provide a genomics update on the nine human PAX family members and PAX homologs in 16 additional species. We also present a comprehensive summary of human tissue-specific PAX transcript variant expression and describe potential functional significance of PAX isoforms. While the functional roles of PAX proteins in developmental diseases and cancer are well characterized, much remains to be understood regarding the functional roles of PAX isoforms in human health. We anticipate the analysis of tissue-specific PAX transcript variant expression presented herein can serve as a starting point for such research endeavors.
Subject(s)
Genetic Predisposition to Disease, Paired Box Transcription Factors/genetics, Alternative Splicing, Animals, Chromosome Mapping, Molecular Evolution, Humans, Phylogeny, Messenger RNA/genetics, Genetic Transcription
ABSTRACT
Lipocalins (LCNs) are members of a family of evolutionarily conserved genes present in all kingdoms of life. There are 19 LCN-like genes in the human genome, and 45 Lcn-like genes in the mouse genome, which include 22 major urinary protein (Mup) genes. The Mup genes, plus 29 of 30 Mup-ps pseudogenes, are all located together on chromosome (Chr) 4; evidence points to an "evolutionary bloom" that resulted in this Mup cluster in mouse, syntenic to the human Chr 9q32 locus at which a single MUPP pseudogene is located. LCNs play important roles in physiological processes by binding and transporting small hydrophobic molecules (such as steroid hormones, odorants, retinoids, and lipids) in plasma and other body fluids. LCNs are extensively used in clinical practice as biochemical markers. LCN-like proteins (18-40 kDa) have the characteristic eight β-strands creating a barrel structure that houses the binding site; LCNs are synthesized in the liver as well as various secretory tissues. In rodents, MUPs are involved in communication of information in urine-derived scent marks, serving as signatures of individual identity, or as kairomones (to elicit fear behavior). MUPs also participate in regulation of glucose and lipid metabolism via a mechanism not well understood. Although much has been learned about LCNs and MUPs in recent years, more research is necessary to allow better understanding of their physiological functions, as well as their involvement in clinical disorders.
Subject(s)
Molecular Evolution, Lipocalins/genetics, Animals, Human Genome, Humans, Lipocalins/metabolism, Mice, Multigene Family
ABSTRACT
The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
Subject(s)
Consensus Sequence, Genetic Databases, Open Reading Frames, Animals, Data Curation/methods, Data Curation/standards, Genetic Databases/standards, Guidelines as Topic, Humans, Mice, Molecular Sequence Annotation, National Library of Medicine (U.S.), United States, User-Computer Interface
ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) based at the European Bioinformatics Institute (EMBL-EBI) assigns unique symbols and names to human genes. Currently the HGNC database contains almost 40 000 approved gene symbols, over 19 000 of which represent protein-coding genes. In addition to naming genomic loci we manually curate genes into family sets based on shared characteristics such as homology, function or phenotype. We have recently updated our gene family resources and introduced new improved visualizations which can be seen alongside our gene symbol reports on our primary website http://www.genenames.org. In 2016 we expanded our remit and formed the Vertebrate Gene Nomenclature Committee (VGNC) which is responsible for assigning names to vertebrate species lacking a dedicated nomenclature group. Using the chimpanzee genome as a pilot project we have approved symbols and names for over 14 500 protein-coding genes in chimpanzee, and have developed a new website http://vertebrate.genenames.org to distribute these data. Here, we review our online data and resources, focusing particularly on the improvements and new developments made during the last two years.
Subject(s)
Genetic Databases, Genes, Genome, Genomics/methods, Terminology as Topic, Vertebrates, Web Browser, Animals, Humans, Multigene Family, Search Engine
ABSTRACT
RNAcentral is a database of non-coding RNA (ncRNA) sequences that aggregates data from specialised ncRNA resources and provides a single entry point for accessing ncRNA sequences of all ncRNA types from all organisms. Since its launch in 2014, RNAcentral has integrated twelve new resources, taking the total number of collaborating databases to 22, and has begun importing new types of data, such as modified nucleotides from MODOMICS and PDB. We created new species-specific identifiers that refer to unique RNA sequences within the context of a single species. The website has been subject to continuous improvements focusing on text and sequence similarity searches as well as genome browsing functionality. All RNAcentral data is provided for free and is available for browsing, bulk downloads, and programmatic access at http://rnacentral.org/.
Subject(s)
Nucleic Acid Databases, Untranslated RNA/chemistry, Animals, Genomics, Humans, Nucleotides/chemistry, RNA Sequence Analysis, Species Specificity
Subject(s)
Proteins/classification, RNA/genetics, Terminology as Topic, Humans, Proteins/genetics, Proteins/standards
ABSTRACT
The human genome contains 25 genes coding for selenocysteine-containing proteins (selenoproteins). These proteins are involved in a variety of functions, most notably redox homeostasis. Selenoprotein enzymes with known functions are designated according to these functions: TXNRD1, TXNRD2, and TXNRD3 (thioredoxin reductases), GPX1, GPX2, GPX3, GPX4, and GPX6 (glutathione peroxidases), DIO1, DIO2, and DIO3 (iodothyronine deiodinases), MSRB1 (methionine sulfoxide reductase B1), and SEPHS2 (selenophosphate synthetase 2). Selenoproteins without known functions have traditionally been denoted by SEL or SEP symbols. However, these symbols are sometimes ambiguous and conflict with the approved nomenclature for several other genes. Therefore, there is a need to implement a rational and coherent nomenclature system for selenoprotein-encoding genes. Our solution is to use the root symbol SELENO followed by a letter. This nomenclature applies to SELENOF (selenoprotein F, the 15-kDa selenoprotein, SEP15), SELENOH (selenoprotein H, SELH, C11orf31), SELENOI (selenoprotein I, SELI, EPT1), SELENOK (selenoprotein K, SELK), SELENOM (selenoprotein M, SELM), SELENON (selenoprotein N, SEPN1, SELN), SELENOO (selenoprotein O, SELO), SELENOP (selenoprotein P, SeP, SEPP1, SELP), SELENOS (selenoprotein S, SELS, SEPS1, VIMP), SELENOT (selenoprotein T, SELT), SELENOV (selenoprotein V, SELV), and SELENOW (selenoprotein W, SELW, SEPW1). This system, approved by the HUGO Gene Nomenclature Committee, also resolves conflicting, missing, and ambiguous designations for selenoprotein genes and is applicable to selenoproteins across vertebrates.
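The renaming scheme listed in this abstract can be expressed as a simple lookup table. The sketch below is illustrative only: the symbol pairs are copied from the abstract, while the `to_approved` helper and the case-insensitive matching are assumptions made for the example. Because the abstract notes that some of the older symbols conflict with approved symbols for other genes, a lookup like this is only meaningful for text already known to concern selenoproteins.

```python
# Approved SELENO symbols and the earlier designations given in the abstract.
SELENO_ALIASES = {
    "SELENOF": ["SEP15"],
    "SELENOH": ["SELH", "C11orf31"],
    "SELENOI": ["SELI", "EPT1"],
    "SELENOK": ["SELK"],
    "SELENOM": ["SELM"],
    "SELENON": ["SEPN1", "SELN"],
    "SELENOO": ["SELO"],
    "SELENOP": ["SeP", "SEPP1", "SELP"],
    "SELENOS": ["SELS", "SEPS1", "VIMP"],
    "SELENOT": ["SELT"],
    "SELENOV": ["SELV"],
    "SELENOW": ["SELW", "SEPW1"],
}

# Reverse index: upper-cased old symbol -> approved symbol.
ALIAS_TO_APPROVED = {
    alias.upper(): approved
    for approved, aliases in SELENO_ALIASES.items()
    for alias in aliases
}


def to_approved(symbol):
    """Return the SELENO symbol for an old or approved symbol, else None.

    Hypothetical helper: matching is case-insensitive by assumption.
    """
    key = symbol.upper()
    if key in SELENO_ALIASES:
        return key
    return ALIAS_TO_APPROVED.get(key)


print(to_approved("SEPP1"))
print(to_approved("VIMP"))
```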
Subject(s)
Selenoproteins/classification, Selenoproteins/genetics, Humans, Terminology as Topic
ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) approves unique gene symbols and names for human loci. As well as naming genomic loci, we manually curate genes into family sets based on shared characteristics such as function, homology or phenotype. Each HGNC gene family has its own dedicated gene family report on our website, www.genenames.org. We have recently redesigned these reports to support the visualisation and browsing of complex relationships between families and to provide extra curated information such as family descriptions, protein domain graphics and gene family aliases. Here, we review how our gene families are curated and explain how to view, search and download the gene family data.