Search | VHL Regional Portal

Results 1 - 10 of 10

A joint NCBI and EMBL-EBI transcript set for clinical genomics and research.

Morales, Joannella; Pujar, Shashikant; Loveland, Jane E; Astashyn, Alex; Bennett, Ruth; Berry, Andrew; Cox, Eric; Davidson, Claire; Ermolaeva, Olga; Farrell, Catherine M; Fatima, Reham; Gil, Laurent; Goldfarb, Tamara; Gonzalez, Jose M; Haddad, Diana; Hardy, Matthew; Hunt, Toby; Jackson, John; Joardar, Vinita S; Kay, Michael; Kodali, Vamsi K; McGarvey, Kelly M; McMahon, Aoife; Mudge, Jonathan M; Murphy, Daniel N; Murphy, Michael R; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Thibaud-Nissen, Françoise; Threadgold, Glen; Vatsan, Anjana R; Wallin, Craig; Webb, David; Flicek, Paul; Birney, Ewan; Pruitt, Kim D; Frankish, Adam; Cunningham, Fiona; Murphy, Terence D.

Nature ; 604(7905): 310-315, 2022 04.

Article in English | MEDLINE | ID: mdl-35388217

ABSTRACT

Comprehensive genome annotation is essential to understand the impact of clinically relevant variants. However, the absence of a standard for clinical reporting and browser display complicates the process of consistent interpretation and reporting. To address these challenges, Ensembl/GENCODE (ref. 1) and RefSeq (ref. 2) launched a joint initiative, the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration, to converge on human gene and transcript annotation and to jointly define a high-value set of transcripts and corresponding proteins. Here, we describe the MANE transcript sets for use as universal standards for variant reporting and browser display. The MANE Select set identifies a representative transcript for each human protein-coding gene, whereas the MANE Plus Clinical set provides additional transcripts at loci where the Select transcripts alone are not sufficient to report all currently known clinical variants. Each MANE transcript represents an exact match between the exonic sequences of an Ensembl/GENCODE transcript and its counterpart in RefSeq such that the identifiers can be used synonymously. We have now released MANE Select transcripts for 97% of human protein-coding genes, including all American College of Medical Genetics and Genomics Secondary Findings list v3.0 (ref. 3) genes. MANE transcripts are accessible from major genome browsers and key resources. Widespread adoption of these transcript sets will increase the consistency of reporting, facilitate the exchange of data regardless of the annotation source and help to streamline clinical interpretation.

Subject(s)

Computational Biology , Databases, Genetic , Genomics , Genome , Humans , Information Dissemination , Molecular Sequence Annotation , National Library of Medicine (U.S.) , United States

RefSeq curation and annotation of stop codon recoding in vertebrates.

Rajput, Bhanu; Pruitt, Kim D; Murphy, Terence D.

Nucleic Acids Res ; 47(2): 594-606, 2019 01 25.

Article in English | MEDLINE | ID: mdl-30535227

ABSTRACT

Recoding of stop codons as amino acid-specifying codons is a co-translational event that enables C-terminal extension of a protein. Synthesis of selenoproteins requires recoding of internal UGA stop codons to the 21st non-standard amino acid selenocysteine (Sec) and plays a vital role in human health and disease. Separately, canonical stop codons can be recoded to specify standard amino acids in a process known as stop codon readthrough (SCR), producing extended protein isoforms with potential novel functions. Conventional computational tools cannot distinguish between the dual functionality of stop codons as stop signals and sense codons, resulting in misannotation of selenoprotein gene products and failure to predict SCR. Manual curation is therefore required to correctly represent recoded gene products and their functions. Our goal was to provide accurately curated and annotated datasets of selenoprotein and SCR transcript and protein records to serve as annotation standards and to promote basic and biomedical research. Gene annotations were curated in nine vertebrate model organisms and integrated into NCBI's Reference Sequence (RefSeq) dataset, resulting in 247 selenoprotein genes encoding 322 selenoproteins, and 93 genes exhibiting SCR encoding 94 SCR isoforms.

Subject(s)

Codon, Terminator , Data Curation , Databases, Genetic , Molecular Sequence Annotation , Selenoproteins/genetics , Vertebrates/genetics , Animals , Cattle , Humans , Mice , Proteome , Rats

Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; Pruitt, Kim D.

Nucleic Acids Res ; 46(D1): D221-D228, 2018 01 04.

Article in English | MEDLINE | ID: mdl-29126148

ABSTRACT

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.

Subject(s)

Consensus Sequence , Databases, Genetic , Open Reading Frames , Animals , Data Curation/methods , Data Curation/standards , Databases, Genetic/standards , Guidelines as Topic , Humans , Mice , Molecular Sequence Annotation , National Library of Medicine (U.S.) , United States , User-Computer Interface

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi.

Nucleic Acids Res ; 44(D1): D733-45, 2016 Jan 04.

Article in English | MEDLINE | ID: mdl-26553804

ABSTRACT

The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

Subject(s)

Databases, Genetic , Genomics , Animals , Cattle , Gene Expression Profiling , Genome, Fungal , Genome, Human , Genome, Microbial , Genome, Plant , Genome, Viral , Genomics/standards , Humans , Invertebrates/genetics , Mice , Molecular Sequence Annotation , Nematoda/genetics , Phylogeny , RNA, Long Noncoding/genetics , Rats , Reference Standards , Sequence Analysis, Protein , Sequence Analysis, RNA , Vertebrates/genetics

Mouse genome annotation by the RefSeq project.

McGarvey, Kelly M; Goldfarb, Tamara; Cox, Eric; Farrell, Catherine M; Gupta, Tripti; Joardar, Vinita S; Kodali, Vamsi K; Murphy, Michael R; O'Leary, Nuala A; Pujar, Shashikant; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Webb, David; Wright, Mathew W; Murphy, Terence D; Pruitt, Kim D.

Mamm Genome ; 26(9-10): 379-90, 2015 Oct.

Article in English | MEDLINE | ID: mdl-26215545

ABSTRACT

Complete and accurate annotation of the mouse genome is critical to the advancement of research conducted on this important model organism. The National Center for Biotechnology Information (NCBI) develops and maintains many useful resources to assist the mouse research community. In particular, the reference sequence (RefSeq) database provides high-quality annotation of multiple mouse genome assemblies using a combinatorial approach that leverages computation, manual curation, and collaboration. Implementation of this conservative and rigorous approach, which focuses on representation of only full-length and non-redundant data, produces high-quality annotation products. RefSeq records explicitly link sequences to current knowledge in a timely manner, updating public records regularly and rapidly in response to nomenclature updates, addition of new relevant publications, collaborator discussion, and user feedback. Whole genome re-annotation is also conducted at least every 12-18 months, and often more frequently in response to assembly updates or availability of informative data. This article highlights key features and advantages of RefSeq genome annotation products and presents an overview of NCBI processes to generate these data. Further discussion of NCBI's resources highlights useful features and the best methods for accessing our data.

Subject(s)

Amino Acid Sequence/genetics , Databases, Genetic , Databases, Nucleic Acid , Genome , Animals , Internet , Mice

RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates.

Rajput, Bhanu; Murphy, Terence D; Pruitt, Kim D.

Nucleic Acids Res ; 43(15): 7270-9, 2015 Sep 03.

Article in English | MEDLINE | ID: mdl-26170238

ABSTRACT

Polyamines are ubiquitous cations that are involved in regulating fundamental cellular processes such as cell growth and proliferation; hence, their intracellular concentration is tightly regulated. Antizyme and antizyme inhibitor have a central role in maintaining cellular polyamine levels. Antizyme is unique in that it is expressed via a novel programmed ribosomal frameshifting mechanism. Conventional computational tools are unable to predict a programmed frameshift, resulting in misannotation of antizyme transcripts and proteins on transcript and genomic sequences. Correct annotation of a programmed frameshifting event requires manual evaluation. Our goal was to provide an accurately curated and annotated Reference Sequence (RefSeq) data set of antizyme transcript and protein records across a broad taxonomic scope that would serve as standards for accurate representation of these gene products. As antizyme and antizyme inhibitor proteins are functionally connected, we also curated antizyme inhibitor genes to more fully represent the elegant biology of polyamine regulation. Manual review of genes for three members of the antizyme family and two members of the antizyme inhibitor family in 91 vertebrate organisms resulted in a total of 461 curated RefSeq records.

Subject(s)

Carrier Proteins/genetics , Data Curation , Databases, Genetic , Molecular Sequence Annotation , Proteins/genetics , Vertebrates/genetics , Animals , Frameshifting, Ribosomal , Humans , Mice , Multigene Family , Rats

Current status and new features of the Consensus Coding Sequence database.

Farrell, Catherine M; O'Leary, Nuala A; Harte, Rachel A; Loveland, Jane E; Wilming, Laurens G; Wallin, Craig; Diekhans, Mark; Barrell, Daniel; Searle, Stephen M J; Aken, Bronwen; Hiatt, Susan M; Frankish, Adam; Suner, Marie-Marthe; Rajput, Bhanu; Steward, Charles A; Brown, Garth R; Bennett, Ruth; Murphy, Michael; Wu, Wendy; Kay, Mike P; Hart, Jennifer; Rajan, Jeena; Weber, Janet; Snow, Catherine; Riddick, Lillian D; Hunt, Toby; Webb, David; Thomas, Mark; Tamez, Pamela; Rangwala, Sanjida H; McGarvey, Kelly M; Pujar, Shashikant; Shkeda, Andrei; Mudge, Jonathan M; Gonzalez, Jose M; Gilbert, James G R; Trevanion, Stephen J; Baertsch, Robert; Harrow, Jennifer L; Hubbard, Tim; Ostell, James M; Haussler, David; Pruitt, Kim D.

Nucleic Acids Res ; 42(Database issue): D865-72, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24217909

ABSTRACT

The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.

Subject(s)

Databases, Genetic , Proteins/genetics , Animals , Exons , Genomics , Humans , Internet , Mice , Molecular Sequence Annotation , Sequence Analysis

RefSeq: an update on mammalian reference sequences.

Pruitt, Kim D; Brown, Garth R; Hiatt, Susan M; Thibaud-Nissen, Françoise; Astashyn, Alexander; Ermolaeva, Olga; Farrell, Catherine M; Hart, Jennifer; Landrum, Melissa J; McGarvey, Kelly M; Murphy, Michael R; O'Leary, Nuala A; Pujar, Shashikant; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Shkeda, Andrei; Sun, Hanzhen; Tamez, Pamela; Tully, Raymond E; Wallin, Craig; Webb, David; Weber, Janet; Wu, Wendy; DiCuccio, Michael; Kitts, Paul; Maglott, Donna R; Murphy, Terence D; Ostell, James M.

Nucleic Acids Res ; 42(Database issue): D756-63, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24259432

ABSTRACT

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI's eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI's eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.

Subject(s)

Databases, Genetic , Genomics , Mammals/genetics , Animals , Eukaryota/genetics , Exons , Genome , Genomics/standards , Humans , Internet , Molecular Sequence Annotation , Proteins/chemistry , Proteins/genetics , RNA/chemistry , Reference Standards

The completion of the Mammalian Gene Collection (MGC).

Temple, Gary; Gerhard, Daniela S; Rasooly, Rebekah; Feingold, Elise A; Good, Peter J; Robinson, Cristen; Mandich, Allison; Derge, Jeffrey G; Lewis, Jeanne; Shoaf, Debonny; Collins, Francis S; Jang, Wonhee; Wagner, Lukas; Shenmen, Carolyn M; Misquitta, Leonie; Schaefer, Carl F; Buetow, Kenneth H; Bonner, Tom I; Yankie, Linda; Ward, Ming; Phan, Lon; Astashyn, Alex; Brown, Garth; Farrell, Catherine; Hart, Jennifer; Landrum, Melissa; Maidak, Bonnie L; Murphy, Michael; Murphy, Terence; Rajput, Bhanu; Riddick, Lillian; Webb, David; Weber, Janet; Wu, Wendy; Pruitt, Kim D; Maglott, Donna; Siepel, Adam; Brejova, Brona; Diekhans, Mark; Harte, Rachel; Baertsch, Robert; Kent, Jim; Haussler, David; Brent, Michael; Langton, Laura; Comstock, Charles L G; Stevens, Michael; Wei, Chaochun; van Baren, Marijke J; Salehi-Ashtiani, Kourosh.

Genome Res ; 19(12): 2324-33, 2009 Dec.

Article in English | MEDLINE | ID: mdl-19767417

ABSTRACT

Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide.

Subject(s)

Cloning, Molecular/methods , Computational Biology/methods , DNA, Complementary/genetics , Gene Library , Genes/genetics , Mammals/genetics , Animals , DNA/biosynthesis , Humans , Mice , National Institutes of Health (U.S.) , Rats , Reverse Transcriptase Polymerase Chain Reaction , United States

The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Pruitt, Kim D; Harrow, Jennifer; Harte, Rachel A; Wallin, Craig; Diekhans, Mark; Maglott, Donna R; Searle, Steve; Farrell, Catherine M; Loveland, Jane E; Ruef, Barbara J; Hart, Elizabeth; Suner, Marie-Marthe; Landrum, Melissa J; Aken, Bronwen; Ayling, Sarah; Baertsch, Robert; Fernandez-Banet, Julio; Cherry, Joshua L; Curwen, Val; Dicuccio, Michael; Kellis, Manolis; Lee, Jennifer; Lin, Michael F; Schuster, Michael; Shkeda, Andrew; Amid, Clara; Brown, Garth; Dukhanina, Oksana; Frankish, Adam; Hart, Jennifer; Maidak, Bonnie L; Mudge, Jonathan; Murphy, Michael R; Murphy, Terence; Rajan, Jeena; Rajput, Bhanu; Riddick, Lillian D; Snow, Catherine; Steward, Charles; Webb, David; Weber, Janet A; Wilming, Laurens; Wu, Wenyu; Birney, Ewan; Haussler, David; Hubbard, Tim; Ostell, James; Durbin, Richard; Lipman, David.

Genome Res ; 19(7): 1316-23, 2009 Jul.

Article in English | MEDLINE | ID: mdl-19498102

ABSTRACT

Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.

Subject(s)

Consensus Sequence , Genome , Open Reading Frames/genetics , Animals , Humans , Mice , Sequence Alignment
