1.
Nucleic Acids Res ; 44(14): 6614-24, 2016 08 19.
Article in English | MEDLINE | ID: mdl-27342282

ABSTRACT

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.


Subject(s)
Genome, Bacterial , Molecular Sequence Annotation , Prokaryotic Cells/metabolism , Bacteria/genetics , Bacterial Proteins/chemistry , Databases, Nucleic Acid , Genes, Bacterial
2.
Nucleic Acids Res ; 44(D1): D73-80, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26578580

ABSTRACT

The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.
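
The contig N50 statistic mentioned above can be illustrated with a short sketch. This is not code from the Assembly database; the function name `contig_n50` is hypothetical, and it uses the common definition of N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    # Walk the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # total assembly length defines the N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    print(contig_n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Sorting first keeps the calculation independent of the order in which contigs are listed in an assembly report.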


Subject(s)
Databases, Nucleic Acid , Genomics , Animals , Genome , Humans , Internet , Mice
3.
Nucleic Acids Res ; 43(Database issue): D599-605, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25510495

ABSTRACT

The NCBI RefSeq genome collection (http://www.ncbi.nlm.nih.gov/genome) represents all three major domains of life (Eukarya, Bacteria and Archaea) as well as viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During 2014, more than 10,000 microbial genome assemblies were publicly released, bringing the total number of prokaryotic genomes close to 30,000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and the results of the pre-computed analysis, and improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to the RefSeq prokaryotic genome data processing pipeline, including the calculation of genome groups (clades) and the optimization of protein cluster generation using a pan-genome approach.


Subject(s)
Databases, Nucleic Acid , Genome, Archaeal , Genome, Bacterial , Internet , Molecular Sequence Annotation
4.
Nucleic Acids Res ; 43(Database issue): D36-42, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25355515

ABSTRACT

The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP.
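
As an illustration of the programmatic access described above, the following sketch builds an E-utilities ESearch request for the Gene database. The base URL and the `db`, `term`, `retmax` and `retmode` parameters come from NCBI's public E-utilities documentation rather than from this abstract; the helper name `gene_esearch_url` and the example query are hypothetical. The sketch only constructs the URL and performs no network request.

```python
from urllib.parse import urlencode

# Assumption: public E-utilities base URL, not stated in the abstract.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_esearch_url(term, retmax=20):
    """Build an ESearch URL that queries the Gene database (hypothetical helper)."""
    params = {
        "db": "gene",       # search the Gene database
        "term": term,       # Entrez query string
        "retmax": retmax,   # maximum number of identifiers to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative query only; adjust field tags to the search you need.
    print(gene_esearch_url("BRCA1[sym] AND human[orgn]"))
```

The identifiers returned by such a search are the unique, tracked integers that the abstract describes.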


Subject(s)
Databases, Genetic , Genes , Genetic Variation , Genomics , Internet , National Library of Medicine (U.S.) , Phenotype , United States
5.
BMC Bioinformatics ; 17 Suppl 8: 276, 2016 Aug 31.
Article in English | MEDLINE | ID: mdl-27586436

ABSTRACT

BACKGROUND: Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy. RESULTS: Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely-related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow limiting the protein set included in global clustering. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provides a robust representation and high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters. CONCLUSION: The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data to be obtained at different levels of detail and data redundancy to be eliminated while keeping biologically interesting variations.


Subject(s)
Bacterial Proteins/metabolism , Genome, Microbial , Algorithms , Cluster Analysis , Guanosine Triphosphate/metabolism , Humans , Phylogeny , Statistics as Topic
6.
Nucleic Acids Res ; 42(Database issue): D553-9, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24316578

ABSTRACT

The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.


Subject(s)
Databases, Genetic , Genome, Microbial , Molecular Sequence Annotation , Bacterial Proteins/genetics , Genome, Bacterial , Genomics/standards , Internet , Reference Standards
7.
Nucleic Acids Res ; 42(Database issue): D660-5, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24304891

ABSTRACT

Virus Variation (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/) is a comprehensive, web-based resource designed to support the retrieval and display of large virus sequence datasets. The resource includes a value added database, a specialized search interface and a suite of sequence data displays. Virus-specific sequence annotation and database loading pipelines produce consistent protein and gene annotation and capture sequence descriptors from sequence records then map these metadata to a controlled vocabulary. The database supports a metadata driven, web-based search interface where sequences can be selected using a variety of biological and clinical criteria. Retrieved sequences can then be downloaded in a variety of formats or analyzed using a suite of tools and displays. Over the past 2 years, the pre-existing influenza and Dengue virus resources have been combined into a single construct and West Nile virus added to the resultant resource. A number of improvements were incorporated into the sequence annotation and database loading pipelines, and the virus-specific search interfaces were updated to support more advanced functions. Several new features have also been added to the sequence download options, and a new multiple sequence alignment viewer has been incorporated into the resource tool set. Together these enhancements should support enhanced usability and the inclusion of new viruses in the future.


Subject(s)
Databases, Genetic , Viruses/genetics , Genes, Viral , Genome, Viral , Genomics , Internet , Molecular Sequence Annotation , Orthomyxoviridae/genetics , Sequence Alignment , Viral Proteins
8.
PLoS Biol ; 9(6): e1001088, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21713030

ABSTRACT

A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.


Subject(s)
Databases, Genetic , Genomics/standards , International Cooperation , Metagenome
9.
Arch Virol ; 159(12): 3293-304, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25119676

ABSTRACT

The number of viral genome sequences in the public databases is increasing dramatically, and these sequences are playing an important role in virus classification. Pairwise sequence comparison is a sequence-based virus classification method. A program using this method calculates the pairwise identities of virus sequences within a virus family and displays their distribution, and visual analysis helps to determine demarcations at different taxonomic levels such as strain, species, genus and subfamily. Subsequent comparison of new sequences against existing ones allows viruses from which the new sequences were derived to be classified. Although this method cannot be used as the only criterion for virus classification in some cases, it is a quantitative method and has many advantages over conventional virus classification methods. It has been applied to several virus families, and there is an increasing interest in using this method for other virus families/groups. The Pairwise Sequence Comparison (PASC) classification tool was created at the National Center for Biotechnology Information. The tool's database stores pairwise identities for complete genomes/segments of 56 virus families/groups. Data in the system are updated every day to reflect changes in virus taxonomy and additions of new virus sequences to the public database. The web interface of the tool ( http://www.ncbi.nlm.nih.gov/sutils/pasc/ ) makes it easy to navigate and perform analyses. Multiple new viral genome sequences can be tested simultaneously with this system to suggest the taxonomic position of virus isolates in a specific family. PASC eliminates potential discrepancies in the results caused by different algorithms and/or different data used by researchers.
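
The idea of computing pairwise identities within a family and examining their distribution can be sketched as follows. This is a simplified illustration, not the PASC implementation: it assumes the sequences have already been aligned to equal length with `-` as the gap character, and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity of two pre-aligned sequences (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        # Columns where both sequences have a gap carry no information.
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identities(aligned):
    """Map each unordered pair of sequence names to its percent identity."""
    return {
        (n1, n2): percent_identity(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


if __name__ == "__main__":
    demo = {"a": "ACGT", "b": "ACGA", "c": "ACGT"}
    for pair, identity in pairwise_identities(demo).items():
        print(pair, identity)
```

A histogram of the resulting values is the kind of distribution in which demarcations between taxonomic levels would be sought visually.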


Subject(s)
Computational Biology/methods , Genome, Viral , Internet , Sequence Homology , Viruses/classification , Viruses/genetics , National Institutes of Health (U.S.) , United States
10.
Nucleic Acids Res ; 40(Database issue): D130-5, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22121212

ABSTRACT

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10(6) genomic records, 13 × 10(6) proteins and 2 × 10(6) RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).


Subject(s)
Databases, Genetic , Molecular Sequence Annotation , Sequence Analysis/standards , Genomics/standards , Humans , Reference Standards , Sequence Analysis, DNA/standards , Sequence Analysis, Protein/standards , Sequence Analysis, RNA/standards
11.
Nucleic Acids Res ; 40(Database issue): D57-63, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139929

ABSTRACT

As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.


Subject(s)
Databases, Genetic , Genomics , Internet , Systems Integration , Transcriptome , User-Computer Interface
12.
Nucleic Acids Res ; 40(Database issue): D13-25, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22140104

ABSTRACT

In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


Subject(s)
Databases as Topic , Databases, Genetic , Databases, Protein , Gene Expression , Genomics , Internet , Models, Molecular , National Library of Medicine (U.S.) , Periodicals as Topic , PubMed , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, Protein , Sequence Analysis, RNA , Small Molecule Libraries , United States
13.
Nucleic Acids Res ; 39(Database issue): D52-7, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21115458

ABSTRACT

Entrez Gene (http://www.ncbi.nlm.nih.gov/gene) is the National Center for Biotechnology Information (NCBI) database for gene-specific information. Entrez Gene maintains records from genomes which have been completely sequenced, which have an active research community to submit gene-specific information, or which are scheduled for intense sequence analysis. The content represents the integration of curation and automated processing from NCBI's Reference Sequence project (RefSeq), collaborating model organism databases, consortia such as Gene Ontology and other databases within NCBI. Records in Entrez Gene are assigned unique, stable and tracked integers as identifiers. The content (nomenclature, genomic location, gene products and their attributes, markers, phenotypes and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities) and for bulk transfer by FTP.


Subject(s)
Databases, Genetic , Genes , Genomics , Internet , National Library of Medicine (U.S.) , United States , User-Computer Interface
14.
Nucleic Acids Res ; 39(14): 5837-44, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21470962

ABSTRACT

We describe a new program called cryptic splice finder (CSF) that can reliably identify cryptic splice sites (css), so providing a useful tool to help investigate splicing mutations in genetic disease. We report that many css are not entirely dormant and are often already active at low levels in normal genes prior to their enhancement in genetic disease. We also report a fascinating correlation between the positions of css and introns, whereby css within the exons of one species frequently match the exact position of introns in equivalent genes from another species. These results strongly indicate that many introns were inserted into css during evolution and they also imply that the splicing information that lies outside some introns can be independently recognized by the splicing machinery and was in place prior to intron insertion. This indicates that non-intronic splicing information had a key role in shaping the split structure of eukaryote genes.


Subject(s)
RNA Splice Sites , Software , Base Sequence , Consensus Sequence , Evolution, Molecular , Expressed Sequence Tags/chemistry , Genes , Genetic Diseases, Inborn/genetics , Genomics/methods , Humans , Introns , Sequence Alignment , Sequence Analysis, Protein
15.
Nucleic Acids Res ; 39(Database issue): D7-10, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21097465

ABSTRACT

The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.


Subject(s)
Databases, Factual/standards , Information Dissemination
16.
Nucleic Acids Res ; 39(Database issue): D38-51, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21097890

ABSTRACT

In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


Subject(s)
Databases, Genetic , Databases, Protein , Gene Expression , Genomics , National Library of Medicine (U.S.) , Protein Structure, Tertiary , PubMed , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, RNA , Software , Systems Integration , United States
17.
Nucleic Acids Res ; 38(Database issue): D5-16, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19910364

ABSTRACT

In addition to maintaining the GenBank nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, Reference Sequence, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Peptidome, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.


Subject(s)
Computational Biology/methods , Databases, Genetic , Databases, Nucleic Acid , Algorithms , Animals , Computational Biology/trends , Databases, Protein , Genome, Bacterial , Genome, Viral , Humans , Information Storage and Retrieval/methods , Internet , National Institutes of Health (U.S.) , National Library of Medicine (U.S.) , Software , United States
18.
Nature ; 437(7062): 1162-6, 2005 Oct 20.
Article in English | MEDLINE | ID: mdl-16208317

ABSTRACT

Influenza viruses are remarkably adept at surviving in the human population over a long timescale. The human influenza A virus continues to thrive even among populations with widespread access to vaccines, and continues to be a major cause of morbidity and mortality. The virus mutates from year to year, making the existing vaccines ineffective on a regular basis, and requiring that new strains be chosen for a new vaccine. Less-frequent major changes, known as antigenic shift, create new strains against which the human population has little protective immunity, thereby causing worldwide pandemics. The most recent pandemics include the 1918 'Spanish' flu, one of the most deadly outbreaks in recorded history, which killed 30-50 million people worldwide, the 1957 'Asian' flu, and the 1968 'Hong Kong' flu. Motivated by the need for a better understanding of influenza evolution, we have developed flexible protocols that make it possible to apply large-scale sequencing techniques to the highly variable influenza genome. Here we report the results of sequencing 209 complete genomes of the human influenza A virus, encompassing a total of 2,821,103 nucleotides. In addition to increasing markedly the number of publicly available, complete influenza virus genomes, we have discovered several anomalies in these first 209 genomes that demonstrate the dynamic nature of influenza transmission and evolution. This new, large-scale sequencing effort promises to provide a more comprehensive picture of the evolution of influenza viruses and of their pattern of transmission through human and animal populations. All data from this project are being deposited, without delay, in public archives.


Subject(s)
Evolution, Molecular , Genome, Viral , Influenza A virus/genetics , Influenza, Human/virology , Mutagenesis/genetics , Animals , Hemagglutinin Glycoproteins, Influenza Virus/genetics , Hemagglutinin Glycoproteins, Influenza Virus/immunology , History, 20th Century , History, 21st Century , Humans , Influenza A virus/classification , Influenza A virus/isolation & purification , Influenza A virus/physiology , Influenza Vaccines/history , Influenza Vaccines/immunology , Influenza, Human/epidemiology , Influenza, Human/transmission , Influenza, Human/veterinary , Mutation/genetics , Neuraminidase/genetics , Neuraminidase/metabolism , New York/epidemiology , Phylogeny , Public Sector , Reassortant Viruses/genetics , Sequence Analysis , Time Factors , Virus Replication
19.
Nucleic Acids Res ; 37(Database issue): D32-6, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18927115

ABSTRACT

NCBI's Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) is a curated non-redundant collection of sequences representing genomes, transcripts and proteins. RefSeq records integrate information from multiple sources and represent a current description of the sequence, the gene and sequence features. The database includes over 5300 organisms spanning prokaryotes, eukaryotes and viruses, with records for more than 5.5 x 10(6) proteins (RefSeq release 30). Feature annotation is applied by a combination of curation, collaboration, propagation from other sources and computation. We report here on the recent growth of the database, recent changes to feature annotations and record types for eukaryotic (primarily vertebrate) species and policies regarding species inclusion and genome annotation. In addition, we introduce RefSeqGene, a new initiative to support reporting variation data on a stable genomic coordinate system.


Subject(s)
Databases, Genetic , Sequence Analysis/standards , Animals , Exons , Genomics/standards , Humans , Mice , Proteins/chemistry , Pseudogenes , RNA, Untranslated/chemistry , Reference Standards
20.
Nucleic Acids Res ; 37(Database issue): D216-23, 2009 Jan.
Article in English | MEDLINE | ID: mdl-18940865

ABSTRACT

Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.


Subject(s)
Databases, Protein , Proteins/classification , Cluster Analysis , Genomics , Proteins/chemistry , Proteins/genetics , Sequence Homology, Amino Acid