Search | BVS Regional Portal

1.

Complete Chloroplast Genome of Argania spinosa: Structural Organization and Phylogenetic Relationships in Sapotaceae.

Khayi, Slimane; Gaboun, Fatima; Pirro, Stacy; Tatusova, Tatiana; El Mousadik, Abdelhamid; Ghazal, Hassan; Mentag, Rachid.

Plants (Basel) ; 9(10), 2020 Oct 13.

Article in English | MEDLINE | ID: mdl-33066261

ABSTRACT

Argania spinosa (Sapotaceae), an important endemic Moroccan oil tree, is a primary source of argan oil, which has numerous dietary and medicinal properties. The plant species occupies the mid-western part of Morocco and provides great environmental and socioeconomic benefits. The complete chloroplast (cp) genome of A. spinosa was sequenced, assembled, and analyzed in comparison with those of two Sapotaceae members. The A. spinosa cp genome is 158,848 bp long, with an average GC content of 36.8%. The cp genome exhibits a typical quadripartite and circular structure consisting of a pair of inverted regions (IR) of 25,945 bp in length separating small single-copy (SSC) and large single-copy (LSC) regions of 18,591 and 88,367 bp, respectively. The annotation of the A. spinosa cp genome predicted 130 genes, including 85 protein-coding genes (CDS), 8 ribosomal RNA (rRNA) genes, and 37 transfer RNA (tRNA) genes. A total of 44 long repeats and 88 simple sequence repeats (SSR) divided into mononucleotides (76), dinucleotides (7), trinucleotides (3), tetranucleotides (1), and hexanucleotides (1) were identified in the A. spinosa cp genome. Phylogenetic analyses using the maximum likelihood (ML) method were performed based on 69 protein-coding genes from 11 species of Ericales. The results confirmed the close position of A. spinosa to the Sideroxylon genus, supporting the revisiting of its taxonomic status. The complete chloroplast genome sequence will be valuable for further studies on the conservation and breeding of this medicinally and culinarily important species and will also contribute to clarifying the phylogenetic position of the species within Sapotaceae.
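
The region lengths and counts reported in this abstract can be checked against the stated totals. The following is a minimal arithmetic sketch in Python that uses only the figures quoted above; the variable names are illustrative and not taken from the paper.

```python
# Region lengths in base pairs, as reported in the abstract.
LSC = 88_367   # large single-copy region
SSC = 18_591   # small single-copy region
IR = 25_945    # one inverted repeat; a quadripartite genome carries two copies

genome_length = LSC + SSC + 2 * IR

# Simple sequence repeat counts by motif length, as reported in the abstract.
ssr_counts = {"mono": 76, "di": 7, "tri": 3, "tetra": 1, "hexa": 1}
ssr_total = sum(ssr_counts.values())

# Gene counts by class: protein-coding, rRNA, tRNA.
gene_total = 85 + 8 + 37

print(genome_length, ssr_total, gene_total)  # prints: 158848 88 130
```

All three sums agree with the totals given in the abstract (158,848 bp, 88 SSRs, 130 genes).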

2.

First draft genome assembly of the Argane tree (Argania spinosa).

Khayi, Slimane; Azza, Nour Elhouda; Gaboun, Fatima; Pirro, Stacy; Badad, Oussama; Claros, M Gonzalo; Lightfoot, David A; Unver, Turgay; Chaouni, Bouchra; Merrouch, Redouane; Rahim, Bouchra; Essayeh, Soumaya; Ganoudi, Matika; Abdelwahd, Rabha; Diria, Ghizlane; Mdarhi, Meriem Alaoui; Labhilili, Mustapha; Iraqi, Driss; Mouhaddab, Jamila; Sedrati, Hayat; Memari, Majid; Hamamouch, Noureddine; Alché, Juan de Dios; Boukhatem, Noureddine; Mrabet, Rachid; Dahan, Rachid; Legssyer, Adelkhaleq; Khalfaoui, Mohamed; Badraoui, Mohamed; Van de Peer, Yves; Tatusova, Tatiana; El Mousadik, Abdelhamid; Mentag, Rachid; Ghazal, Hassan.

F1000Res ; 7: 1310, 2018.

Article in English | MEDLINE | ID: mdl-32509273

ABSTRACT

Background: The Argane tree (Argania spinosa L. Skeels) is an endemic tree of southwestern Morocco that plays an important socioeconomic and ecological role for a dense human population in an arid zone. Several studies confirmed the importance of this species as a food and feed source and as a resource for both pharmaceutical and cosmetic compounds. Unfortunately, the argane tree ecosystem is facing significant threats from environmental changes (global warming, over-population) and over-exploitation. Limited research has been conducted, however, on argane tree genetics and genomics, which hinders its conservation and genetic improvement. Methods: Here, we present a draft genome assembly of A. spinosa. A reliable reference genome of A. spinosa was created using a hybrid de novo assembly approach combining short and long sequencing reads. Results: In total, 144 Gb Illumina HiSeq reads and 7.2 Gb PacBio reads were produced and assembled. The final draft genome comprises 75 327 scaffolds totaling 671 Mb with an N50 of 49 916 kb. The draft assembly is close to the genome size estimated by k-mer distribution and covers 89% of complete and 4.3% of partial Arabidopsis orthologous groups in BUSCO. Conclusion: The A. spinosa genome will be useful for assessing biodiversity leading to efficient conservation of this endangered endemic tree. Furthermore, the genome may enable genome-assisted cultivar breeding, and provide a better understanding of important metabolic pathways and their underlying genes for both cosmetic and pharmacological purposes.
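
The abstract reports an N50 for the scaffolds. As a reminder of what that metric measures, here is a minimal Python sketch of the usual N50 definition. The function name and the toy scaffold lengths are made up for illustration; this is not the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Made-up scaffold lengths: total 300, so the half-way point is 150.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints: 70 (80 + 70 reaches 150)
```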

3.

Clustering analysis of proteins from microbial genomes at multiple levels of resolution.

Zaslavsky, Leonid; Ciufo, Stacy; Fedorov, Boris; Tatusova, Tatiana.

BMC Bioinformatics ; 17 Suppl 8: 276, 2016 Aug 31.

Article in English | MEDLINE | ID: mdl-27586436

ABSTRACT

BACKGROUND: Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy. RESULTS: Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely-related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow limiting the protein set included in global clustering. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provides a robust representation and high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters. CONCLUSION: The developed filtering strategies make it possible to identify and exclude such peripheral proteins, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data at different levels of detail to be obtained and data redundancy eliminated while keeping biologically interesting variations.
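
The abstract does not say how a clustroid is chosen. A common definition, assumed here, is the cluster member with the smallest summed distance to all other members (the medoid). The Python sketch below illustrates that idea only. The function names, protein labels and distances are hypothetical and are not taken from the paper.

```python
def clustroid(members, distance):
    """Return the member with the smallest summed distance to all members."""
    return min(members, key=lambda m: sum(distance(m, other) for other in members))


# Hypothetical pairwise distances (for example 1 - fractional identity)
# between four proteins of one in-clade cluster.
toy_distances = {
    frozenset({"A", "B"}): 0.10,
    frozenset({"A", "C"}): 0.20,
    frozenset({"A", "D"}): 0.30,
    frozenset({"B", "C"}): 0.15,
    frozenset({"B", "D"}): 0.20,
    frozenset({"C", "D"}): 0.40,
}


def toy_distance(x, y):
    """Symmetric lookup into the toy table; a protein is at distance 0 from itself."""
    return 0.0 if x == y else toy_distances[frozenset({x, y})]


representative = clustroid(["A", "B", "C", "D"], toy_distance)
print(representative)  # prints: B (summed distances: A 0.60, B 0.45, C 0.75, D 0.90)
```

Keeping one representative per in-clade cluster is what gives the compression the abstract describes: global clustering then works on representatives rather than on every protein.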

Subject(s)

Bacterial Proteins/metabolism; Genome, Microbial; Algorithms; Cluster Analysis; Guanosine Triphosphate/metabolism; Humans; Phylogeny; Statistics as Topic

4.

NCBI prokaryotic genome annotation pipeline.

Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James.

Nucleic Acids Res ; 44(14): 6614-24, 2016 08 19.

Article in English | MEDLINE | ID: mdl-27342282

ABSTRACT

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

Subject(s)

Genome, Bacterial; Molecular Sequence Annotation; Prokaryotic Cells/metabolism; Bacteria/genetics; Bacterial Proteins/chemistry; Databases, Nucleic Acid; Genes, Bacterial

5.

From data repositories to submission portals: rethinking the role of domain-specific databases in CollecTF.

Kiliç, Sefa; Sagitova, Dinara M; Wolfish, Shoshannah; Bely, Benoit; Courtot, Mélanie; Ciufo, Stacy; Tatusova, Tatiana; O'Donovan, Claire; Chibucos, Marcus C; Martin, Maria J; Erill, Ivan.

Database (Oxford) ; 2016, 2016.

Article in English | MEDLINE | ID: mdl-27114493

ABSTRACT

Domain-specific databases are essential resources for the biomedical community, leveraging expert knowledge to curate published literature and provide access to referenced data and knowledge. The limited scope of these databases, however, poses important challenges for their infrastructure, visibility, funding and usefulness to the broader scientific community. CollecTF is a community-oriented database documenting experimentally validated transcription factor (TF)-binding sites in the Bacteria domain. In its quest to become a community resource for the annotation of transcriptional regulatory elements in bacterial genomes, CollecTF aims to move away from the conventional data-repository paradigm of domain-specific databases. Through the adoption of well-established ontologies, identifiers and collaborations, CollecTF has progressively also become a portal for the annotation and submission of information on transcriptional regulatory elements to major biological sequence resources (RefSeq, UniProtKB and the Gene Ontology Consortium). This fundamental change in database conception capitalizes on the domain-specific knowledge of contributing communities to provide high-quality annotations, while leveraging the availability of stable information hubs to promote long-term access and provide high visibility to the data. As a submission portal, CollecTF generates TF-binding site information through direct annotation of RefSeq genome records, definition of TF-based regulatory networks in UniProtKB entries and submission of functional annotations to the Gene Ontology. As a database, CollecTF provides enhanced search and browsing, targeted data exports, binding motif analysis tools and integration with motif discovery and search platforms. This innovative approach will allow CollecTF to focus its limited resources on the generation of high-quality information and the provision of specialized access to the data. Database URL: http://www.collectf.org/.

Subject(s)

Database Management Systems; Databases, Genetic; Datasets as Topic; User-Computer Interface

6.

Update on Genomic Databases and Resources at the National Center for Biotechnology Information.

Tatusova, Tatiana.

Methods Mol Biol ; 1415: 3-30, 2016.

Article in English | MEDLINE | ID: mdl-27115625

ABSTRACT

The National Center for Biotechnology Information (NCBI), as a primary public repository of genomic sequence data, collects and maintains enormous amounts of heterogeneous data. Data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains are integrated with the analytical, search, and retrieval resources through the NCBI website. Entrez, NCBI's text-based search and retrieval system, provides a fast and easy way to navigate across diverse biological databases. Comparative genome analysis tools lead to further understanding of evolutionary processes, quickening the pace of discovery. Recent technological innovations have ignited an explosion in genome sequencing that has fundamentally changed our understanding of the biology of living organisms. This huge increase in DNA sequence data presents new challenges for the information management system and the visualization tools. New strategies have been designed to bring order to this genome sequence shockwave and improve the usability of associated data.

Subject(s)

Databases, Genetic; Sequence Analysis, DNA/methods; Epigenomics; Genomics/methods; Information Storage and Retrieval; Internet; National Institutes of Health (U.S.); National Library of Medicine (U.S.); United States

7.

Assembly: a resource for assembled genomes at NCBI.

Kitts, Paul A; Church, Deanna M; Thibaud-Nissen, Françoise; Choi, Jinna; Hem, Vichet; Sapojnikov, Victor; Smith, Robert G; Tatusova, Tatiana; Xiang, Charlie; Zherikov, Andrey; DiCuccio, Michael; Murphy, Terence D; Pruitt, Kim D; Kimchi, Avi.

Nucleic Acids Res ; 44(D1): D73-80, 2016 Jan 04.

Article in English | MEDLINE | ID: mdl-26578580

ABSTRACT

The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.

Subject(s)

Databases, Nucleic Acid; Genomics; Animals; Genome; Humans; Internet; Mice

8.

Gene: a gene-centered information resource at NCBI.

Brown, Garth R; Hem, Vichet; Katz, Kenneth S; Ovetsky, Michael; Wallin, Craig; Ermolaeva, Olga; Tolstoy, Igor; Tatusova, Tatiana; Pruitt, Kim D; Maglott, Donna R; Murphy, Terence D.

Nucleic Acids Res ; 43(Database issue): D36-42, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25355515

ABSTRACT

The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP.
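
The abstract mentions programmatic access through the Entrez programming utilities. As a hedged illustration, the Python sketch below only builds an ESearch request URL for the Gene database and does not contact NCBI. The helper name `gene_esearch_url` and the example query are assumptions made for this sketch.

```python
from urllib.parse import urlencode

# Public base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def gene_esearch_url(term, retmax=20):
    """Build an ESearch URL that queries the Gene database for `term`."""
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query; the field tags follow Entrez query syntax.
url = gene_esearch_url("BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Fetching the URL should return a list of Gene identifiers, which are the unique, tracked integers described in the abstract. Anyone sending real requests should check NCBI's usage guidelines first.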

Subject(s)

Databases, Genetic; Genes; Genetic Variation; Genomics; Internet; National Library of Medicine (U.S.); Phenotype; United States

9.

Update on RefSeq microbial genomes resources.

Tatusova, Tatiana; Ciufo, Stacy; Federhen, Scott; Fedorov, Boris; McVeigh, Richard; O'Neill, Kathleen; Tolstoy, Igor; Zaslavsky, Leonid.

Nucleic Acids Res ; 43(Database issue): D599-605, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25510495

ABSTRACT

The NCBI RefSeq genome collection (http://www.ncbi.nlm.nih.gov/genome) represents all three major domains of life (Eukarya, Bacteria and Archaea) as well as viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During 2014, more than 10,000 microbial genome assemblies were publicly released, bringing the total number of prokaryotic genomes close to 30,000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and the results of the pre-computed analysis, and improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to the RefSeq prokaryotic genomes data processing pipeline, including the calculation of genome groups (clades) and the optimization of protein cluster generation using a pan-genome approach.

Subject(s)

Databases, Nucleic Acid; Genome, Archaeal; Genome, Bacterial; Internet; Molecular Sequence Annotation

10.

Genomic standards consortium projects.

Field, Dawn; Sterk, Peter; Kottmann, Renzo; De Smet, J Wim; Amaral-Zettler, Linda; Cochrane, Guy; Cole, James R; Davies, Neil; Dawyndt, Peter; Garrity, George M; Gilbert, Jack A; Glöckner, Frank Oliver; Hirschman, Lynette; Klenk, Hans-Peter; Knight, Rob; Kyrpides, Nikos; Meyer, Folker; Karsch-Mizrachi, Ilene; Morrison, Norman; Robbins, Robert; San Gil, Inigo; Sansone, Susanna; Schriml, Lynn; Tatusova, Tatiana; Ussery, Dave; Yilmaz, Pelin; White, Owen; Wooley, John; Caporaso, Gregory.

Stand Genomic Sci ; 9(3): 599-601, 2014 Jun 15.

Article in English | MEDLINE | ID: mdl-25197446

ABSTRACT

The Genomic Standards Consortium (GSC) is an open-membership community that was founded in 2005 to work towards the development, implementation and harmonization of standards in the field of genomics. Starting with the defined task of establishing a minimal set of descriptions the GSC has evolved into an active standards-setting body that currently has 18 ongoing projects, with additional projects regularly proposed from within and outside the GSC. Here we describe our recently enacted policy for proposing new activities that are intended to be taken on by the GSC, along with the template for proposing such new activities.

11.

Improvements to pairwise sequence comparison (PASC): a genome-based web tool for virus classification.

Bao, Yiming; Chetvernin, Vyacheslav; Tatusova, Tatiana.

Arch Virol ; 159(12): 3293-304, 2014 Dec.

Article in English | MEDLINE | ID: mdl-25119676

ABSTRACT

The number of viral genome sequences in the public databases is increasing dramatically, and these sequences are playing an important role in virus classification. Pairwise sequence comparison is a sequence-based virus classification method. A program using this method calculates the pairwise identities of virus sequences within a virus family and displays their distribution, and visual analysis helps to determine demarcations at different taxonomic levels such as strain, species, genus and subfamily. Subsequent comparison of new sequences against existing ones allows viruses from which the new sequences were derived to be classified. Although this method cannot be used as the only criterion for virus classification in some cases, it is a quantitative method and has many advantages over conventional virus classification methods. It has been applied to several virus families, and there is an increasing interest in using this method for other virus families/groups. The Pairwise Sequence Comparison (PASC) classification tool was created at the National Center for Biotechnology Information. The tool's database stores pairwise identities for complete genomes/segments of 56 virus families/groups. Data in the system are updated every day to reflect changes in virus taxonomy and additions of new virus sequences to the public database. The web interface of the tool ( http://www.ncbi.nlm.nih.gov/sutils/pasc/ ) makes it easy to navigate and perform analyses. Multiple new viral genome sequences can be tested simultaneously with this system to suggest the taxonomic position of virus isolates in a specific family. PASC eliminates potential discrepancies in the results caused by different algorithms and/or different data used by researchers.
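
The core idea described above is to compute pairwise identities within a family and inspect their distribution. The Python sketch below is a toy illustration only. It assumes the sequences are already aligned to equal length, whereas PASC itself works from computed alignments of full genomes. The function names, the bin width and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def percent_identity(a, b):
    """Percent of matching, non-gap positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def identity_histogram(sequences, bin_width=10):
    """Count sequence pairs per identity bin (key = lower bound in percent)."""
    bins = Counter()
    for a, b in combinations(sequences, 2):
        identity = percent_identity(a, b)
        # 100% identity is folded into the top bin.
        lower = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        bins[lower] += 1
    return dict(sorted(bins.items()))


# Made-up aligned sequences standing in for genomes of one virus family.
toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGTTCGTAA", "TTGTTCGGAA"]
print(identity_histogram(toy))
```

On real data, valleys between peaks in such a histogram are what the visual analysis uses to suggest demarcations between taxonomic levels.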

Subject(s)

Computational Biology/methods; Genome, Viral; Internet; Sequence Homology; Viruses/classification; Viruses/genetics; National Institutes of Health (U.S.); United States

12.

Virus Variation Resource--recent updates and future directions.

Brister, J Rodney; Bao, Yiming; Zhdanov, Sergey A; Ostapchuck, Yuri; Chetvernin, Vyacheslav; Kiryutin, Boris; Zaslavsky, Leonid; Kimelman, Michael; Tatusova, Tatiana A.

Nucleic Acids Res ; 42(Database issue): D660-5, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24304891

ABSTRACT

Virus Variation (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/) is a comprehensive, web-based resource designed to support the retrieval and display of large virus sequence datasets. The resource includes a value-added database, a specialized search interface and a suite of sequence data displays. Virus-specific sequence annotation and database loading pipelines produce consistent protein and gene annotation, capture sequence descriptors from sequence records, and then map these metadata to a controlled vocabulary. The database supports a metadata-driven, web-based search interface where sequences can be selected using a variety of biological and clinical criteria. Retrieved sequences can then be downloaded in a variety of formats or analyzed using a suite of tools and displays. Over the past 2 years, the pre-existing influenza and Dengue virus resources have been combined into a single construct and West Nile virus added to the resultant resource. A number of improvements were incorporated into the sequence annotation and database loading pipelines, and the virus-specific search interfaces were updated to support more advanced functions. Several new features have also been added to the sequence download options, and a new multiple sequence alignment viewer has been incorporated into the resource tool set. Together these enhancements should improve usability and support the inclusion of new viruses in the future.

Subject(s)

Databases, Genetic; Viruses/genetics; Genes, Viral; Genome, Viral; Genomics; Internet; Molecular Sequence Annotation; Orthomyxoviridae/genetics; Sequence Alignment; Viral Proteins

13.

RefSeq microbial genomes database: new representation and annotation strategy.

Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor.

Nucleic Acids Res ; 42(Database issue): D553-9, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24316578

ABSTRACT

The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.

Subject(s)

Databases, Genetic; Genome, Microbial; Molecular Sequence Annotation; Bacterial Proteins/genetics; Genome, Bacterial; Genomics/standards; Internet; Reference Standards

14.

PAirwise Sequence Comparison (PASC) and its application in the classification of filoviruses.

Bao, Yiming; Chetvernin, Vyacheslav; Tatusova, Tatiana.

Viruses ; 4(8): 1318-27, 2012 08.

Article in English | MEDLINE | ID: mdl-23012628

ABSTRACT

PAirwise Sequence Comparison (PASC) is a tool that uses genome sequence similarity to help with virus classification. The PASC tool at NCBI uses two methods: local alignment based on BLAST and global alignment based on the Needleman-Wunsch algorithm. It works for complete genomes of viruses of several families/groups, and for the family Filoviridae, it currently includes 52 complete genomes available in GenBank. It has been shown that the BLAST-based alignment approach works better for filoviruses, and it is therefore recommended for establishing taxon demarcation criteria. When more genome sequences with high divergence become available, these demarcations will most likely become more precise. The tool can compare new genome sequences of filoviruses with the ones already in the database, and propose their taxonomic classification.
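
For readers unfamiliar with the global alignment method named above, here is a minimal Python sketch of Needleman-Wunsch scoring. The scoring parameters (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration. The abstract does not state the parameters PASC uses, and this is not the tool's implementation.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of `a` and `b` with a linear gap penalty."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


# Three matches (A, G, T) and one gap opposite C give 3 - 1 = 2.
print(needleman_wunsch_score("ACGT", "AGT"))  # prints: 2
```

The sketch keeps only two rows of the dynamic-programming table, which is enough to obtain the score. Recovering the alignment itself would require the full table or a traceback structure.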

Subject(s)

Filoviridae/classification; Sequence Alignment/methods; Base Sequence; Databases, Nucleic Acid; Filoviridae/chemistry; Filoviridae/genetics; Molecular Sequence Data; Sequence Alignment/instrumentation; Sequence Analysis, DNA; Software

15.

NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy.

Pruitt, Kim D; Tatusova, Tatiana; Brown, Garth R; Maglott, Donna R.

Nucleic Acids Res ; 40(Database issue): D130-5, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22121212

ABSTRACT

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration. The database includes over 16,000 organisms, 2.4 × 10^6 genomic records, 13 × 10^6 proteins and 2 × 10^6 RNA records spanning prokaryotes, eukaryotes and viruses (RefSeq release 49, September 2011). The RefSeq database is maintained by a combined approach of automated analyses, collaboration and manual curation to generate an up-to-date representation of the sequence, its features, names and cross-links to related sources of information. We report here on recent growth, the status of curating the human RefSeq data set, more extensive feature annotation and current policy for eukaryotic genome annotation via the NCBI annotation pipeline. More information about the resource is available online (see http://www.ncbi.nlm.nih.gov/RefSeq/).

Subject(s)

Databases, Genetic; Molecular Sequence Annotation; Sequence Analysis/standards; Genomics/standards; Humans; Reference Standards; Sequence Analysis, DNA/standards; Sequence Analysis, Protein/standards; Sequence Analysis, RNA/standards

16.

BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.

Barrett, Tanya; Clark, Karen; Gevorgyan, Robert; Gorelenkov, Vyacheslav; Gribov, Eugene; Karsch-Mizrachi, Ilene; Kimelman, Michael; Pruitt, Kim D; Resenchuk, Sergei; Tatusova, Tatiana; Yaschenko, Eugene; Ostell, James.

Nucleic Acids Res ; 40(Database issue): D57-63, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22139929

ABSTRACT

As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.

Subject(s)

Databases, Genetic; Genomics; Internet; Systems Integration; Transcriptome; User-Computer Interface

17.

Database resources of the National Center for Biotechnology Information.

Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; Dicuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian.

Nucleic Acids Res ; 40(Database issue): D13-25, 2012 Jan.

Article in English | MEDLINE | ID: mdl-22140104

ABSTRACT

In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.

Subject(s)

Databases as Topic , Genetic Databases , Protein Databases , Gene Expression , Genomics , Internet , Molecular Models , National Library of Medicine (U.S.) , Periodicals as Topic , PubMed , Sequence Alignment , DNA Sequence Analysis , Protein Sequence Analysis , RNA Sequence Analysis , Small Molecule Libraries , United States

18.

Solving the Problem: Genome Annotation Standards before the Data Deluge.

Klimke, William; O'Donovan, Claire; White, Owen; Brister, J Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D; Tatusova, Tatiana.

Stand Genomic Sci ; 5(1): 168-93, 2011 Oct 15.

Article in English | MEDLINE | ID: mdl-22180819

ABSTRACT

The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI, in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.

19.

The Genomic Standards Consortium.

Field, Dawn; Amaral-Zettler, Linda; Cochrane, Guy; Cole, James R; Dawyndt, Peter; Garrity, George M; Gilbert, Jack; Glöckner, Frank Oliver; Hirschman, Lynette; Karsch-Mizrachi, Ilene; Klenk, Hans-Peter; Knight, Rob; Kottmann, Renzo; Kyrpides, Nikos; Meyer, Folker; San Gil, Inigo; Sansone, Susanna-Assunta; Schriml, Lynn M; Sterk, Peter; Tatusova, Tatiana; Ussery, David W; White, Owen; Wooley, John.

PLoS Biol ; 9(6): e1001088, 2011 Jun.

Article in English | MEDLINE | ID: mdl-21713030

ABSTRACT

A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.

Subject(s)

Genetic Databases , Genomics/standards , International Cooperation , Metagenome

20.

Cryptic splice sites and split genes.

Kapustin, Yuri; Chan, Elcie; Sarkar, Rupa; Wong, Frederick; Vorechovsky, Igor; Winston, Robert M; Tatusova, Tatiana; Dibb, Nick J.

Nucleic Acids Res ; 39(14): 5837-44, 2011 Aug.

Article in English | MEDLINE | ID: mdl-21470962

ABSTRACT

We describe a new program called cryptic splice finder (CSF) that can reliably identify cryptic splice sites (css), so providing a useful tool to help investigate splicing mutations in genetic disease. We report that many css are not entirely dormant and are often already active at low levels in normal genes prior to their enhancement in genetic disease. We also report a fascinating correlation between the positions of css and introns, whereby css within the exons of one species frequently match the exact position of introns in equivalent genes from another species. These results strongly indicate that many introns were inserted into css during evolution and they also imply that the splicing information that lies outside some introns can be independently recognized by the splicing machinery and was in place prior to intron insertion. This indicates that non-intronic splicing information had a key role in shaping the split structure of eukaryote genes.

Subject(s)

RNA Splice Sites , Software , Base Sequence , Consensus Sequence , Molecular Evolution , Expressed Sequence Tags/chemistry , Genes , Inborn Genetic Diseases/genetics , Genomics/methods , Humans , Introns , Sequence Alignment , Protein Sequence Analysis
