Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation.

Schäffer, Alejandro A; McVeigh, Richard; Robbertse, Barbara; Schoch, Conrad L; Johnston, Anjanette; Underwood, Beverly A; Karsch-Mizrachi, Ilene; Nawrocki, Eric P.

BMC Bioinformatics ; 22(1): 400, 2021 Aug 12.

Article in English | MEDLINE | ID: mdl-34384346

ABSTRACT

BACKGROUND: The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene, and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron. RESULTS: To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used for checking incoming GenBank submissions and used by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8350 taxa. CONCLUSION: Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank.

Subject(s)

Databases, Nucleic Acid , RNA, Ribosomal , DNA, Ribosomal , Phylogeny , RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 18S/genetics , Sequence Analysis, RNA

NCBI Taxonomy: a comprehensive update on curation, resources and tools.

Schoch, Conrad L; Ciufo, Stacy; Domrachev, Mikhail; Hotton, Carol L; Kannan, Sivakumar; Khovanskaya, Rogneda; Leipe, Detlef; McVeigh, Richard; O'Neill, Kathleen; Robbertse, Barbara; Sharma, Shobha; Soussov, Vladimir; Sullivan, John P; Sun, Lu; Turner, Seán; Karsch-Mizrachi, Ilene.

Database (Oxford) ; 2020, 2020 01 01.

Article in English | MEDLINE | ID: mdl-32761142

ABSTRACT

The National Center for Biotechnology Information (NCBI) Taxonomy includes organism names and classifications for every sequence in the nucleotide and protein sequence databases of the International Nucleotide Sequence Database Collaboration. Since the last review of this resource in 2012, it has undergone several improvements. Most notable is the shift from a single SQL database to a series of linked databases tied to a framework of data called NameBank. This means that relations among data elements can be adjusted in more detail, resulting in expanded annotation of synonyms, the ability to flag names with specific nomenclatural properties, enhanced tracking of publications tied to names and improved annotation of scientific authorities and types. Additionally, practices utilized by NCBI Taxonomy curators specific to major taxonomic groups are described, terms peculiar to NCBI Taxonomy are explained, external resources are acknowledged and updates to tools and other resources are documented. Database URL: https://www.ncbi.nlm.nih.gov/taxonomy.

Subject(s)

Classification , Database Management Systems , Databases, Genetic , Animals , Bacteria/genetics , Humans , National Library of Medicine (U.S.) , Plants/genetics , United States , Viruses/genetics

VecScreen_plus_taxonomy: imposing a tax(onomy) increase on vector contamination screening.

Schäffer, Alejandro A; Nawrocki, Eric P; Choi, Yoon; Kitts, Paul A; Karsch-Mizrachi, Ilene; McVeigh, Richard.

Bioinformatics ; 34(5): 755-759, 2018 03 01.

Article in English | MEDLINE | ID: mdl-29069347

ABSTRACT

Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: aschaffe@helix.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
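
The taxonomy-based reasoning in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `VectorMatch` record, the `shared_depth` and `likely_false_positive` helpers, the lineage tuples, and the `min_shared_depth` threshold are all hypothetical names and values chosen for illustration. The sketch assumes each VecScreen match has already been annotated with a root-to-leaf lineage for both the submitted sequence and the source of the matching vector segment, and it flags the match as a probable false positive when the two lineages agree to a chosen depth.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VectorMatch:
    """Hypothetical record: one vector-screen hit with taxonomy attached."""
    query_lineage: Tuple[str, ...]   # root-to-leaf taxa of the submitted sequence
    vector_lineage: Tuple[str, ...]  # root-to-leaf taxa of the vector segment's source


def shared_depth(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
    """Count how many leading taxa two root-to-leaf lineages have in common."""
    depth = 0
    for taxon_a, taxon_b in zip(a, b):
        if taxon_a != taxon_b:
            break
        depth += 1
    return depth


def likely_false_positive(match: VectorMatch, min_shared_depth: int = 3) -> bool:
    """Treat a hit as a probable false positive when query and vector source
    are closely related, i.e. their lineages agree for at least
    min_shared_depth levels (an assumed, illustrative threshold)."""
    return shared_depth(match.query_lineage, match.vector_lineage) >= min_shared_depth


# Illustrative lineages only; real lineages would come from a taxonomy database.
E_COLI = ("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia",
          "Escherichia coli")
HUMAN = ("Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates",
         "Hominidae", "Homo", "Homo sapiens")

same_source = VectorMatch(query_lineage=E_COLI, vector_lineage=E_COLI)
distant_source = VectorMatch(query_lineage=HUMAN, vector_lineage=E_COLI)
```

Under these assumptions, `likely_false_positive(same_source)` marks an E. coli sequence matching an E. coli-derived vector segment as a probable false positive, while `likely_false_positive(distant_source)` leaves a human sequence matching the same segment as a candidate for true contamination.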

Subject(s)

Databases, Nucleic Acid/standards , Sequence Analysis, DNA/methods , Software , Bacteria , Eukaryota

Update on RefSeq microbial genomes resources.

Tatusova, Tatiana; Ciufo, Stacy; Federhen, Scott; Fedorov, Boris; McVeigh, Richard; O'Neill, Kathleen; Tolstoy, Igor; Zaslavsky, Leonid.

Nucleic Acids Res ; 43(Database issue): D599-605, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25510495

ABSTRACT

The NCBI RefSeq genome collection (http://www.ncbi.nlm.nih.gov/genome) represents all three major domains of life (Eukarya, Bacteria and Archaea) as well as viruses. Prokaryotic genome sequences are the most rapidly growing part of the collection. During 2014, more than 10,000 microbial genome assemblies were publicly released, bringing the total number of prokaryotic genomes close to 30,000. We continue to improve the quality and usability of the microbial genome resources by providing easy access to the data and to the results of precomputed analyses, and by improving analysis and visualization tools. A number of improvements have been incorporated into the Prokaryotic Genome Annotation Pipeline. Several new features have been added to the RefSeq prokaryotic genome data-processing pipeline, including the calculation of genome groups (clades) and the optimization of protein cluster generation using a pan-genome approach.

Subject(s)

Databases, Nucleic Acid , Genome, Archaeal , Genome, Bacterial , Internet , Molecular Sequence Annotation
