Search | BVS Infectious and Parasitic Diseases

1.

The genome of the colonial hydroid Hydractinia reveals that their stem cells use a toolkit of evolutionarily shared genes with all animals.

Schnitzler, Christine E; Chang, E Sally; Waletich, Justin; Quiroga-Artigas, Gonzalo; Wong, Wai Yee; Nguyen, Anh-Dao; Barreira, Sofia N; Doonan, Liam B; Gonzalez, Paul; Koren, Sergey; Gahan, James M; Sanders, Steven M; Bradshaw, Brian; DuBuc, Timothy Q; de Jong, Danielle; Nawrocki, Eric P; Larson, Alexandra; Klasfeld, Samantha; Gornik, Sebastian G; Moreland, R Travis; Wolfsberg, Tyra G; Phillippy, Adam M; Mullikin, James C; Simakov, Oleg; Cartwright, Paulyn; Nicotra, Matthew; Frank, Uri; Baxevanis, Andreas D.

Genome Res ; 34(3): 498-513, 2024 Apr 25.

Article in English | MEDLINE | ID: mdl-38508693

ABSTRACT

Hydractinia is a colonial marine hydroid that shows remarkable biological properties, including the capacity to regenerate its entire body throughout its lifetime, a process made possible by its adult migratory stem cells, known as i-cells. Here, we provide an in-depth characterization of the genomic structure and gene content of two Hydractinia species, Hydractinia symbiolongicarpus and Hydractinia echinata, placing them in a comparative evolutionary framework with other cnidarian genomes. We also generated and annotated a single-cell transcriptomic atlas for adult male H. symbiolongicarpus and identified cell-type markers for all major cell types, including key i-cell markers. Orthology analyses based on the markers revealed that Hydractinia's i-cells are highly enriched in genes that are widely shared amongst animals, a striking finding given that Hydractinia has a higher proportion of phylum-specific genes than any of the other 41 animals in our orthology analysis. These results indicate that Hydractinia's stem cells and early progenitor cells may use a toolkit shared with all animals, making it a promising model organism for future exploration of stem cell biology and regenerative medicine. The genomic and transcriptomic resources for Hydractinia presented here will enable further studies of their regenerative capacity, colonial morphology, and ability to distinguish self from nonself.

Subjects

Genome , Hydrozoa , Animals , Hydrozoa/genetics , Molecular Evolution , Transcriptome , Stem Cells/metabolism , Male , Phylogeny , Single-Cell Analysis/methods

2.

Computational strategies to combat COVID-19: useful tools to accelerate SARS-CoV-2 and coronavirus research.

Hufsky, Franziska; Lamkiewicz, Kevin; Almeida, Alexandre; Aouacheria, Abdel; Arighi, Cecilia; Bateman, Alex; Baumbach, Jan; Beerenwinkel, Niko; Brandt, Christian; Cacciabue, Marco; Chuguransky, Sara; Drechsel, Oliver; Finn, Robert D; Fritz, Adrian; Fuchs, Stephan; Hattab, Georges; Hauschild, Anne-Christin; Heider, Dominik; Hoffmann, Marie; Hölzer, Martin; Hoops, Stefan; Kaderali, Lars; Kalvari, Ioanna; von Kleist, Max; Kmiecinski, Renó; Kühnert, Denise; Lasso, Gorka; Libin, Pieter; List, Markus; Löchel, Hannah F; Martin, Maria J; Martin, Roman; Matschinske, Julian; McHardy, Alice C; Mendes, Pedro; Mistry, Jaina; Navratil, Vincent; Nawrocki, Eric P; O'Toole, Áine Niamh; Ontiveros-Palacios, Nancy; Petrov, Anton I; Rangel-Pineros, Guillermo; Redaschi, Nicole; Reimering, Susanne; Reinert, Knut; Reyes, Alejandro; Richardson, Lorna; Robertson, David L; Sadegh, Sepideh; Singer, Joshua B.

Brief Bioinform ; 22(2): 642-663, 2021 03 22.

Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.

Subjects

COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Viral Genome , Humans , Pandemics , SARS-CoV-2/genetics

3.

Rfam 14: expanded coverage of metagenomic, viral and microRNA families.

Kalvari, Ioanna; Nawrocki, Eric P; Ontiveros-Palacios, Nancy; Argasinska, Joanna; Lamkiewicz, Kevin; Marz, Manja; Griffiths-Jones, Sam; Toffano-Nioche, Claire; Gautheret, Daniel; Weinberg, Zasha; Rivas, Elena; Eddy, Sean R; Finn, Robert D; Bateman, Alex; Petrov, Anton I.

Nucleic Acids Res ; 49(D1): D192-D200, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33211869

ABSTRACT

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Subjects

Nucleic Acid Databases , Metagenome , MicroRNAs/genetics , Bacterial RNA/genetics , Untranslated RNA/genetics , Viral RNA/genetics , Bacteria/genetics , Bacteria/metabolism , Base Pairing , Base Sequence , Humans , Internet , MicroRNAs/classification , MicroRNAs/metabolism , Molecular Sequence Annotation , Nucleic Acid Conformation , Bacterial RNA/classification , Bacterial RNA/metabolism , Untranslated RNA/classification , Untranslated RNA/metabolism , Viral RNA/classification , Viral RNA/metabolism , Sequence Alignment , RNA Sequence Analysis , Software , Viruses/genetics , Viruses/metabolism

4.

Ribovore: ribosomal RNA sequence analysis for GenBank submissions and database curation.

Schäffer, Alejandro A; McVeigh, Richard; Robbertse, Barbara; Schoch, Conrad L; Johnston, Anjanette; Underwood, Beverly A; Karsch-Mizrachi, Ilene; Nawrocki, Eric P.

BMC Bioinformatics ; 22(1): 400, 2021 Aug 12.

Article in English | MEDLINE | ID: mdl-34384346

ABSTRACT

BACKGROUND: The DNA sequences encoding ribosomal RNA genes (rRNAs) are commonly used as markers to identify species, including in metagenomics samples that may combine many organismal communities. The 16S small subunit ribosomal RNA (SSU rRNA) gene is typically used to identify bacterial and archaeal species. The nuclear 18S SSU rRNA gene, and 28S large subunit (LSU) rRNA gene have been used as DNA barcodes and for phylogenetic studies in different eukaryote taxonomic groups. Because of their popularity, the National Center for Biotechnology Information (NCBI) receives a disproportionate number of rRNA sequence submissions and BLAST queries. These sequences vary in quality, length, origin (nuclear, mitochondria, plastid), and organism source and can represent any region of the ribosomal cistron. RESULTS: To improve the timely verification of quality, origin and loci boundaries, we developed Ribovore, a software package for sequence analysis of rRNA sequences. The ribotyper and ribosensor programs are used to validate incoming sequences of bacterial and archaeal SSU rRNA. The ribodbmaker program is used to create high-quality datasets of rRNAs from different taxonomic groups. Key algorithmic steps include comparing candidate sequences against rRNA sequence profile hidden Markov models (HMMs) and covariance models of rRNA sequence and secondary-structure conservation, as well as other tests. Nine freely available blastn rRNA databases created and maintained with Ribovore are used for checking incoming GenBank submissions and used by the blastn browser interface at NCBI. Since 2018, Ribovore has been used to analyze more than 50 million prokaryotic SSU rRNA sequences submitted to GenBank, and to select at least 10,435 fungal rRNA RefSeq records from type material of 8350 taxa. CONCLUSION: Ribovore combines single-sequence and profile-based methods to improve GenBank processing and analysis of rRNA sequences. It is a standalone, portable, and extensible software package for the alignment, classification and validation of rRNA sequences. Researchers planning on submitting SSU rRNA sequences to GenBank are encouraged to download and use Ribovore to analyze their sequences prior to submission to determine which sequences are likely to be automatically accepted into GenBank.

Subjects

Nucleic Acid Databases , Ribosomal RNA , Ribosomal DNA , Phylogeny , 16S Ribosomal RNA/genetics , 18S Ribosomal RNA/genetics , RNA Sequence Analysis

5.

VADR: validation and annotation of virus sequence submissions to GenBank.

Schäffer, Alejandro A; Hatcher, Eneida L; Yankie, Linda; Shonkwiler, Lara; Brister, J Rodney; Karsch-Mizrachi, Ilene; Nawrocki, Eric P.

BMC Bioinformatics ; 21(1): 211, 2020 May 24.

Article in English | MEDLINE | ID: mdl-32448124

ABSTRACT

BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. RESULTS: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. CONCLUSION: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.

Subjects

Betacoronavirus , Coronavirus Infections , Nucleic Acid Databases , Molecular Sequence Annotation , Pandemics , Viral Pneumonia , Software , Betacoronavirus/genetics , COVID-19 , Coronavirus Infections/genetics , DNA Viruses , Genomics , Humans , Molecular Sequence Annotation/standards , Viral Pneumonia/genetics , SARS-CoV-2 , Viruses

6.

Group I introns are widespread in archaea.

Nawrocki, Eric P; Jones, Thomas A; Eddy, Sean R.

Nucleic Acids Res ; 46(15): 7970-7976, 2018 09 06.

Article in English | MEDLINE | ID: mdl-29788499

ABSTRACT

Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.

Subjects

Archaea/genetics , Introns/genetics , Archaeal RNA/genetics , Catalytic RNA/genetics , Archaea/classification , Archaea/enzymology , Base Sequence , Nucleic Acid Conformation , Phylogeny , Archaeal RNA/chemistry , Archaeal RNA/classification , Catalytic RNA/chemistry , Catalytic RNA/classification , Species Specificity

7.

Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families.

Kalvari, Ioanna; Argasinska, Joanna; Quinones-Olvera, Natalia; Nawrocki, Eric P; Rivas, Elena; Eddy, Sean R; Bateman, Alex; Finn, Robert D; Petrov, Anton I.

Nucleic Acids Res ; 46(D1): D335-D342, 2018 01 04.

Article in English | MEDLINE | ID: mdl-29112718

ABSTRACT

The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.

Subjects

Nucleic Acid Databases , Genome , Untranslated RNA/chemistry , Untranslated RNA/genetics , Humans , Molecular Sequence Annotation , Nucleic Acid Conformation , Untranslated RNA/classification , Sequence Alignment , RNA Sequence Analysis

8.

VecScreen_plus_taxonomy: imposing a tax(onomy) increase on vector contamination screening.

Schäffer, Alejandro A; Nawrocki, Eric P; Choi, Yoon; Kitts, Paul A; Karsch-Mizrachi, Ilene; McVeigh, Richard.

Bioinformatics ; 34(5): 755-759, 2018 03 01.

Article in English | MEDLINE | ID: mdl-29069347

ABSTRACT

Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: aschaffe@helix.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.

Subjects

Nucleic Acid Databases/standards , DNA Sequence Analysis/methods , Software , Bacteria , Eukaryota

9.

Virus Variation Resource - improved response to emergent viral outbreaks.

Hatcher, Eneida L; Zhdanov, Sergey A; Bao, Yiming; Blinkova, Olga; Nawrocki, Eric P; Ostapchuck, Yuri; Schäffer, Alejandro A; Brister, J Rodney.

Nucleic Acids Res ; 45(D1): D482-D490, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27899678

ABSTRACT

The Virus Variation Resource is a value-added viral sequence data resource hosted by the National Center for Biotechnology Information. The resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, Dengue virus, West Nile virus, Ebolavirus, MERS coronavirus, Rotavirus A and Zika virus. Each module is supported by pipelines that scan newly released GenBank records, annotate genes and proteins, parse sample descriptors, and then map them to controlled vocabulary. These processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. Once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user-directed activities. This manuscript describes a series of features and functionalities recently added to the Virus Variation Resource.

Subjects

Computational Biology/methods , Disease Outbreaks , Genetic Variation , Software , Virus Diseases/epidemiology , Virus Diseases/virology , Viruses/genetics , Genetic Databases

10.

NCBI prokaryotic genome annotation pipeline.

Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James.

Nucleic Acids Res ; 44(14): 6614-24, 2016 08 19.

Article in English | MEDLINE | ID: mdl-27342282

ABSTRACT

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

Subjects

Bacterial Genome , Molecular Sequence Annotation , Prokaryotic Cells/metabolism , Bacteria/genetics , Bacterial Proteins/chemistry , Nucleic Acid Databases , Bacterial Genes

11.

Rfam 12.0: updates to the RNA families database.

Nawrocki, Eric P; Burge, Sarah W; Bateman, Alex; Daub, Jennifer; Eberhardt, Ruth Y; Eddy, Sean R; Floden, Evan W; Gardner, Paul P; Jones, Thomas A; Tate, John; Finn, Robert D.

Nucleic Acids Res ; 43(Database issue): D130-7, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25392425

ABSTRACT

The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.

Subjects

Nucleic Acid Databases , Untranslated RNA/chemistry , Genomics , Internet , Molecular Sequence Annotation , Nucleic Acid Conformation , Nucleotide Motifs , Long Noncoding RNA/chemistry , Untranslated RNA/classification , Software

12.

Rfam 11.0: 10 years of RNA families.

Burge, Sarah W; Daub, Jennifer; Eberhardt, Ruth; Tate, John; Barquist, Lars; Nawrocki, Eric P; Eddy, Sean R; Gardner, Paul P; Bateman, Alex.

Nucleic Acids Res ; 41(Database issue): D226-32, 2013 Jan.

Article in English | MEDLINE | ID: mdl-23125362

ABSTRACT

The Rfam database (available via the website at http://rfam.sanger.ac.uk and through our mirror at http://rfam.janelia.org) is a collection of non-coding RNA families, primarily RNAs with a conserved RNA secondary structure, including both RNA genes and mRNA cis-regulatory elements. Each family is represented by a multiple sequence alignment, predicted secondary structure and covariance model. Here we discuss updates to the database in the latest release, Rfam 11.0, including the introduction of genome-based alignments for large families, the introduction of the Rfam Biomart as well as other user interface improvements. Rfam is available under the Creative Commons Zero license.

Subjects

Nucleic Acid Databases , Untranslated RNA/chemistry , Untranslated RNA/classification , Base Sequence , Genomics , Internet , Molecular Sequence Annotation , Nucleic Acid Conformation , Untranslated RNA/genetics , Sequence Alignment , User-Computer Interface

13.

Infernal 1.1: 100-fold faster RNA homology searches.

Nawrocki, Eric P; Eddy, Sean R.

Bioinformatics ; 29(22): 2933-5, 2013 Nov 15.

Article in English | MEDLINE | ID: mdl-24008419

ABSTRACT

SUMMARY: Infernal builds probabilistic profiles of the sequence and secondary structure of an RNA family called covariance models (CMs) from structurally annotated multiple sequence alignments given as input. Infernal uses CMs to search for new family members in sequence databases and to create potentially large multiple sequence alignments. Version 1.1 of Infernal introduces a new filter pipeline for RNA homology search based on accelerated profile hidden Markov model (HMM) methods and HMM-banded CM alignment methods. This enables ~100-fold acceleration over the previous version and ~10 000-fold acceleration over exhaustive non-filtered CM searches. AVAILABILITY: Source code, documentation and the benchmark are downloadable from http://infernal.janelia.org. Infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Documentation includes a user's guide with a tutorial, a discussion of file formats and user options and additional details on methods implemented in the software. CONTACT: nawrockie@janelia.hhmi.org

Subjects

RNA/chemistry , Sequence Alignment/methods , RNA Sequence Analysis , Nucleic Acid Sequence Homology , Software , Algorithms , Nucleic Acid Conformation

14.

Influenza sequence validation and annotation using VADR.

Calhoun, Vincent C; Hatcher, Eneida L; Yankie, Linda; Nawrocki, Eric P.

bioRxiv ; 2024 Mar 25.

Article in English | MEDLINE | ID: mdl-38712272

ABSTRACT

Tens of thousands of influenza sequences are deposited into the GenBank database each year. The software tool FLAN has been used by GenBank since 2007 to validate and annotate incoming influenza sequence submissions, and has been publicly available as a webserver but not as a standalone tool. VADR is a general sequence validation and annotation software package used by GenBank for Norovirus, Dengue virus and SARS-CoV-2 virus sequence processing that is available as a standalone tool. We have created VADR influenza models based on the FLAN reference sequences and adapted VADR to accurately annotate influenza sequences. VADR and FLAN show consistent results on the vast majority of influenza sequences, and when they disagree VADR is usually correct. VADR can also accurately process influenza D sequences as well as influenza A H17, H18, H19, N10 and N11 subtype sequences, which FLAN cannot. VADR 1.6.3 and the associated influenza models are now freely available for users to download and use.

15.

Computational identification of functional RNA homologs in metagenomic data.

Nawrocki, Eric P; Eddy, Sean R.

RNA Biol ; 10(7): 1170-9, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23722291

ABSTRACT

A key step toward understanding a metagenomics data set is the identification of functional sequence elements within it, such as protein coding genes and structural RNAs. Relative to protein coding genes, structural RNAs are more difficult to identify because of their reduced alphabet size, lack of open reading frames, and short length. Infernal is a software package that implements "covariance models" (CMs) for RNA homology search, which harness both sequence and structural conservation when searching for RNA homologs. Thanks to the added statistical signal inherent in the secondary structure conservation of many RNA families, Infernal is more powerful than sequence-only based methods such as BLAST and profile HMMs. Together with the Rfam database of CMs, Infernal is a useful tool for identifying RNAs in metagenomics data sets.

Subjects

Computational Biology/methods , Metagenomics , RNA/chemistry , Algorithms , Nucleic Acid Databases , Nucleic Acid Conformation , RNA/genetics , Search Engine , Nucleic Acid Sequence Homology , Software

16.

RNIE: genome-wide prediction of bacterial intrinsic terminators.

Gardner, Paul P; Barquist, Lars; Bateman, Alex; Nawrocki, Eric P; Weinberg, Zasha.

Nucleic Acids Res ; 39(14): 5845-52, 2011 Aug.

Article in English | MEDLINE | ID: mdl-21478170

ABSTRACT

Bacterial Rho-independent terminators (RITs) are important genomic landmarks involved in gene regulation and terminating gene expression. In this investigation we present RNIE, a probabilistic approach for predicting RITs. The method is based upon covariance models which have been known for many years to be the most accurate computational tools for predicting homology in structural non-coding RNAs. We show that RNIE has superior performance in model species from a spectrum of bacterial phyla. Further analysis of species where a low number of RITs were predicted revealed a highly conserved structural sequence motif enriched near the genic termini of the pathogenic Actinobacteria, Mycobacterium tuberculosis. This motif, together with classical RITs, accounts for up to 90% of all the significantly structured regions from the termini of M. tuberculosis genic elements. The software, predictions and alignments described below are available from http://github.com/ppgardne/RNIE.

Subjects

Bacterial Genome , Statistical Models , Genetic Terminator Regions , Base Sequence , Conserved Sequence , Genomics/methods , Molecular Sequence Annotation , Mycobacterium tuberculosis/genetics , Software

17.

Rfam: Wikipedia, clans and the "decimal" release.

Gardner, Paul P; Daub, Jennifer; Tate, John; Moore, Benjamin L; Osuch, Isabelle H; Griffiths-Jones, Sam; Finn, Robert D; Nawrocki, Eric P; Kolbe, Diana L; Eddy, Sean R; Bateman, Alex.

Nucleic Acids Res ; 39(Database issue): D141-5, 2011 Jan.

Article in English | MEDLINE | ID: mdl-21062808

ABSTRACT

The Rfam database aims to catalogue non-coding RNAs through the use of sequence alignments and statistical profile models known as covariance models. In this contribution, we discuss the pros and cons of using the online encyclopedia, Wikipedia, as a source of community-derived annotation. We discuss the addition of groupings of related RNA families into clans and new developments to the website. Rfam is available on the Web at http://rfam.sanger.ac.uk.

Subjects

Nucleic Acid Databases , Untranslated RNA/chemistry , Encyclopedias as Topic , Statistical Models , Nucleic Acid Conformation , Untranslated RNA/classification , Sequence Alignment , RNA Sequence Analysis

18.

Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR.

Nawrocki, Eric P.

NAR Genom Bioinform ; 5(1): lqad002, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36685728

ABSTRACT

In 2020 and 2021, >1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurate validation and annotation. VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64 GB to 2 GB per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. VADR is now nearly 1000 times faster than it was in early 2020 for SARS-CoV-2 sequence processing. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month.

19.

The genome of the colonial hydroid Hydractinia reveals their stem cells utilize a toolkit of evolutionarily shared genes with all animals.

Schnitzler, Christine E; Chang, E Sally; Waletich, Justin; Quiroga-Artigas, Gonzalo; Wong, Wai Yee; Nguyen, Anh-Dao; Barreira, Sofia N; Doonan, Liam; Gonzalez, Paul; Koren, Sergey; Gahan, James M; Sanders, Steven M; Bradshaw, Brian; DuBuc, Timothy Q; de Jong, Danielle; Nawrocki, Eric P; Larson, Alexandra; Klasfeld, Samantha; Gornik, Sebastian G; Moreland, R Travis; Wolfsberg, Tyra G; Phillippy, Adam M; Mullikin, James C; Simakov, Oleg; Cartwright, Paulyn; Nicotra, Matthew; Frank, Uri; Baxevanis, Andreas D.

bioRxiv ; 2023 Aug 27.

Article in English | MEDLINE | ID: mdl-37786714

ABSTRACT

Hydractinia is a colonial marine hydroid that exhibits remarkable biological properties, including the capacity to regenerate its entire body throughout its lifetime, a process made possible by its adult migratory stem cells, known as i-cells. Here, we provide an in-depth characterization of the genomic structure and gene content of two Hydractinia species, H. symbiolongicarpus and H. echinata, placing them in a comparative evolutionary framework with other cnidarian genomes. We also generated and annotated a single-cell transcriptomic atlas for adult male H. symbiolongicarpus and identified cell type markers for all major cell types, including key i-cell markers. Orthology analyses based on the markers revealed that Hydractinia's i-cells are highly enriched in genes that are widely shared amongst animals, a striking finding given that Hydractinia has a higher proportion of phylum-specific genes than any of the other 41 animals in our orthology analysis. These results indicate that Hydractinia's stem cells and early progenitor cells may use a toolkit shared with all animals, making it a promising model organism for future exploration of stem cell biology and regenerative medicine. The genomic and transcriptomic resources for Hydractinia presented here will enable further studies of their regenerative capacity, colonial morphology, and ability to distinguish self from non-self.

20.

Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR.

Nawrocki, Eric P.

bioRxiv ; 2022 Apr 27.

Article in English | MEDLINE | ID: mdl-35547842

ABSTRACT

Background: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package that GenBank uses to automatically validate and annotate incoming viral sequences is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurate validation and annotation. Results: VADR has been updated to more accurately and rapidly annotate SARS-CoV-2 sequences. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64Gb to 2Gb per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. Conclusion: VADR is now nearly 1000 times faster than it was in early 2020 for processing SARS-CoV-2 sequences submitted to GenBank. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month. Version 1.4.1 is freely available (https://github.com/ncbi/vadr) for local installation and use.
