ABSTRACT
CRISPR-cas loci typically contain CRISPR arrays with unique spacers separating direct repeats. Spacers along with portions of adjacent repeats are transcribed and processed into CRISPR (cr) RNAs that target complementary sequences (protospacers) in mobile genetic elements, resulting in cleavage of the target DNA or RNA. Additional, standalone repeats in some CRISPR-cas loci produce distinct cr-like RNAs implicated in regulatory or other functions. We developed a computational pipeline to systematically predict crRNA-like elements by scanning for standalone repeat sequences that are conserved in closely related CRISPR-cas loci. Numerous crRNA-like elements were detected in diverse CRISPR-Cas systems, mostly of type I but also of subtype V-A. Standalone repeats often form mini-arrays containing two repeat-like sequences separated by a spacer that is partially complementary to promoter regions of cas genes, in particular cas8, or cargo genes located within CRISPR-Cas loci, such as toxins-antitoxins. We show experimentally that a mini-array from a type I-F1 CRISPR-Cas system functions as a regulatory guide. We also identified mini-arrays in bacteriophages that could abrogate CRISPR immunity by inhibiting effector expression. Thus, recruitment of CRISPR effectors for regulatory functions via spacers with partial complementarity to the target is a common feature of diverse CRISPR-Cas systems.
Subjects
CRISPR-Cas Systems , RNA , Repetitive Nucleic Acid Sequences
ABSTRACT
The analysis of deletions may reveal evolutionary trends and provide new insight into the surprising variability and rapidly spreading capability that SARS-CoV-2 has shown since its emergence. To understand the factors governing genomic stability, it is important to define the molecular mechanisms of deletions in the viral genome. In this work, we performed a statistical analysis of deletions. Specifically, we analyzed correlations between deletions in the SARS-CoV-2 genome and repetitive elements and documented a significant association of deletions with runs of identical (poly-) nucleotides and direct repeats. Our analyses of deletions in the accessory genes of SARS-CoV-2 suggested that there may be a hypervariability in ORF7A and ORF8 that is not associated with repetitive elements. Such recurrent search in a "sequence space" of accessory genes (that might be driven by natural selection) did not yet cause increased viability of the SARS-CoV-2 variants. However, deletions in the accessory genes may ultimately produce new variants that are more successful compared to the viral strains with the conventional architecture of the SARS-CoV-2 accessory genes.
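The association between deletions and runs of identical nucleotides described above can be illustrated with a minimal sketch. The function names, the minimum run length of 4 and the toy sequence are assumptions made for illustration; the abstract does not state the thresholds or the statistical test the authors used.

```python
import re


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of identical nucleotides.

    min_len is an illustrative threshold, not a value taken from the study.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


def deletion_overlaps_run(del_start, del_end, runs):
    """True if the half-open deletion interval [del_start, del_end) overlaps any run."""
    return any(del_start < start + length and start < del_end
               for start, _, length in runs)


# Toy sequence (hypothetical, not SARS-CoV-2 data).
runs = homopolymer_runs("ACGTTTTTTACGGATCCCCA")
```

Counting how many observed deletions overlap such runs, and comparing that count with the count for randomly placed intervals of the same lengths, would be one way to test the association; that comparison is left out here.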
ABSTRACT
Antimicrobial resistance (AMR) is a significant public health threat. Low-cost whole-genome sequencing, which is often used in surveillance programmes, provides an opportunity to assess AMR gene content in these genomes using in silico approaches. A variety of bioinformatic tools have been developed to identify these genomic elements. Most of those tools rely on reference databases of nucleotide or protein sequences and collections of models and rules for analysis. While the tools are critical for the identification of AMR genes, the databases themselves also provide significant utility for researchers, for applications ranging from sequence analysis to information about AMR phenotypes. Additionally, these databases can be evaluated by domain experts and others to ensure their accuracy. Here we describe how we curate the genes, point mutations, BLAST rules and hidden Markov models used in NCBI's AMRFinderPlus, along with the quality-control steps we take to ensure database quality. We also describe the web interfaces that display the full structure of the database and their newly developed cross-browser relationships. Then, using the Reference Gene Catalog as an example, we detail how the databases, rules and models are made publicly available, as well as how to access the software. In addition, as part of the Pathogen Detection system, we have analysed over 1 million publicly available genomes using AMRFinderPlus and its databases. We discuss how the computed analyses generated by those tools can be accessed through a web interface. Finally, we conclude with NCBI's plans to make these databases accessible over the long term.
Subjects
Computational Biology , Software , Amino Acid Sequence , Whole Genome Sequencing
ABSTRACT
Publicly available and validated DNA reference sequences useful for phylogeny estimation and identification of fungal pathogens are an increasingly important resource in the efforts of plant protection organizations to facilitate safe international trade of agricultural commodities. Colletotrichum species are among the most frequently encountered and regulated plant pathogens at U.S. ports-of-entry. The RefSeq Targeted Loci (RTL) project at NCBI (BioProject no. PRJNA177353) contains a database of curated fungal internal transcribed spacer (ITS) sequences that interact extensively with NCBI Taxonomy, resulting in verified name-strain-sequence type associations for >12,000 species. We present a publicly available dataset of verified and curated name-type strain-sequence associations for all available Colletotrichum species. This includes an updated GenBank Taxonomy for 238 species associated with up to 11 protein coding loci and an updated RTL ITS dataset for 226 species. We demonstrate that several marker loci are well suited for phylogenetic inference and identification. We improve understanding of phylogenetic relationships among verified species, verify or improve phylogenetic circumscriptions of 14 species complexes, and reveal that determining relationships among these major clades will require additional data. We present detailed comparisons between phylogenetic and similarity-based approaches to species identification, revealing complex patterns among single marker loci that often lead to misidentification when based on single-locus similarity approaches. We also demonstrate that species-level identification is elusive for a subset of samples regardless of analytical approach, which may be explained by novel species diversity in our dataset and incomplete lineage sorting and lack of accumulated synapomorphies at these loci.
Subjects
Colletotrichum , Colletotrichum/genetics , Commerce , DNA , Internationality , Phylogeny
ABSTRACT
Antimicrobial resistance (AMR) is a significant public health threat. With the rise of affordable whole genome sequencing, in silico approaches to assessing AMR gene content can be used to detect known resistance mechanisms and potentially identify novel mechanisms. To enable accurate assessment of AMR gene content, as part of a multi-agency collaboration, NCBI developed a comprehensive AMR gene database, the Bacterial Antimicrobial Resistance Reference Gene Database and the AMR gene detection tool AMRFinder. Here, we describe the expansion of the Reference Gene Database, now called the Reference Gene Catalog, to include putative acid, biocide, metal, stress resistance genes, in addition to virulence genes and species-specific point mutations. Genes and point mutations are classified by broad functions, as well as more detailed functions. As we have expanded both the functional repertoire of identified genes and functionality, NCBI released a new version of AMRFinder, known as AMRFinderPlus. This new tool allows users the option to utilize only the core set of AMR elements, or include stress response and virulence genes, too. AMRFinderPlus can detect acquired genes and point mutations in both protein and nucleotide sequence. In addition, the evidence used to identify the gene has been expanded to include whether nucleotide or protein sequence was used, its location in the contig, and presence of an internal stop codon. These database improvements and functional expansions will enable increased precision in identifying AMR genes, linking AMR genotypes and phenotypes, and determining possible relationships between AMR, virulence, and stress response.
Subjects
Anti-Bacterial Agents/pharmacology , Bacteria/drug effects , Genetic Databases , Bacterial Drug Resistance/genetics , Bacterial Genes , Bacteria/genetics , Bacteria/pathogenicity , Multiple Bacterial Drug Resistance/genetics , Bacterial Genome , Mercury/pharmacology , Plasmids , Salmonella/drug effects , Salmonella/genetics , Virulence/genetics
ABSTRACT
Antimicrobial resistance (AMR) is a major public health problem that requires publicly available tools for rapid analysis. To identify AMR genes in whole-genome sequences, the National Center for Biotechnology Information (NCBI) has produced AMRFinder, a tool that identifies AMR genes using a high-quality curated AMR gene reference database. The Bacterial Antimicrobial Resistance Reference Gene Database consists of up-to-date gene nomenclature, a set of hidden Markov models (HMMs), and a curated protein family hierarchy. Currently, it contains 4,579 antimicrobial resistance proteins and more than 560 HMMs. Here, we describe AMRFinder and its associated database. To assess the predictive ability of AMRFinder, we measured the consistency between predicted AMR genotypes from AMRFinder and resistance phenotypes of 6,242 isolates from the National Antimicrobial Resistance Monitoring System (NARMS). This included 5,425 Salmonella enterica, 770 Campylobacter spp., and 47 Escherichia coli isolates phenotypically tested against various antimicrobial agents. Of 87,679 susceptibility tests performed, 98.4% were consistent with predictions. To assess the accuracy of AMRFinder, we compared its gene symbol output with that of a 2017 version of ResFinder, another publicly available resistance gene detection system. Most gene calls were identical, but there were 1,229 gene symbol differences (8.8%) between them, with differences due to both algorithmic differences and database composition. AMRFinder missed 16 loci that ResFinder found, while ResFinder missed 216 loci that AMRFinder identified. Based on these results, AMRFinder appears to be a highly accurate AMR gene detection system.
ABSTRACT
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Subjects
Data Curation , Nucleic Acid Databases , Genome , Molecular Sequence Annotation , Prokaryotic Cells , Archaea/genetics , Bacteria/genetics , Protein Databases , Eukaryota/genetics , Forecasting , Humans , Sequence Homology , Software , Viruses/genetics
ABSTRACT
In 1994, analyses of clostridial 16S rRNA gene sequences led to the assignment of 18 species to Clostridium cluster XI, separating them from Clostridium sensu stricto (Clostridium cluster I). Subsequently, most cluster XI species have been assigned to the family Peptostreptococcaceae with some species being reassigned to new genera. However, several misclassified Clostridium species remained, creating a taxonomic conundrum and confusion regarding their status. Here, we have re-examined the phylogeny of cluster XI species by comparing the 16S rRNA gene-based trees with protein- and genome-based trees, where available. The resulting phylogeny of the Peptostreptococcaceae was consistent with the recent proposals on creating seven new genera within this family. This analysis also revealed a tight clustering of Clostridium litorale and Eubacterium acidaminophilum. Based on these data, we propose reassigning these two organisms to the new genus Peptoclostridium as Peptoclostridium litorale gen. nov. comb. nov. (the type species of the genus) and Peptoclostridium acidaminophilum comb. nov., respectively. As correctly noted in the original publications, the genera Acetoanaerobium and Proteocatella also fall within cluster XI, and can be assigned to the Peptostreptococcaceae. Clostridium sticklandii, which falls within radiation of genus Acetoanaerobium, is proposed to be reclassified as Acetoanaerobium sticklandii comb. nov. The remaining misnamed members of the Peptostreptococcaceae, [Clostridium] hiranonis, [Clostridium] paradoxum and [Clostridium] thermoalcaliphilum, still remain to be properly classified.
Subjects
Clostridium/classification , Eubacterium/classification , Phylogeny , Bacterial Typing Techniques , Base Composition , Bacterial DNA/genetics , 16S Ribosomal RNA/genetics , DNA Sequence Analysis
ABSTRACT
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
Subjects
Genetic Databases , Genomics , Animals , Cattle , Gene Expression Profiling , Fungal Genome , Human Genome , Microbial Genome , Plant Genome , Viral Genome , Genomics/standards , Humans , Invertebrates/genetics , Mice , Molecular Sequence Annotation , Nematoda/genetics , Phylogeny , Long Noncoding RNA/genetics , Rats , Reference Standards , Protein Sequence Analysis , RNA Sequence Analysis , Vertebrates/genetics
ABSTRACT
We present a large portion of the transcriptome of Zea mays, including ESTs representing 484,032 cDNA clones from 53 libraries and 36,565 fully sequenced cDNA clones, out of which 31,552 clones are non-redundant. These and other previously sequenced transcripts have been aligned with available genome sequences and have provided new insights into the characteristics of gene structures and promoters within this major crop species. We found that although the average number of introns per gene is about the same in corn and Arabidopsis, corn genes have more alternatively spliced isoforms. Examination of the nucleotide composition of coding regions reveals that corn genes, as well as genes of other Poaceae (Grass family), can be divided into two classes according to the GC content at the third position in the amino acid encoding codons. Many of the transcripts that have lower GC content at the third position have dicot homologs but the high GC content transcripts tend to be more specific to the grasses. The high GC content class is also enriched with intronless genes. Together this suggests that an identifiable class of genes in plants is associated with the Poaceae divergence. Furthermore, because many of these genes appear to be derived from ancestral genes that do not contain introns, this evolutionary divergence may be the result of horizontal gene transfer from species not only with different codon usage but possibly that did not have introns, perhaps outside of the plant kingdom. By comparing the cDNAs described herein with the non-redundant set of corn mRNAs in GenBank, we estimate that there are about 50,000 different protein coding genes in Zea. All of the sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers EU940701-EU977132 (FLI cDNA) and FK944382-FL482108 (EST).
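The GC content at the third codon position, which the abstract above uses to divide grass genes into two classes, can be computed as in the following minimal sketch. The function name `gc3` and the example sequences are illustrative assumptions. The abstract gives no cutoff separating the low and high classes, so none is applied here.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame coding sequence.

    Assumes cds starts at the first base of a codon; trailing bases that do
    not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    thirds = cds[2:usable:3]  # every third base, starting at codon position 3
    if not thirds:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


# Hypothetical toy coding sequences, not maize data.
high = gc3("ATGGCCAAGTAA")  # third positions: G, C, G, A
low = gc3("ATGAAATTTTAA")   # third positions: G, A, T, A
```

Plotting a histogram of this value over all coding sequences is a simple way to check whether the distribution is bimodal.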
Subjects
Complementary DNA/genetics , Plant Genes , Zea mays/genetics , Alternative Splicing , Base Sequence , DNA Primers , Expressed Sequence Tags , Genetic Promoter Regions , Genetic Transcription
ABSTRACT
Arabidopsis is currently the reference genome for higher plants. A new, more detailed statistical analysis of Arabidopsis gene structure is presented including intron and exon lengths, intergenic distances, features of promoters, and variant 5'-ends of mRNAs transcribed from the same transcription unit. We also provide a statistical characterization of Arabidopsis transcripts in terms of their size, UTR lengths, 3'-end cleavage sites, splicing variants, and coding potential. These analyses were facilitated by scrutiny of our collection of sequenced full-length cDNAs and much larger collection of 5'-ESTs, together with another set of full-length cDNAs from Salk/Stanford/Plant Gene Expression Center/RIKEN. Examples of alternative splicing are observed for transcripts from 7% of the genes and many of these genes display multiple spliced isoforms. Most splicing variants lie in non-coding regions of the transcripts. Non-canonical splice sites constitute less than 1% of all splice sites. Genes with fewer than four introns display reduced average mRNA levels. Putative alternative transcription start sites were observed in 30% of highly expressed genes and in more than 50% of the genes with low expression. Transcription start sites correlate remarkably well with a CG skew peak in the DNA sequences. The intergenic distances vary considerably, those where genes are transcribed towards one another being significantly shorter. New transcripts, missing in the current TIGR genome annotation and ESTs that are non-coding, including those antisense to known genes, are derived and cataloged in the Supplementary Material. They identify 148 new loci in the Arabidopsis genome. The conclusions drawn provide a better understanding of the Arabidopsis genome and how the gene transcripts are processed. The results also allow better predictions to be made for, as yet, poorly defined genes and provide a reference for comparisons with other plant genomes whose complete sequences are currently being determined. Some comparisons with rice are included in this paper.
Subjects
Arabidopsis/genetics , Complementary DNA/genetics , Plant Genes/genetics , Plant Genome , Alternative Splicing , Base Sequence , Intergenic DNA , Plant DNA/genetics , Exons/genetics , Gene Expression Profiling , Plant Gene Expression Regulation , Introns/genetics , Transcription Initiation Site
ABSTRACT
We have discovered a novel statistical feature of the Arabidopsis thaliana genome that correlates remarkably well with the position of the transcription start site: a CG skew peak. We hypothesize that the phenomenon can be explained by the higher mutability of unprotected cytosines.
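A CG skew profile of the kind referred to above can be sketched as follows. The definition (C - G) / (C + G) over a sliding window is a common convention and is assumed here. The abstract does not give the exact formula, window size or step the authors used, so the function name and the `window` and `step` defaults are illustrative only.

```python
def cg_skew(seq, window=50, step=10):
    """Return a list of (window_start, skew) pairs along seq.

    skew = (C - G) / (C + G) within each window; a window containing
    neither C nor G is assigned a skew of 0.0.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c, g = chunk.count("C"), chunk.count("G")
        profile.append((start, (c - g) / (c + g) if c + g else 0.0))
    return profile


# Toy example (hypothetical sequence): a C-rich window followed by a G-rich one.
profile = cg_skew("CCCCGGGG", window=4, step=4)
```

Aligning many sequences on their annotated transcription start sites and averaging such profiles position by position would show whether a peak appears near the start site.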