ABSTRACT
High throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the development of the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), informed by interactions with numerous collaborating scientists. It includes mapping to terms from other data standards initiatives, including the Genomic Standards Consortium's minimal information (MIxS) and NCBI's BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. 
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.
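As an illustration of the four field groups named in the abstract, a record could be modeled as below. This is a minimal sketch: the class name, the field names, and the example values are hypothetical and are not taken from the GSCID/BRC standard itself, which defines its own fields and ontology mappings.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class SampleMetadata:
    """Hypothetical record with one or two fields per group named in the abstract."""
    # Organism or environmental source of the specimen
    specimen_source: Optional[str] = None
    # Spatial-temporal information about the isolation event
    collection_date: Optional[str] = None
    collection_location: Optional[str] = None
    # Phenotypic characteristics of the pathogen/vector isolate
    isolate_phenotype: Optional[str] = None
    # Project leadership and support
    project_lead: Optional[str] = None
    funding_source: Optional[str] = None


def missing_fields(record: SampleMetadata) -> List[str]:
    """Return the names of fields that have not been filled in, in declaration order."""
    return [name for name, value in asdict(record).items() if value is None]


# Illustrative values only; none of these come from a real submission.
record = SampleMetadata(
    specimen_source="Homo sapiens",
    collection_date="2012-06-01",
    collection_location="example country",
    project_lead="example investigator",
)
gaps = missing_fields(record)
print(gaps)
```

A completeness check of this kind is one simple way a repository could flag submissions that leave fields empty before the sequences are released.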
Subjects
Databases, Genetic/standards; Animals; Communicable Diseases/microbiology; Communicable Diseases/parasitology; Datasets as Topic; Disease Vectors; Gene Ontology; Genome; Humans; Reference Standards; Sequence Analysis, DNA; Virulence/genetics
ABSTRACT
Leptospirosis is a globally important, neglected zoonotic infection caused by spirochetes of the genus Leptospira. Since genetic transformation remains technically limited for pathogenic Leptospira, a systems biology pathogenomic approach was used to infer leptospiral virulence genes by whole genome comparison of culture-attenuated Leptospira interrogans serovar Lai with its virulent, isogenic parent. Among the 11 pathogen-specific protein-coding genes in which non-synonymous mutations were found, a putative soluble adenylate cyclase with host cell cAMP-elevating activity, and two members of a previously unstudied ~15-member paralogous gene family of unknown function were identified. This gene family was also uniquely found in the alpha-proteobacteria Bartonella bacilliformis and Bartonella australis that are geographically restricted to the Andes and Australia, respectively. How the pathogenic Leptospira and these two Bartonella species came to share this expanded gene family remains an evolutionary mystery. In vivo expression analyses demonstrated up-regulation of 10/11 Leptospira genes identified in the attenuation screen, and profound in vivo, tissue-specific up-regulation by members of the paralogous gene family, suggesting a direct role in virulence and host-pathogen interactions. The pathogenomic experimental design here is generalizable as a functional systems biology approach to studying bacterial pathogenesis and virulence and should encourage similar experimental studies of other pathogens.
Subjects
Bacterial Proteins/genetics; Genome, Bacterial; Leptospira interrogans/genetics; Leptospira interrogans/pathogenicity; Leptospirosis/microbiology; Virulence Factors/genetics; Animals; Bacterial Proteins/biosynthesis; Bartonella/genetics; Cricetinae; DNA Mutational Analysis; Gene Expression Regulation, Bacterial; Mesocricetus; Sequence Analysis, DNA; Virulence Factors/biosynthesis
ABSTRACT
Although biofilms have been shown to be reservoirs of pathogens, our knowledge of the microbial diversity in biofilms within critical areas, such as health care facilities, is limited. Available methods for pathogen identification and strain typing have some inherent restrictions. In particular, culturing will yield only a fraction of the species present, PCR of virulence or marker genes is mainly focused on a handful of known species, and shotgun metagenomics is limited in the ability to detect strain variations. In this study, we present a single-cell genome sequencing approach to address these limitations and demonstrate it by specifically targeting bacterial cells within a complex biofilm from a hospital bathroom sink drain. A newly developed, automated platform was used to generate genomic DNA by the multiple displacement amplification (MDA) technique from hundreds of single cells in parallel. MDA reactions were screened and classified by 16S rRNA gene PCR sequence, which revealed a broad range of bacteria covering 25 different genera representing environmental species, human commensals, and opportunistic human pathogens. Here we focus on the recovery of a nearly complete genome representing a novel strain of the periodontal pathogen Porphyromonas gingivalis (P. gingivalis JCVI SC001) using the single-cell assembly tool SPAdes. Single-cell genomics is becoming an accepted method to capture novel genomes, primarily in the marine and soil environments. Here we show for the first time that it also enables comparative genomic analysis of strain variation in a pathogen captured from complex biofilm samples in a healthcare facility.
Subjects
Biofilms; High-Throughput Nucleotide Sequencing; Porphyromonas gingivalis/genetics; Single-Cell Analysis; Bacteroidaceae Infections/genetics; Bacteroidaceae Infections/microbiology; Cross Infection/genetics; Cross Infection/microbiology; Genome, Bacterial; Humans; Porphyromonas gingivalis/pathogenicity
ABSTRACT
CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute of Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. The CharProtDB has been expanded by import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced with the source; ideally a journal reference, or, if imported and lacking one, the original database source.
Subjects
Databases, Protein; Molecular Sequence Annotation; Proteins/chemistry; Proteins/genetics; Proteins/physiology
ABSTRACT
Yersinia pestis is the causative agent of the plague. Y. pestis KIM 10+ strain was passaged and selected for loss of the 102 kb pgm locus, resulting in an attenuated strain, KIM D27. In this study, whole genome sequencing was performed on KIM D27 in order to identify any additional differences. Initial assemblies of 454 data were highly fragmented, and various bioinformatic tools detected between 15 and 465 SNPs and INDELs when comparing both strains, the vast majority associated with A or T homopolymer sequences. Consequently, Illumina sequencing was performed to improve the quality of the assembly. Hybrid sequence assemblies were performed and a total of 56 validated SNP/INDELs and 5 repeat differences were identified in the D27 strain relative to the published KIM 10+ sequence. However, further analysis showed that 55 of these SNP/INDELs and 3 repeats were errors in the KIM 10+ reference sequence. We conclude that both 454 and Illumina sequencing were required to obtain the most accurate and rapid sequence results for Y. pestis KIM D27. SNP and INDEL calls were most accurate when both Newbler and CLC Genomics Workbench were employed. For purposes of obtaining high quality genome sequence differences between strains, any identified differences should be verified in both the new and reference genomes.
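The abstract does not say how the Newbler and CLC Genomics Workbench calls were combined. One plausible reading, sketched below purely as an assumption, is to keep variants reported by both tools and set aside caller-specific ones for manual review. The `Variant` type, the function names, and the example calls are hypothetical and do not reflect either tool's output format.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record: position plus reference and alternate alleles."""
    pos: int
    ref: str
    alt: str


def concordant_calls(calls_a: Iterable[Variant], calls_b: Iterable[Variant]) -> List[Variant]:
    """Variants reported by both callers, sorted by position."""
    return sorted(set(calls_a) & set(calls_b))


def caller_specific(
    calls_a: Iterable[Variant], calls_b: Iterable[Variant]
) -> Tuple[List[Variant], List[Variant]]:
    """Variants reported by only one caller; candidates for manual verification."""
    set_a, set_b = set(calls_a), set(calls_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Made-up calls for illustration; not data from the study.
newbler_calls = [Variant(1200, "A", "G"), Variant(5310, "AT", "A"), Variant(9001, "C", "T")]
clc_calls = [Variant(1200, "A", "G"), Variant(9001, "C", "T"), Variant(15000, "G", "GA")]

shared = concordant_calls(newbler_calls, clc_calls)
only_newbler, only_clc = caller_specific(newbler_calls, clc_calls)
print(len(shared), len(only_newbler), len(only_clc))
```

Agreement between callers does not tell you which genome holds the error. As the abstract notes, a difference still has to be checked against both the new assembly and the reference, since 55 of the 56 validated SNP/INDELs turned out to be errors in the reference.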
Subjects
Computational Biology/methods; Sequence Analysis, DNA/methods; Yersinia pestis/genetics; Yersinia pestis/metabolism; DNA Primers/genetics; Genome, Bacterial; Humans; Polymorphism, Single Nucleotide; Repetitive Sequences, Nucleic Acid; Reproducibility of Results; Species Specificity; Virulence
ABSTRACT
The human microbiome refers to the community of microorganisms, including prokaryotes, viruses, and microbial eukaryotes, that populate the human body. The National Institutes of Health launched an initiative that focuses on describing the diversity of microbial species that are associated with health and disease. The first phase of this initiative includes the sequencing of hundreds of microbial reference genomes, coupled to metagenomic sequencing from multiple body sites. Here we present results from an initial reference genome sequencing of 178 microbial genomes. From 547,968 predicted polypeptides that correspond to the gene complement of these strains, previously unidentified ("novel") polypeptides that had both unmasked sequence length greater than 100 amino acids and no BLASTP match to any nonreference entry in the nonredundant subset were defined. This analysis resulted in a set of 30,867 polypeptides, of which 29,987 (approximately 97%) were unique. In addition, this set of microbial genomes allows for approximately 40% of random sequences from the microbiome of the gastrointestinal tract to be associated with organisms based on the match criteria used. Insights into pan-genome analysis suggest that we are still far from saturating microbial species genetic data sets. In addition, the associated metrics and standards used by our group for quality assurance are presented.
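The two criteria for a "novel" polypeptide stated above (unmasked length greater than 100 amino acids, and no BLASTP match to a nonreference entry) can be written as a simple filter. The sketch below is illustrative only: the `Polypeptide` class, its field names, and the example records are hypothetical, and the BLASTP result is assumed to have been computed elsewhere and stored as a boolean.

```python
from dataclasses import dataclass

MIN_UNMASKED_LENGTH = 100  # threshold stated in the abstract (strictly greater than)


@dataclass(frozen=True)
class Polypeptide:
    """Hypothetical record summarizing one predicted polypeptide."""
    identifier: str
    unmasked_length: int               # amino acids remaining after masking
    has_nonreference_blastp_hit: bool  # precomputed BLASTP outcome (assumed input)


def is_novel(p: Polypeptide) -> bool:
    """Apply both criteria: long enough and no nonreference BLASTP match."""
    return p.unmasked_length > MIN_UNMASKED_LENGTH and not p.has_nonreference_blastp_hit


# Made-up records for illustration.
examples = [
    Polypeptide("pp1", 250, False),  # long, no hit: novel
    Polypeptide("pp2", 100, False),  # exactly 100: fails the length criterion
    Polypeptide("pp3", 400, True),   # has a nonreference hit: not novel
]
novel_ids = [p.identifier for p in examples if is_novel(p)]

# Check of the reported proportion: 29,987 unique out of 30,867 novel polypeptides.
unique_fraction = 29_987 / 30_867
print(novel_ids, f"{unique_fraction:.1%}")
```

The final line reproduces the abstract's "approximately 97%" from the two counts it reports.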
Subjects
Genome, Bacterial; Metagenome/genetics; Sequence Analysis, DNA; Bacteria/classification; Bacteria/genetics; Bacterial Proteins/chemistry; Bacterial Proteins/genetics; Biodiversity; Computational Biology; Databases, Genetic; Gastrointestinal Tract/microbiology; Genes, Bacterial; Genetic Variation; Genome, Archaeal; Humans; Metagenomics/methods; Metagenomics/standards; Mouth/microbiology; Peptides/chemistry; Peptides/genetics; Phylogeny; Respiratory System/microbiology; Sequence Analysis, DNA/standards; Skin/microbiology; Urogenital System/microbiology
ABSTRACT
Generation of syntactically correct and unambiguous names for proteins is a challenging, yet vital task for functional annotation processes. Proteins are often named based on homology to known proteins, many of which have problematic names. To address the need to generate high-quality protein names, and capture our significant experience correcting protein names manually, we have developed the Protein Naming Utility (PNU, http://www.jcvi.org/pn-utility). The PNU is a web-based database for storing and applying naming rules to identify and correct syntactically incorrect protein names, or to replace synonyms with their preferred name. The PNU allows users to generate and manage collections of naming rules, optionally building upon the growing body of rules generated at the J. Craig Venter Institute (JCVI). Since communities often enforce disparate conventions for naming proteins, the PNU supports grouping rules into user-managed collections. Users can check their protein names against a selected PNU rule collection, generating both statistics and corrected names. The PNU can also be used to correct GenBank table files prior to submission to GenBank. Currently, the database features 3080 manual rules that have been entered by JCVI Bioinformatics Analysts as well as 7458 automatically imported names.
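The workflow described above (rules grouped into collections, names checked against a chosen collection, statistics and corrected names returned) can be sketched as follows. This is not the PNU's implementation or API: the class names, the regular-expression rule format, and the two example rules are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NamingRule:
    """Hypothetical rule: a regular expression and the text that replaces each match."""
    pattern: str
    replacement: str


@dataclass
class RuleCollection:
    """A named, user-managed group of rules applied in order."""
    name: str
    rules: List[NamingRule] = field(default_factory=list)

    def correct(self, protein_name: str) -> str:
        corrected = protein_name
        for rule in self.rules:
            corrected = re.sub(rule.pattern, rule.replacement, corrected, flags=re.IGNORECASE)
        return corrected.strip()

    def check(self, names: List[str]) -> Tuple[List[str], Dict[str, int]]:
        """Return corrected names plus simple statistics on how many changed."""
        corrected = [self.correct(n) for n in names]
        changed = sum(1 for old, new in zip(names, corrected) if old != new)
        return corrected, {"checked": len(names), "changed": changed}


# Illustrative rules only; real PNU rules are curated by analysts.
collection = RuleCollection(
    name="example-collection",
    rules=[
        NamingRule(r"\s{2,}", " "),                   # collapse repeated whitespace
        NamingRule(r"\bprotein protein\b", "protein"),  # drop a duplicated word
    ],
)
names = ["ATP  synthase subunit alpha", "hypothetical protein protein", "DNA gyrase subunit B"]
corrected_names, stats = collection.check(names)
print(corrected_names, stats)
```

Keeping rules in separate collections mirrors the abstract's point that communities enforce different naming conventions, so one group's rules need not affect another's.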
Subjects
Computational Biology/methods; Databases, Genetic; Databases, Protein; Proteins/chemistry; Terminology as Topic; Algorithms; Animals; Automation; Computational Biology/trends; Genome; Humans; Information Storage and Retrieval/methods; Internet; Software
ABSTRACT
We present the complete 2,843,201-bp genome sequence of Treponema denticola (ATCC 35405), an oral spirochete associated with periodontal disease. Analysis of the T. denticola genome reveals factors mediating coaggregation, cell signaling, stress protection, and other competitive and cooperative measures, consistent with its pathogenic nature and lifestyle within the mixed-species environment of subgingival dental plaque. Comparisons with previously sequenced spirochete genomes revealed specific factors contributing to differences and similarities in spirochete physiology as well as pathogenic potential. The T. denticola genome is considerably larger than the genome of the related syphilis-causing spirochete Treponema pallidum. The differences in gene content appear to be attributable to a combination of three phenomena: genome reduction, lineage-specific expansions, and horizontal gene transfer. Genes lost due to reductive evolution appear to be largely involved in metabolism and transport, whereas some of the genes that have arisen due to lineage-specific expansions are implicated in various pathogenic interactions, and genes acquired via horizontal gene transfer are largely phage-related or of unknown function.