ABSTRACT
We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.
Subjects
Drug Resistance, Bacterial/genetics; Metagenomics; Microbiota/genetics; Urban Population; Biodiversity; Databases, Genetic; Humans
ABSTRACT
Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.
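The "in silico ORFs" idea mentioned above (a six-frame scan that accepts alternative start codons) can be illustrated with a minimal sketch. This is not the iPtgxDB software: the function names, the start-codon set (ATG, GTG, TTG) and the minimum-length cutoff are assumptions chosen for illustration, and translation to protein is omitted.

```python
# Hypothetical illustration of a six-frame ORF scan; not the iPtgxDB code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODONS = frozenset({"ATG", "GTG", "TTG"})  # assumption: common prokaryotic starts
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons, start_codons):
    """Yield (start, end) for each ORF in one reading frame.

    An ORF opens at the first start codon seen after the previous stop
    and closes at the next in-frame stop codon (end is exclusive and
    includes the stop codon).
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in start_codons:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=2, start_codons=START_CODONS):
    """Collect (strand, frame, orf_sequence) over all six reading frames."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons, start_codons):
                results.append((strand, frame, s[start:end]))
    return results


if __name__ == "__main__":
    # Toy sequence with a single GTG-initiated ORF on the forward strand.
    print(six_frame_orfs("CCGTGAAACCCTAGTT"))
```

Restricting `start_codons` to `{"ATG"}` on the same toy sequence returns no ORFs, which shows in miniature how a scan limited to canonical starts can miss candidates that an alternative-start scan retains.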
Subjects
Bacterial Proteins/genetics; Bartonella henselae/genetics; Bradyrhizobium/genetics; Escherichia coli/genetics; Genome, Bacterial; Proteogenomics; Databases, Protein; Molecular Sequence Annotation; Open Reading Frames; Software
ABSTRACT
INTRODUCTION: As one of the most prevalent chronic diseases in the United States, diabetes, especially type 2 diabetes, affects the health of millions of people and puts an enormous financial burden on the US economy. We aimed to develop predictive models to identify risk factors for type 2 diabetes, which could help facilitate early diagnosis and intervention and also reduce medical costs. METHODS: We analyzed cross-sectional data on 138,146 participants, including 20,467 with type 2 diabetes, from the 2014 Behavioral Risk Factor Surveillance System. We built several machine learning models for predicting type 2 diabetes, including support vector machine, decision tree, logistic regression, random forest, neural network, and Gaussian Naive Bayes classifiers. We used univariable and multivariable weighted logistic regression models to investigate the associations of potential risk factors with type 2 diabetes. RESULTS: All predictive models for type 2 diabetes achieved a high area under the curve (AUC), ranging from 0.7182 to 0.7949. Although the neural network model had the highest accuracy (82.4%), specificity (90.2%), and AUC (0.7949), the decision tree model had the highest sensitivity (51.6%) for type 2 diabetes. We found that people who slept 9 or more hours per day (adjusted odds ratio [aOR] = 1.13, 95% confidence interval [CI], 1.03-1.25) or had checkup frequency of less than 1 year (aOR = 2.31, 95% CI, 1.86-2.85) had higher risk for type 2 diabetes. CONCLUSION: Of the 8 predictive models, the neural network model gave the best model performance with the highest AUC value; however, the decision tree model is preferred for initial screening for type 2 diabetes because it had the highest sensitivity and, therefore, detection rate. We confirmed previously reported risk factors and also identified sleeping time and frequency of checkup as 2 new potential risk factors related to type 2 diabetes.
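The trade-off reported above, where one model leads on accuracy and specificity while another leads on sensitivity, follows from how these metrics are defined on a confusion matrix. The sketch below is a minimal illustration; the function name and the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts.

    tp/fn: cases with the condition that were detected/missed.
    tn/fp: cases without the condition that were cleared/flagged.
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of true cases detected
        "specificity": tn / (tn + fp),   # share of non-cases correctly cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical, imbalanced counts: 150 cases and 850 non-cases.
    metrics = classification_metrics(tp=60, fp=85, tn=765, fn=90)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With imbalanced classes such as these hypothetical counts, accuracy is dominated by the larger negative class, so a model can score well on accuracy while detecting fewer than half of the true cases. That is why a screening setting may favor the model with the highest sensitivity.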
Subjects
Decision Trees; Diabetes Mellitus, Type 2/epidemiology; Models, Biological; Neural Networks, Computer; Area Under Curve; Behavioral Risk Factor Surveillance System; Humans; Risk Factors
ABSTRACT
Chronic infection and associated inflammation are key contributors to human carcinogenesis. Ulcerative colitis (UC) is an oxyradical overload disease and is characterized by free radical stress and colon cancer proneness. Here we examined tissues from noncancerous colons of ulcerative colitis patients to determine (a) the activity of two base excision-repair enzymes, AAG, the major 3-methyladenine DNA glycosylase, and APE1, the major apurinic site endonuclease; and (b) the prevalence of microsatellite instability (MSI). AAG and APE1 were significantly increased in UC colon epithelium undergoing elevated inflammation and MSI was positively correlated with their imbalanced enzymatic activities. These latter results were supported by mechanistic studies using yeast and human cell models in which overexpression of AAG and/or APE1 was associated with frameshift mutations and MSI. Our results are consistent with the hypothesis that the adaptive and imbalanced increase in AAG and APE1 is a novel mechanism contributing to MSI in patients with UC and may extend to chronic inflammatory or other diseases with MSI of unknown etiology.
Subjects
Base Pair Mismatch; DNA Glycosylases/genetics; DNA Repair; DNA-(Apurinic or Apyrimidinic Site) Lyase/genetics; Inflammation/metabolism; Microsatellite Repeats; Antigens, CD/biosynthesis; Antigens, Differentiation, Myelomonocytic/biosynthesis; Colitis, Ulcerative/metabolism; Colon/metabolism; Colorectal Neoplasms/metabolism; Densitometry; Dose-Response Relationship, Drug; Frameshift Mutation; Humans; Immunohistochemistry; K562 Cells; Time Factors
ABSTRACT
The edgeR package, an R-based tool within the Bioconductor project, offers a flexible statistical framework for detection of changes in abundance based on counts. In this chapter, we illustrate the use of edgeR on a human embryonic stem cell dataset, in particular for RNA-seq and ChIP-seq data. We focus on a step-by-step statistical analysis of differential expression, going from raw data to a list of putative differentially expressed genes and give examples of integrative analysis using the ChIP-seq data. We emphasize data quality spot checks and the use of positive controls throughout the process and give practical recommendations for reproducible research.