ABSTRACT
BACKGROUND: In shotgun metagenomics, microbial communities are studied through direct sequencing of DNA without any prior cultivation. By comparing gene abundances estimated from the generated sequencing reads, functional differences between the communities can be identified. However, gene abundance data is affected by high levels of systematic variability, which can greatly reduce the statistical power and introduce false positives. Normalization, the process by which systematic variability is identified and removed, is therefore a vital part of the data analysis. A wide range of normalization methods for high-dimensional count data has been proposed, but their performance on the analysis of shotgun metagenomic data has not been evaluated. RESULTS: Here, we present a systematic evaluation of nine normalization methods for gene abundance data. The methods were evaluated through resampling of three comprehensive datasets, creating a realistic setting that preserved the unique characteristics of metagenomic data. Performance was measured in terms of the methods' ability to identify differentially abundant genes (DAGs), correctly calculate unbiased p-values and control the false discovery rate (FDR). Our results showed that the choice of normalization method has a large impact on the end results. When the DAGs were asymmetrically present between the experimental conditions, many normalization methods had a reduced true positive rate (TPR) and a high false positive rate (FPR). The methods trimmed mean of M-values (TMM) and relative log expression (RLE) had the overall highest performance and are therefore recommended for the analysis of gene abundance data. For larger sample sizes, cumulative sum scaling (CSS) also showed satisfactory performance. CONCLUSIONS: This study emphasizes the importance of selecting a suitable normalization method in the analysis of data from shotgun metagenomics. Our results also demonstrate that improper methods may result in unacceptably high levels of false positives, which in turn may lead to incorrect or obfuscated biological interpretation.
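To make the idea of a scaling normalization concrete, the following is a minimal sketch of relative log expression (RLE) size factors as the method is commonly defined (the per-sample median of ratios to a per-gene geometric-mean reference). It is not the evaluation pipeline used in the study; the function names and the toy count matrix are assumptions made purely for illustration.

```python
import numpy as np


def rle_size_factors(counts):
    """Return one RLE (median-of-ratios) size factor per sample.

    counts: 2-D array-like, rows = genes, columns = samples.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep only genes observed in every sample so that the log is defined.
    positive = np.all(counts > 0, axis=1)
    if not positive.any():
        raise ValueError("no gene has non-zero counts in all samples")
    log_counts = np.log(counts[positive])
    # Log of the per-gene geometric mean across samples (pseudo-reference).
    log_reference = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median ratio of a sample to the reference, over genes.
    return np.exp(np.median(log_counts - log_reference, axis=0))


def rle_normalize(counts):
    """Divide each sample (column) by its RLE size factor."""
    return np.asarray(counts, dtype=float) / rle_size_factors(counts)


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([[10, 20], [100, 200], [5, 10], [0, 7]])
factors = rle_size_factors(counts)
normalized = rle_normalize(counts)
print(factors)
print(normalized)
```

Because the size factor is a median over genes, a minority of differentially abundant genes has limited influence on it, which is one plausible reason such methods cope better with asymmetric effects than scaling by total counts.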
Subjects
Data Analysis , Metagenomics
ABSTRACT
Biology is increasingly dependent on large-scale analysis, such as proteomics, creating a requirement for efficient bioinformatics. Bioinformatic predictions of biological functions rely upon correctly annotated database sequences, and the presence of inaccurately annotated or otherwise poorly described sequences introduces noise and bias to biological analyses. Accurate annotations are, for example, pivotal for correct identification of polypeptide fragments. However, standards for how sequence databases are organized and presented are currently insufficient. Here, we propose five strategies to address fundamental issues in the annotation of sequence databases: (i) to clearly separate experimentally verified and unverified sequence entries; (ii) to enable a system for tracing the origins of annotations; (iii) to separate entries with high-quality, informative annotation from less useful ones; (iv) to integrate automated quality-control software whenever such tools exist; and (v) to facilitate postsubmission editing of annotations and metadata associated with sequences. We believe that implementation of these strategies, for example as requirements for publication of database papers, would enable biology to better take advantage of large-scale data.
Subjects
Computational Biology/methods , Databases, Protein , Software , Quality Control , Sequence Analysis
ABSTRACT
BACKGROUND: Broad-spectrum fluoroquinolone antibiotics are central in modern health care and are used to treat and prevent a wide range of bacterial infections. The recently discovered qnr genes provide a mechanism of resistance with the potential to rapidly spread between bacteria using horizontal gene transfer. As for many antibiotic resistance genes present in pathogens today, qnr genes are hypothesized to originate from environmental bacteria. The vast amount of data generated by shotgun metagenomics can therefore be used to explore the diversity of qnr genes in more detail. RESULTS: In this paper we describe a new method to identify qnr genes in nucleotide sequence data. We show, using cross-validation, that the method has high statistical power for correctly classifying sequences from novel classes of qnr genes, even for fragments as short as 100 nucleotides. Based on sequences from public repositories, the method was able to identify all previously reported plasmid-mediated qnr genes. In addition, several fragments from novel putative qnr genes were identified in metagenomes. The method was also able to annotate 39 chromosomal variants, of which 11 have not previously been reported in the literature. CONCLUSIONS: The method described in this paper significantly improves the sensitivity and specificity of identification and annotation of qnr genes in nucleotide sequence data. The predicted novel putative qnr genes in the metagenomic data support the hypothesis of a large and uncharacterized diversity within this family of resistance genes in environmental bacterial communities. An implementation of the method is freely available at http://bioinformatics.math.chalmers.se/qnr/.
Subjects
Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Escherichia coli Proteins/genetics , Fluoroquinolones/pharmacology , Metagenome/genetics , Multigene Family/genetics , Base Sequence , High-Throughput Nucleotide Sequencing , Markov Chains , Models, Genetic , Sequence Alignment
ABSTRACT
INTRODUCTION: Neurotrophic tyrosine receptor kinase (NTRK) gene fusions are oncogenic drivers in various tumor types. Limited data exist on the overall survival (OS) of patients with tumors with NTRK gene fusions and on the co-occurrence of NTRK fusions with other oncogenic drivers. MATERIALS AND METHODS: This retrospective study included patients enrolled in the Genomics England 100,000 Genomes Project who had linked clinical data from UK databases. Patients who had undergone tumor whole genome sequencing between March 2016 and July 2019 were included. Patients with and without NTRK fusions were matched. OS was analyzed along with oncogenic alterations in ALK, BRAF, EGFR, ERBB2, KRAS, and ROS1, as well as tumor mutation burden (TMB) and microsatellite instability (MSI). RESULTS: Of 15,223 patients analyzed, 38 (0.25%) had NTRK gene fusions in 11 tumor types, the most common of which were breast cancer, colorectal cancer (CRC), and sarcoma. Median OS was not reached in either the NTRK gene fusion-positive or -negative group (hazard ratio 1.47, 95% CI 0.39-5.57, P = 0.572). A KRAS mutation was identified in two (5%) patients with NTRK gene fusions, and both had hepatobiliary cancer. High TMB and MSI were both more common in patients with NTRK gene fusions, due to the CRC subset. While there was a higher risk of death in patients with NTRK gene fusions compared to those without, the difference was not statistically significant. CONCLUSION: This study supports the hypothesis that NTRK gene fusions are primary oncogenic drivers and that the co-occurrence of NTRK gene fusions with other oncogenic alterations is rare.
Subjects
Neoplasms , Receptor, trkA , Humans , Receptor, trkA/genetics , Protein-Tyrosine Kinases/genetics , Retrospective Studies , Proto-Oncogene Proteins/genetics , Neoplasms/genetics
ABSTRACT
Tumor DNA circulates in the plasma of cancer patients admixed with DNA from noncancerous cells. The genomic landscape of plasma DNA has been characterized in metastatic castration-resistant prostate cancer (mCRPC), but the plasma methylome has not been extensively explored. Here, we performed next-generation sequencing (NGS) on plasma DNA with and without bisulfite treatment from mCRPC patients receiving either abiraterone or enzalutamide in the pre- or post-chemotherapy setting. Principal component analysis on the mCRPC plasma methylome indicated that the main contributor to methylation variance (principal component one, or PC1) was strongly correlated with genomically determined tumor fraction (r = -0.96; P < 10^-8) and characterized by hypermethylation of targets of the polycomb repressor complex 2 components. Further deconvolution of the PC1 top-correlated segments revealed that these segments are composed of methylation patterns specific to either prostate cancer or normal prostate epithelium. To extract information specific to an individual's cancer, we then focused on an orthogonal methylation signature, which revealed enrichment for androgen receptor binding sequences and hypomethylation of these segments associated with AR copy number gain. Individuals harboring this methylation pattern had a more aggressive clinical course. Plasma methylome analysis can accurately quantitate tumor fraction and identify distinct biologically relevant mCRPC phenotypes.
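The relationship between PC1 and tumor fraction can be illustrated with a small simulation. The sketch below is not the study's analysis and uses no patient data: it assumes, purely for illustration, that each plasma methylation profile is a mixture of one hypothetical tumor profile and one hypothetical normal profile weighted by tumor fraction, and then checks how strongly the first principal component tracks that fraction. All variable names, sample sizes, and noise levels are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_segments = 30, 200  # hypothetical sizes

# Synthetic inputs (assumptions, not measured values).
tumor_fraction = rng.uniform(0.05, 0.8, size=n_samples)
normal_profile = rng.uniform(0.0, 1.0, size=n_segments)
tumor_profile = rng.uniform(0.0, 1.0, size=n_segments)

# Each sample is a tumor/normal mixture plus small measurement noise.
# Noise may push a few values marginally outside [0, 1]; this is
# harmless for the illustration.
methylation = (
    np.outer(1.0 - tumor_fraction, normal_profile)
    + np.outer(tumor_fraction, tumor_profile)
    + rng.normal(0.0, 0.02, size=(n_samples, n_segments))
)

# PCA via singular value decomposition of the column-centered matrix.
centered = methylation - methylation.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1_scores = u[:, 0] * s[0]
explained = s[0] ** 2 / np.sum(s ** 2)

# Pearson correlation between PC1 scores and the simulated tumor fraction.
r = np.corrcoef(pc1_scores, tumor_fraction)[0, 1]
print(f"PC1 explains {explained:.1%} of variance; r = {r:.3f}")
```

The sign of a principal component is arbitrary, so only the magnitude of the correlation is meaningful in such a comparison.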
Subjects
Circulating Tumor DNA , DNA Methylation , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Prostatic Neoplasms , Adult , Aged , Aged, 80 and over , Circulating Tumor DNA/blood , Circulating Tumor DNA/genetics , Genome-Wide Association Study , Humans , Male , Middle Aged , Neoplasm Metastasis , Prostatic Neoplasms/blood , Prostatic Neoplasms/genetics , Prostatic Neoplasms/pathology
ABSTRACT
Integrons are genetic elements that facilitate horizontal gene transfer in bacteria and are known to harbor genes associated with antibiotic resistance. Gene mobility in integrons is governed by the presence of attC sites, which are imperfect inverted repeats 55 to 141 nucleotides long. Here we present HattCI, a new method for fast and accurate identification of attC sites in large DNA data sets. The method is based on a generalized hidden Markov model that describes each core component of an attC site individually. Using twofold cross-validation experiments on a manually curated reference data set of 231 attC sites from class 1 and 2 integrons, HattCI showed high sensitivities of up to 91.9% while maintaining satisfactory false-positive rates. When applied to a metagenomic data set of 35 microbial communities from different environments, HattCI found a substantially higher number of attC sites in the samples that are known to contain more horizontally transferred elements. HattCI will significantly increase the ability to identify attC sites and thus integron-mediated genes in genomic and metagenomic data. HattCI is implemented in C and is freely available at http://bioinformatics.math.chalmers.se/HattCI.
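To illustrate what "imperfect inverted repeat" means, the sketch below scores how well the two ends of a sequence pair with each other as reverse complements. This is a deliberately simple stand-in and not HattCI's generalized hidden Markov model; the function names, the fixed arm length, and the example sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def inverted_repeat_identity(seq, arm_length):
    """Fraction of positions at which the first `arm_length` bases match
    the reverse complement of the last `arm_length` bases.

    1.0 means a perfect inverted repeat; lower values mean an imperfect one.
    """
    if arm_length <= 0 or 2 * arm_length > len(seq):
        raise ValueError("arm_length must be positive and fit twice in seq")
    left = seq[:arm_length].upper()
    right_rc = reverse_complement(seq[-arm_length:])
    matches = sum(a == b for a, b in zip(left, right_rc))
    return matches / arm_length


# Hypothetical toy sequences: two arms separated by a 4-base spacer.
perfect = "GTTAGC" + "AAAA" + "GCTAAC"
imperfect = "GTTAGC" + "AAAA" + "GCTTAC"  # one mismatch in the right arm
print(inverted_repeat_identity(perfect, 6))
print(inverted_repeat_identity(imperfect, 6))
```

A position-by-position score like this ignores insertions, deletions, and the conserved core motifs of real attC sites, which is the kind of structure a component-wise probabilistic model is better suited to capture.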