ABSTRACT
Whole-genome sequencing has become an essential tool for real-time genomic surveillance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) worldwide. The handling of raw next-generation sequencing (NGS) data is a major challenge for sequencing laboratories. We developed an easy-to-use web-based application (EPISEQ SARS-CoV-2) to analyse SARS-CoV-2 NGS data generated on common sequencing platforms using a variety of commercially available reagents. This application performs in one click a quality check, a reference-based genome assembly, and the analysis of the generated consensus sequence as to coverage of the reference genome, mutation screening and variant identification according to the up-to-date Nextstrain clade and Pango lineage. In this study, we validated the EPISEQ SARS-CoV-2 pipeline against a reference pipeline and compared the performance of NGS data generated by different sequencing protocols using EPISEQ SARS-CoV-2. We showed a strong agreement in SARS-CoV-2 clade and lineage identification (>99%) and in spike mutation detection (>99%) between EPISEQ SARS-CoV-2 and the reference pipeline. The comparison of several sequencing approaches using EPISEQ SARS-CoV-2 revealed 100% concordance in clade and lineage classification. It also uncovered reagent-related sequencing issues with a potential impact on SARS-CoV-2 mutation reporting. Altogether, EPISEQ SARS-CoV-2 allows an easy, rapid and reliable analysis of raw NGS data to support the sequencing efforts of laboratories with limited bioinformatics capacity and those willing to accelerate genomic surveillance of SARS-CoV-2.
Subject(s)
COVID-19, SARS-CoV-2, COVID-19/diagnosis, Viral Genome, High-Throughput Nucleotide Sequencing/methods, Humans, Mutation, SARS-CoV-2/genetics
ABSTRACT
We have previously studied carbapenem non-susceptible Pseudomonas aeruginosa (CNPA) strains from intensive care units (ICUs) in a referral hospital in Jakarta, Indonesia (Pelegrin et al., 2019). We documented that CNPA transmissions and acquisitions among patients were variable over time and that these were not significantly reduced by a set of infection control measures. Three high-risk international CNPA clones (sequence type (ST)235, ST823, ST357) dominated, and carbapenem resistance was due to carbapenemase-encoding genes and mutations in the porin OprD. Pelegrin et al. (2019) reported core genome analysis of these strains. We present a more refined and detailed whole genome-based analysis of major clones represented in the same dataset. To our knowledge, this is the first study reporting whole genome Single Nucleotide Polymorphism (wgSNP) analysis of Pseudomonas strains. With whole genome-based Multi Locus Sequence Typing (wgMLST) of the 3 CNPA clones (ST235, ST357 and ST823), three to eleven subgroups with up to 200 allelic variants were observed for each of the CNPA clones. Furthermore, we analyzed these CNPA clone clusters for the presence of wgSNPs to redefine CNPA transmission events during hospitalization. A maximum of 35,350 SNPs (including non-informative wgSNPs) and 398 SNPs (ST-specific_informative-wgSNPs) were found in ST235, 34,570 SNPs (including non-informative wgSNPs) and 111 SNPs (ST-specific_informative-wgSNPs) in ST357, and 26,443 SNPs (including non-informative wgSNPs) and 61 SNPs (ST-specific_informative-wgSNPs) in ST823. ST-specific_informative-wgSNPs were commonly found in sensor-response regulator genes. However, the majority of non-informative wgSNPs were found in conserved hypothetical proteins or in uncharacterized proteins. Of note, antibiotic resistance and virulence genes segregated according to the wgSNP analyses. On the basis of pairwise distances of informative-wgSNPs (0 to 4 SNPs) among the strains, a total of 8 transmission chains were traceable for ST235 strains, followed by 9 and 4 possible transmission chains for ST357 and ST823, respectively. The present study demonstrates the value of detailed whole genome sequence analysis for highly refined epidemiological analysis of P. aeruginosa.
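The distance-based tracing described in this abstract can be illustrated with a minimal sketch. The abstract gives only the pairwise distance range (0 to 4 informative wgSNPs); the single-linkage grouping, the function names and the toy SNP profiles below are assumptions for illustration, not the authors' procedure, which presumably also drew on epidemiological data.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of positions at which two equal-length SNP profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_chains(profiles, max_dist=4):
    """Group strains linked by pairwise distances <= max_dist (single linkage)."""
    parent = {name: name for name in profiles}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if snp_distance(profiles[a], profiles[b]) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for name in profiles:
        groups.setdefault(find(name), []).append(name)
    # Keep only groups of at least two strains: a single strain is not a chain.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical profiles: the alleles at 10 informative SNP positions per strain.
profiles = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAACC",  # 2 SNPs from s1
    "s3": "CCCCCCCCCC",
    "s4": "CCCCCCCCCA",  # 1 SNP from s3
    "s5": "GGGGGTTTTT",  # far from every other strain
}
print(candidate_chains(profiles))  # [['s1', 's2'], ['s3', 's4']]
```

Single linkage can join two distant strains through intermediates, so a stricter rule (for example, requiring every pair in a group to be within the threshold) may be preferable in practice.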
ABSTRACT
BACKGROUND: Pseudomonas aeruginosa is a ubiquitous environmental microorganism and also a common cause of infection. Its ability to survive in many different environments and persistently colonize humans is linked to its presence in biofilms formed on indwelling device surfaces. Biofilm promotes adhesion to, and survival on, surfaces and protects from desiccation and the actions of antibiotics and disinfectants. RESULTS: We examined the genetic basis for biofilm production on polystyrene at room (22 °C) and body temperature (37 °C) among 280 P. aeruginosa isolates. 193 isolates (69 %) produced more biofilm at 22 °C than at 37 °C. Using GWAS and pan-GWAS, we found a number of accessory genes significantly associated with greater biofilm production at 22 °C. Many of these are present on a 165 kb region containing genes for heavy metal resistance (arsenic, copper, mercury and cadmium), transcriptional regulators and methyltransferases. We also discovered multiple core genome SNPs in the A-type flagellin gene and Type II secretion system gene xpsD. Analysis of biofilm production of isolates of the MDR ST111 and ST235 lineages on stainless steel revealed several accessory genes associated with enhanced biofilm production. These include a putative translocase with homology to a Helicobacter pylori type IV secretion system protein, a TA system II toxin gene and the alginate biosynthesis gene algA, several transcriptional regulators and methyltransferases, as well as core SNPs in genes involved in quorum sensing and protein translocation. CONCLUSIONS: Using genetic association approaches we discovered a number of accessory genes and core-genome SNPs that were associated with enhanced early biofilm formation at 22 °C compared to 37 °C. These included a 165 kb genomic island containing multiple heavy metal resistance genes, transcriptional regulators and methyltransferases. We hypothesize that this genomic island may be associated with overall genotypes that are environmentally adapted to survive at lower temperatures. Further work, for example gene-knockout studies, is required to confirm the relevance of these genes and SNPs. GWAS and pan-GWAS approaches have great potential as a first step in examining the genetic basis of novel bacterial phenotypes.
Subject(s)
Biofilms, Multiple Bacterial Drug Resistance, Pseudomonas aeruginosa, Anti-Bacterial Agents/pharmacology, Genotype, Humans, Pseudomonas Infections, Pseudomonas aeruginosa/drug effects, Pseudomonas aeruginosa/genetics, Quorum Sensing
ABSTRACT
Clostridioides difficile is a cause of health care-associated infections. The epidemiological study of C. difficile infection (CDI) traditionally involves PCR ribotyping. However, ribotyping will be increasingly replaced by whole genome sequencing (WGS). This implies that WGS types need correlation with classical ribotypes (RTs) in order to perform retrospective clinical studies. Here, we selected genomes of hyper-virulent C. difficile strains of RT001, RT017, RT027, RT078, and RT106 to try and identify new discriminatory markers using in silico ribotyping PCR and De Bruijn graph-based Genome Wide Association Studies (DBGWAS). First, in silico ribotyping PCR was performed using reference primer sequences and 30 C. difficile genomes of the five different RTs identified above. Second, discriminatory genomic markers were sought with DBGWAS using a set of 160 independent C. difficile genomes (14 ribotypes). RT-specific genetic polymorphisms were annotated and validated for their specificity and sensitivity against a larger dataset of 2425 C. difficile genomes covering 132 different RTs. In silico PCR ribotyping was unsuccessful due to non-specific or missing theoretical RT PCR fragments. More successfully, DBGWAS discovered a total of 47 new markers (13 in RT017, 12 in RT078, 9 in RT106, 7 in RT027, and 6 in RT001) with minimum q-values of 0 to 7.40 × 10⁻⁵, indicating excellent marker selectivity. The specificity and sensitivity of individual markers ranged between 0.92 and 1.0 but increased to 1 by combining two markers, hence providing undisputed RT identification based on a single genome sequence. Markers were scattered throughout the C. difficile genome in intra- and intergenic regions. We propose here a set of new genomic polymorphisms that efficiently identify five hyper-virulent RTs utilizing WGS data only. Further studies need to show whether this initial proof-of-principle observation can be extended to all 600 existing RTs.
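As a worked illustration of the marker evaluation reported above, the following sketch computes sensitivity and specificity for single markers and for a pair of markers. The abstract does not state how two markers were combined; calling a genome positive only when both markers are present is an assumption here, and the data are invented toy values, not results from the study.

```python
def sensitivity_specificity(calls, truth):
    """Return (sensitivity, specificity) of marker calls against true ribotype labels."""
    tp = sum(c and t for c, t in zip(calls, truth))
    tn = sum((not c) and (not t) for c, t in zip(calls, truth))
    fn = sum((not c) and t for c, t in zip(calls, truth))
    fp = sum(c and (not t) for c, t in zip(calls, truth))
    return tp / (tp + fn), tn / (tn + fp)


def combine_and(calls_a, calls_b):
    """Assumed combination rule: positive only if both markers are present."""
    return [a and b for a, b in zip(calls_a, calls_b)]


# Hypothetical data: six genomes, the first three belong to the target ribotype.
truth = [True, True, True, False, False, False]
marker_a = [True, True, True, True, False, False]   # one false positive
marker_b = [True, True, True, False, True, False]   # a different false positive

print(sensitivity_specificity(marker_a, truth))
print(sensitivity_specificity(combine_and(marker_a, marker_b), truth))  # (1.0, 1.0)
```

In this toy case each marker alone has one false positive, and the two false positives fall on different genomes, so requiring both markers removes them without losing any true positive.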
ABSTRACT
BACKGROUND: Recent years have witnessed the development of several k-mer-based approaches aiming to predict phenotypic traits of bacteria on the basis of their whole-genome sequences. While often convincing in terms of predictive performance, the underlying models are in general not straightforward to interpret, the interplay between the actual genetic determinant and its translation as k-mers being generally hard to decipher. RESULTS: We propose a simple and computationally efficient strategy allowing one to cope with the high correlation inherent to k-mer-based representations in supervised machine learning models, leading to concise and easily interpretable signatures. We demonstrate the benefit of this approach on the task of predicting the antibiotic resistance profile of a Klebsiella pneumoniae strain from its genome, where our method leads to signatures defined as weighted linear combinations of genetic elements that can easily be identified as genuine antibiotic resistance determinants, with state-of-the-art predictive performance. CONCLUSIONS: By enhancing the interpretability of genomic k-mer-based antibiotic resistance prediction models, our approach improves their clinical utility and hence will facilitate their adoption in routine diagnostics by clinicians and microbiologists. While antibiotic resistance was the motivating application, the method is generic and can be transposed to any other bacterial trait. An R package implementing our method is available at https://gitlab.com/biomerieux-data-science/clustlasso.
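The correlation problem this abstract refers to can be seen in a small sketch: k-mers that cover the same genetic element share the same presence/absence pattern across genomes, so they carry redundant information. The code below groups only perfectly correlated k-mers, which is the simplest case; the function names and toy genomes are assumptions, and the published R package may proceed differently.

```python
from collections import defaultdict


def kmer_set(genome, k):
    """Set of k-mers in one genome sequence (forward strand only)."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def group_by_pattern(genomes, k):
    """Map each presence/absence pattern to the list of k-mers sharing it."""
    names = sorted(genomes)
    sets = {name: kmer_set(genomes[name], k) for name in names}
    all_kmers = set().union(*sets.values())
    groups = defaultdict(list)
    for kmer in sorted(all_kmers):
        # One boolean per genome, in the fixed order given by `names`.
        pattern = tuple(kmer in sets[name] for name in names)
        groups[pattern].append(kmer)
    return dict(groups)


# Hypothetical genomes: g1 and g2 share a variant that g3 lacks.
genomes = {"g1": "ACGTT", "g2": "ACGTT", "g3": "ACGAA"}
groups = group_by_pattern(genomes, k=3)
for pattern, kmers in sorted(groups.items()):
    print(pattern, kmers)
# Five distinct k-mers collapse into three patterns, i.e. three candidate features.
```

A model fitted on one feature per pattern has far fewer, non-redundant inputs, and each selected feature maps back to a readable set of k-mers.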
Subject(s)
Algorithms, Software, Microbial Drug Resistance, Genome, Genomics
ABSTRACT
Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly, it is also computationally efficient: experiments took one hour and a half on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis, and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.
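To make the compacted De Bruijn graph idea concrete, here is a toy sketch that builds a node-centric graph from the k-mers of two sequences and merges non-branching paths into single nodes (unitigs). It is a simplification under stated assumptions: forward strand only, tiny k, invented sequences and function names. It is not the DBGWAS implementation.

```python
def kmers(seq, k):
    """Return the set of k-mers of one sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compact_de_bruijn(kmer_set):
    """Merge non-branching paths of a node-centric De Bruijn graph into unitigs."""
    # Neighbouring k-mers overlap by k-1 bases.
    succ = {u: [u[1:] + c for c in "ACGT" if u[1:] + c in kmer_set] for u in kmer_set}
    pred = {u: [c + u[:-1] for c in "ACGT" if c + u[:-1] in kmer_set] for u in kmer_set}
    visited = set()

    def walk(start):
        # Extend to the right while the path neither branches nor merges.
        path, cur = start, start
        visited.add(start)
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in visited:
                break
            path += nxt[-1]
            visited.add(nxt)
            cur = nxt
        return path

    def starts_unitig(u):
        # A k-mer is interior only if its single predecessor has a single successor.
        return not (len(pred[u]) == 1 and len(succ[pred[u][0]]) == 1)

    unitigs = [walk(u) for u in sorted(kmer_set) if starts_unitig(u)]
    # Any k-mer still unvisited lies on an isolated cycle; open it arbitrarily.
    unitigs += [walk(u) for u in sorted(kmer_set) if u not in visited]
    return sorted(unitigs)


# Two hypothetical sequences differing by one SNP (T versus A at the fourth base).
genomes = ["ACGTTGCA", "ACGATGCA"]
all_kmers = set().union(*(kmers(g, 3) for g in genomes))
print(compact_de_bruijn(all_kmers))  # ['ACG', 'CGATG', 'CGTTG', 'TGCA']
```

The output shows the typical "bubble" a local polymorphism creates: a shared node on each side (`ACG`, `TGCA`) and two alternative nodes (`CGATG`, `CGTTG`), one per allele. Testing such nodes for association with a phenotype, then reading them in their graph neighbourhood, is the kind of interpretation the abstract describes.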
Subject(s)
Bacterial Genome, Genome-Wide Association Study/methods, Computer Graphics, Bacterial DNA/genetics, Genetic Databases, Bacterial Drug Resistance/genetics, Genetic Variation, Genome-Wide Association Study/statistics & numerical data, Interspersed Repetitive Sequences, Genetic Models, Mycobacterium tuberculosis/drug effects, Mycobacterium tuberculosis/genetics, Phenotype, Pseudomonas aeruginosa/drug effects, Pseudomonas aeruginosa/genetics, DNA Sequence Analysis, Software, Staphylococcus aureus/drug effects, Staphylococcus aureus/genetics
ABSTRACT
Genetic determinants of antibiotic resistance (AR) have been extensively investigated. High-throughput sequencing allows for the assessment of the relationship between genotype and phenotype. A panel of 672 Pseudomonas aeruginosa strains was analysed, including representatives of globally disseminated multidrug-resistant and extensively drug-resistant clones; genomes and multiple antibiograms were available. This panel was annotated for AR gene presence and polymorphism, defining a resistome in which integrons were included. Integrons were present in >70 distinct cassettes, with In5 being the most prevalent. Some cassettes closely associated with clonal complexes, whereas others spread across the phylogenetic diversity, highlighting the importance of horizontal transfer. A resistome-wide association study (RWAS) was performed for clinically relevant antibiotics by correlating the variability in minimum inhibitory concentration (MIC) values with resistome data. Resistome annotation identified 147 loci associated with AR. These loci consisted mainly of acquired genomic elements and intrinsic genes. The RWAS allowed for correct identification of resistance mechanisms for meropenem, amikacin, levofloxacin and cefepime, and added 46 novel mutations. Among these, 29 were variants of the oprD gene associated with variation in meropenem MIC. Using genomic and MIC data, phenotypic AR was successfully correlated with molecular determinants at the whole-genome sequence level.
Subject(s)
Anti-Bacterial Agents/pharmacology, Bacterial Drug Resistance, Bacterial Genes, Genotype, Pseudomonas aeruginosa/drug effects, Pseudomonas aeruginosa/genetics, Genetic Loci, Humans, Interspersed Repetitive Sequences, Microbial Sensitivity Tests, Pseudomonas Infections/microbiology, Pseudomonas aeruginosa/isolation & purification
ABSTRACT
INTRODUCTION: Antimicrobial susceptibility testing (AST) is key in modern clinical microbiology. With the pandemic emergence of (multi-)antibiotic resistance, methods to detect and quantify resistance of clinically important bacterial species are imperative. Historically, AST was mostly performed using methods relying on bacterial growth. Such methods may be time-consuming, and more rapid alternatives have been actively sought. Areas covered: Among the new AST methods there are many that focus on detection of causal resistance genes and/or gene mutations. The approaches most used are based on nucleic acid amplification and, more recently, high-throughput (next generation) sequencing of amplified targets and complete microbial genomes. The authors provide a review of PCR-mediated and genomic AST methods used for human and veterinary pathogens and show where these approaches work well or may become difficult to interpret. Expert commentary: Microbial genome sequencing will play an important role in the field of AST, but there remain issues to be resolved. These include the development of user-friendly data analysis, the reduction of sequencing duration and cost, and the comprehensiveness of the databases. In addition, clinical evaluation studies need to be performed involving real-life patients.
Subject(s)
Anti-Infective Agents, Drug Resistance/genetics, Genomics/methods, Nucleic Acid Amplification Techniques/methods, Humans
ABSTRACT
MOTIVATION: Alignment-based taxonomic binning for metagenome characterization proceeds in two steps: reads mapping against a reference database (RDB) and taxonomic assignment according to the best hits. Beyond the sequencing technology and the completeness of the RDB, selecting the optimal configuration of the workflow, in particular the mapper parameters and the best hit selection threshold, to get the highest binning performance remains quite empirical. RESULTS: We developed a statistical framework to perform such optimization at a minimal computational cost. Using an optimization experimental design and simulated datasets for three sequencing technologies, we built accurate prediction models for five performance indicators and then derived the parameter configuration providing the optimal performance. Whatever the mapper and the dataset, we observed that the optimal configuration yielded better performance than the default configuration and that the best hit selection threshold had a large impact on performance. Finally, on a reference dataset from the Human Microbiome Project, we confirmed that the optimized configuration increased the performance compared with the default configuration. AVAILABILITY AND IMPLEMENTATION: Not applicable. CONTACT: magali.dancette@biomerieux.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Metagenomics, Algorithms, Humans, Metagenome, Microbiota, Theoretical Models
ABSTRACT
BACKGROUND: Biological pathways are descriptive diagrams of biological processes widely used for functional analysis of differentially expressed genes or proteins. Primary data analysis, such as quality control, normalisation, and statistical analysis, is often performed in scripting languages like R, Perl, and Python. Subsequent pathway analysis is usually performed using dedicated external applications. Workflows involving manual use of multiple environments are time consuming and error prone. Therefore, tools are needed that enable pathway analysis directly within the same scripting languages used for primary data analyses. Existing tools have limited capability in terms of available pathway content, pathway editing and visualisation options, and export file formats. Consequently, making the full-fledged pathway analysis tool PathVisio available from various scripting languages will benefit researchers. RESULTS: We developed PathVisioRPC, an XMLRPC interface for the pathway analysis software PathVisio. PathVisioRPC enables creating and editing biological pathways, visualising data on pathways, performing pathway statistics, and exporting results in several image formats in multiple programming environments. We demonstrate PathVisioRPC functionalities using examples in Python. Subsequently, we analyse a publicly available NCBI GEO gene expression dataset studying tumour bearing mice treated with cyclophosphamide in R. The R scripts demonstrate how calls to existing R packages for data processing and calls to PathVisioRPC can directly work together. To further support R users, we have created RPathVisio simplifying the use of PathVisioRPC in this environment. We have also created a pathway module for the microarray data analysis portal ArrayAnalysis.org that calls the PathVisioRPC interface to perform pathway analysis. 
This module allows users to use PathVisio functionality online without having to download and install the software and exemplifies how the PathVisioRPC interface can be used by data analysis pipelines for functional analysis of processed genomics data. CONCLUSIONS: PathVisioRPC enables data visualisation and pathway analysis directly from within various analytical environments used for preliminary analyses. It supports the use of existing pathways from WikiPathways or pathways created using the RPC itself. It also enables automation of tasks performed using PathVisio, making it useful to PathVisio users performing repeated visualisation and analysis tasks. PathVisioRPC is freely available for academic and commercial use at http://projects.bigcat.unimaas.nl/pathvisiorpc.
Subject(s)
Tumor Biomarkers/genetics, Computer Graphics, Neoplastic Gene Expression Regulation/drug effects, Genomics/methods, Neoplasms/genetics, Signal Transduction/drug effects, Software, Animals, Automation, Cyclophosphamide, Gene Expression Profiling, Gene Regulatory Networks, Mice, Neoplasms/drug therapy, Workflow
ABSTRACT
Quality control (QC) is crucial for any scientific method producing data. Applying adequate QC introduces new challenges in the genomics field where large amounts of data are produced with complex technologies. For DNA microarrays, specific algorithms for QC and pre-processing including normalization have been developed by the scientific community, especially for expression chips of the Affymetrix platform. Many of these have been implemented in the statistical scripting language R and are available from the Bioconductor repository. However, application is hampered by lack of integrative tools that can be used by users of any experience level. To fill this gap, we developed a freely available tool for QC and pre-processing of Affymetrix gene expression results, extending, integrating and harmonizing functionality of Bioconductor packages. The tool can be easily accessed through a wizard-like web portal at http://www.arrayanalysis.org or downloaded for local use in R. The portal provides extensive documentation, including user guides, interpretation help with real output illustrations and detailed technical documentation. It assists newcomers to the field in performing state-of-the-art QC and pre-processing while offering data analysts an integral open-source package. Providing the scientific community with this easily accessible tool will allow improving data quality and reuse and adoption of standards.
Subject(s)
Gene Expression Profiling/standards, Oligonucleotide Array Sequence Analysis/standards, Software, Gene Expression Profiling/methods, Internet, Oligonucleotide Array Sequence Analysis/methods, Quality Control, User-Computer Interface
ABSTRACT
Human endogenous retroviruses (HERVs) are spread throughout the genome and their long terminal repeats (LTRs) constitute a wide collection of putative regulatory sequences. Phylogenetic similarities and the profusion of integration sites, two inherent characteristics of transposable elements, make it difficult to study individual locus expression in a large-scale approach, and historically apart from some placental and testis-regulated elements, it was generally accepted that HERVs are silent due to epigenetic control. Herein, we have introduced a generic method aiming to optimally characterize individual loci associated with 25-mer probes by minimizing cross-hybridization risks. We therefore set up a microarray dedicated to a collection of 5,573 HERVs that can reasonably be assigned to a unique genomic position. We obtained a first view of the HERV transcriptome by using a composite panel of 40 normal and 39 tumor samples. The experiment showed that almost one third of the HERV repertoire is indeed transcribed. The HERV transcriptome follows tropism rules, is sensitive to the state of differentiation and, unexpectedly, seems not to correlate with the age of the HERV families. The probeset definition within the U3 and U5 regions was used to assign a function to some LTRs (i.e. promoter or polyA) and revealed that (i) autonomous active LTRs are broadly subjected to operational determinism (ii) the cellular gene density is substantially higher in the surrounding environment of active LTRs compared to silent LTRs and (iii) the configuration of neighboring cellular genes differs between active and silent LTRs, showing an approximately 8 kb zone upstream of promoter LTRs characterized by a drastic reduction in sense cellular genes. These gathered observations are discussed in terms of virus/host adaptive strategies, and together with the methods and tools developed for this purpose, this work paves the way for further HERV transcriptome projects.
Subject(s)
Endogenous Retroviruses/genetics, Oligonucleotide Array Sequence Analysis, Transcriptome
ABSTRACT
BACKGROUND: The combination of chromatin immunoprecipitation with two-channel microarray technology enables genome-wide mapping of binding sites of DNA-interacting proteins (ChIP-on-chip) or sites with methylated CpG di-nucleotides (DNA methylation microarray). These powerful tools are the gateway to understanding gene transcription regulation. Since the goals of such studies, the sample preparation procedures, the microarray content and study design are all different from transcriptomics microarrays, the data pre-processing strategies traditionally applied to transcriptomics microarrays may not be appropriate. Particularly, the main challenge of the normalization of "regulation microarrays" is (i) to make the data of individual microarrays quantitatively comparable and (ii) to keep the signals of the enriched probes, representing DNA sequences from the precipitate, as distinguishable as possible from the signals of the un-enriched probes, representing DNA sequences largely absent from the precipitate. RESULTS: We compare several widely used normalization approaches (VSN, LOWESS, quantile, T-quantile, Tukey's biweight scaling, Peng's method) applied to a selection of regulation microarray datasets, ranging from DNA methylation to transcription factor binding and histone modification studies. Through comparison of the data distributions of control probes and gene promoter probes before and after normalization, and assessment of the power to identify known enriched genomic regions after normalization, we demonstrate that there are clear differences in performance between normalization procedures. CONCLUSION: T-quantile normalization applied separately on the channels and Tukey's biweight scaling outperform other methods in terms of the conservation of enriched and un-enriched signal separation, as well as in identification of genomic regions known to be enriched. T-quantile normalization is preferable as it additionally improves comparability between microarrays. 
In contrast, popular normalization approaches like quantile, LOWESS, Peng's method and VSN normalization alter the data distributions of regulation microarrays to such an extent that using these approaches will impact the reliability of the downstream analysis substantially.
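For readers unfamiliar with the quantile family of methods compared in this abstract, the sketch below shows plain quantile normalization, which forces every array to share the same value distribution. According to the abstract, the T-quantile variant applies the procedure separately on the channels, so that enriched and un-enriched signals are not pooled; the code does not implement that variant. Function names, tie handling (broken by position) and the toy intensities are assumptions.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one column per array)."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the values holding each rank across arrays.
    rank_means = [sum(sc[i] for sc in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        # Probe indices ordered by intensity; ties are broken by position.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three probes on two arrays.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(arrays))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Applying the function once to all columns corresponds to plain quantile normalization; applying it to one group of columns at a time (for example, all precipitate channels, then all reference channels) is one way to read "applied separately on the channels".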
Subject(s)
DNA Methylation, DNA/metabolism, Genome-Wide Association Study/methods, Oligonucleotide Array Sequence Analysis/standards, Binding Sites, Chromatin Immunoprecipitation, CpG Islands, Factual Databases, Genome-Wide Association Study/instrumentation, ROC Curve
ABSTRACT
Biological pathways are abstract and functional visual representations of existing biological knowledge. By mapping high-throughput data on these representations, changes and patterns in biological systems on the genetic, metabolic and protein level are instantly assessable. Many public domain repositories exist for storing biological pathways, each applying its own conventions and storage format. A pathway-based content review of these repositories reveals that none of them are comprehensive. To address this issue, we apply a general workflow to create curated biological pathways, in which we combine three content sources: public domain databases, literature and experts. In this workflow all content of a particular biological pathway is manually retrieved from biological pathway databases and literature, after which this content is compared, combined and subsequently curated by experts. From the curated content, new biological pathways can be created for a pathway analysis tool of choice and distributed among its user base. We applied this procedure to construct high-quality curated biological pathways involved in human fatty acid metabolism.