ABSTRACT
Genome-wide association study (GWAS) methods applied to bacterial genomes have shown promising results for genetic marker discovery or detailed assessment of marker effect. Recently, alignment-free methods based on k-mer composition have proven their ability to explore the accessory genome. However, they lead to redundant descriptions and results which are sometimes hard to interpret. Here we introduce DBGWAS, an extended k-mer-based GWAS method producing interpretable genetic variants associated with distinct phenotypes. Relying on compacted De Bruijn graphs (cDBG), our method gathers cDBG nodes, identified by the association model, into subgraphs defined from their neighbourhood in the initial cDBG. DBGWAS is alignment-free and only requires a set of contigs and phenotypes. In particular, it does not require prior annotation or reference genomes. It produces subgraphs representing phenotype-associated genetic variants such as local polymorphisms and mobile genetic elements (MGE). It offers a graphical framework which helps interpret GWAS results. Importantly, it is also computationally efficient: experiments took one and a half hours on average. We validated our method using antibiotic resistance phenotypes for three bacterial species. DBGWAS recovered known resistance determinants such as mutations in core genes in Mycobacterium tuberculosis, and genes acquired by horizontal transfer in Staphylococcus aureus and Pseudomonas aeruginosa, along with their MGE context. It also enabled us to formulate new hypotheses involving genetic variants not yet described in the antibiotic resistance literature. An open-source tool implementing DBGWAS is available at https://gitlab.com/leoisl/dbgwas.
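To illustrate what a compacted De Bruijn graph is, the following minimal Python sketch builds the k-mer graph of a set of contigs and merges non-branching paths into single nodes (unitigs). It is not the DBGWAS implementation: the function names are hypothetical, and the sketch assumes a forward-strand-only graph (reverse complements are ignored) and tiny inputs.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def compacted_dbg(contigs, k):
    """Return the unitigs of a forward-strand De Bruijn graph (illustrative only)."""
    nodes = {km for c in contigs for km in kmers(c, k)}
    # Successors/predecessors: k-mers overlapping by k-1 characters.
    succ = {u: [u[1:] + b for b in "ACGT" if u[1:] + b in nodes] for u in nodes}
    pred = {u: [b + u[:-1] for b in "ACGT" if b + u[:-1] in nodes] for u in nodes}

    def starts_unitig(u):
        # A node continues a unitig only if it has a single predecessor
        # whose single successor is this node.
        return not (len(pred[u]) == 1 and len(succ[pred[u][0]]) == 1)

    unitigs, seen = [], set()
    for u in sorted(nodes):
        if u in seen or not starts_unitig(u):
            continue
        path, cur = u, u
        seen.add(u)
        # Extend while the path does not branch.
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if len(pred[nxt]) != 1 or nxt in seen:
                break
            path += nxt[-1]
            seen.add(nxt)
            cur = nxt
        unitigs.append(path)

    # Nodes left over belong to isolated cycles; emit each cycle once.
    for u in sorted(nodes):
        if u in seen:
            continue
        path, cur = u, u
        seen.add(u)
        while len(succ[cur]) == 1 and succ[cur][0] not in seen:
            cur = succ[cur][0]
            path += cur[-1]
            seen.add(cur)
        unitigs.append(path)
    return unitigs


# Two toy contigs differing by one SNP form a "bubble" of four unitigs.
strain_a = "TTACGGATCCA"
strain_b = "TTACGCATCCA"
bubble = sorted(compacted_dbg([strain_a, strain_b], 4))
```

In this toy case the shared flanks and the two alleles each become one node, which is the kind of local polymorphism pattern the abstract describes as a subgraph.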
Subjects
Bacterial Genome , Genome-Wide Association Study/methods , Computer Graphics , Bacterial DNA/genetics , Genetic Databases , Bacterial Drug Resistance/genetics , Genetic Variation , Genome-Wide Association Study/statistics & numerical data , Interspersed Repetitive Sequences , Genetic Models , Mycobacterium tuberculosis/drug effects , Mycobacterium tuberculosis/genetics , Phenotype , Pseudomonas aeruginosa/drug effects , Pseudomonas aeruginosa/genetics , DNA Sequence Analysis , Software , Staphylococcus aureus/drug effects , Staphylococcus aureus/genetics
ABSTRACT
BACKGROUND: Several studies demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences, the prediction process usually amounting to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations, previously identified from a training panel of strains, within these genes. We address the problem from the supervised statistical learning perspective, not relying on prior information about such resistance factors. We rely on a k-mer based genotyping scheme and a logistic regression model, thereby combining several k-mers into a probabilistic model. To identify a small yet predictive set of k-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417-73, 2010), that consists in penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures. RESULTS: Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to state of the art. The models are extremely sparse, involving 1 to 8 k-mers per antibiotic, hence are remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies). CONCLUSION: Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships.
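Because the resulting models involve only a handful of k-mers, evaluating one on a new genome reduces to a few substring searches and a logistic transform. The sketch below shows this evaluation step only (not the stability selection procedure); the function names, the k-mer, and the coefficient values are hypothetical and chosen purely for illustration.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_present(kmer, contigs):
    """True if the k-mer or its reverse complement occurs in any contig."""
    rc = revcomp(kmer)
    return any(kmer in c or rc in c for c in contigs)


def predict_resistance(contigs, intercept, weights):
    """Probability of resistance under a sparse logistic model.

    weights maps each selected k-mer to its coefficient (hypothetical values).
    """
    score = intercept + sum(
        w for kmer, w in weights.items() if kmer_present(kmer, contigs)
    )
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical one-k-mer model and two toy genomes.
model_weights = {"ACGTTGCAAT": 4.0}
model_intercept = -2.0
genome_with = ["GG" + revcomp("ACGTTGCAAT") + "CC"]  # k-mer on the reverse strand
genome_without = ["GGGGCCCCAAAATTTT"]
p_with = predict_resistance(genome_with, model_intercept, model_weights)
p_without = predict_resistance(genome_without, model_intercept, model_weights)
```

The same presence test could in principle be run on raw reads instead of assembled contigs, at the cost of handling sequencing errors, which this sketch does not attempt.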
Subjects
Algorithms , Bacterial Drug Resistance/genetics , Bacterial Genome , Mycobacterium tuberculosis/genetics , Staphylococcus aureus/genetics , Whole Genome Sequencing , Anti-Bacterial Agents/pharmacology , Base Sequence , Genetic Databases , Bacterial Drug Resistance/drug effects , Logistic Models , Genetic Models , Mycobacterium tuberculosis/drug effects , ROC Curve , Reproducibility of Results , Staphylococcus aureus/drug effects
ABSTRACT
MOTIVATION: Alignment-based taxonomic binning for metagenome characterization proceeds in two steps: reads mapping against a reference database (RDB) and taxonomic assignment according to the best hits. Beyond the sequencing technology and the completeness of the RDB, selecting the optimal configuration of the workflow, in particular the mapper parameters and the best hit selection threshold, to get the highest binning performance remains quite empirical. RESULTS: We developed a statistical framework to perform such optimization at a minimal computational cost. Using an optimization experimental design and simulated datasets for three sequencing technologies, we built accurate prediction models for five performance indicators and then derived the parameter configuration providing the optimal performance. Whatever the mapper and the dataset, we observed that the optimal configuration yielded better performance than the default configuration and that the best hit selection threshold had a large impact on performance. Finally, on a reference dataset from the Human Microbiome Project, we confirmed that the optimized configuration increased the performance compared with the default configuration. AVAILABILITY AND IMPLEMENTATION: Not applicable. CONTACT: magali.dancette@biomerieux.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Metagenomics , Algorithms , Humans , Metagenome , Microbiota , Theoretical Models
ABSTRACT
MOTIVATION: Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions. RESULTS: We propose a new rank-flexible machine learning-based compositional approach for taxonomic assignment of metagenomics reads and show that it benefits from increasing the number of fragments sampled from reference genomes to tune its parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning the method involves training machine learning models on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting method is competitive in terms of accuracy with well-established alignment and composition-based tools for problems involving a small to moderate number of candidate species and for reasonable amounts of sequencing errors. We show, however, that machine learning-based compositional approaches are still limited in their ability to deal with problems involving a greater number of species and are more sensitive to sequencing errors.
We finally show that the new method outperforms the state of the art in its ability to classify reads from species of lineages absent from the reference database and confirm that compositional approaches achieve faster prediction times, with a gain of 2-17 times with respect to the BWA-MEM short read mapper, depending on the number of candidate species and the level of sequencing noise. AVAILABILITY AND IMPLEMENTATION: Data and code are available at http://cbio.ensmp.fr/largescalemetagenomics CONTACT: pierre.mahe@biomerieux.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
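The compositional representation referred to above can be sketched as a vector of k-mer counts, one coordinate per possible k-mer. With k = 12 there are 4^12 = 16,777,216 possible k-mers, which is consistent with a feature space on the order of 10^7 dimensions. The Python sketch below is only an illustration with a hypothetical function name; it uses a dense list, which is workable for small k, whereas a real pipeline at k = 12 would presumably use a sparse representation.

```python
from collections import Counter
from itertools import product


def kmer_profile(read, k):
    """Count vector of a read, indexed by k-mers in lexicographic order."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    vec = [0] * (4 ** k)
    for kmer, n in counts.items():
        if kmer in index:  # k-mers containing other symbols (e.g. N) are skipped
            vec[index[kmer]] = n
    return vec


# Toy read and k = 2, giving a 16-dimensional profile.
profile = kmer_profile("ACGTACGT", 2)
```

A linear classifier trained on such vectors assigns a read to the taxon with the highest score, so prediction cost does not depend on the size of the reference database in the way alignment does.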
Subjects
Machine Learning , Metagenomics , DNA Sequence Analysis , Algorithms , Metagenome , Software
ABSTRACT
BACKGROUND: Construction and validation of a prognostic model for survival data in the clinical domain is still an active field of research. Nevertheless, there is no consensus on how to develop routine prognostic tests based on a combination of RT-qPCR biomarkers and clinical or demographic variables. In particular, estimating model performance requires properly accounting for the RT-qPCR experimental design. RESULTS: We present a strategy to build, select, and validate a prognostic model for survival data based on a combination of RT-qPCR biomarkers and clinical or demographic data and we provide an illustration on a real clinical dataset. First, we compare two cross-validation schemes: a classical outcome-stratified cross-validation scheme and an alternative one that accounts for the RT-qPCR plate design, especially when samples are processed in batches. The latter is intended to limit the performance discrepancies, also called the validation surprise, between the training and the test sets. Second, strategies for model building (covariate selection, functional relationship modeling, and statistical model) as well as performance indicator estimation are presented. Since in practice several prognostic models can exhibit similar performance, complementary criteria for model selection are discussed: the stability of the selected variables, the model optimism, and the impact of the omitted variables on the model performance. CONCLUSION: On the training dataset, appropriate resampling methods are expected to prevent upward bias due to unaccounted technical and biological variability that may arise from the experimental and intrinsic design of the RT-qPCR assay. Moreover, the stability of the selected variables, the model optimism, and the impact of the omitted variables on the model performance are pivotal indicators for selecting the optimal model to be validated on the test dataset.
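One common way to make cross-validation respect a batch structure is to keep all samples processed on the same plate in the same fold, so that plate effects cannot leak between training and validation data. The abstract does not specify how its plate-aware scheme is constructed, so the Python sketch below is only an assumption about what such a scheme could look like; the function name and the greedy balancing rule are hypothetical, and outcome stratification is not handled.

```python
from collections import defaultdict


def plate_folds(plates, n_folds):
    """Assign each sample a fold index so that a plate never spans two folds.

    plates: list of plate labels, one per sample.
    """
    sizes = defaultdict(int)
    for p in plates:
        sizes[p] += 1
    fold_of_plate = {}
    load = [0] * n_folds
    # Greedy balancing: largest plates first, each to the lightest fold.
    for p in sorted(sizes, key=lambda q: (-sizes[q], str(q))):
        f = load.index(min(load))
        fold_of_plate[p] = f
        load[f] += sizes[p]
    return [fold_of_plate[p] for p in plates]


# Toy design: 8 samples on 4 plates, split into 2 folds.
toy_plates = ["P1", "P1", "P2", "P2", "P2", "P3", "P4", "P4"]
toy_folds = plate_folds(toy_plates, 2)
```

Comparing performance estimates from this grouped split with those from an ordinary stratified split gives a rough idea of how much of the apparent performance depends on plate-specific variation.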
Subjects
Gene Expression , Proportional Hazards Models , Real-Time Polymerase Chain Reaction , Reverse Transcriptase Polymerase Chain Reaction , Biomarkers , Humans , Prognosis , Septic Shock/mortality
ABSTRACT
Preterm newborns are at high risk of neurological injury. In this population, we investigated the link between neurological complications and sleep architecture. At term-corrected gestational age, we studied retrospectively the polysomnography of 45 preterm infants born at < 28 weeks or weighing < 1 kg. These infants were followed up by a neuropaediatrician (median age at last follow-up 50.4 months). Two groups of children were constituted: a group without neurological disorder and a second group with at least one of the following: cerebral palsy, language or mental retardation, visual or hearing disability or attention disorder. A Multiple Indicators and Multiple Causes model assessed the relationship between the neurological outcome and two sleep components: spontaneous arousability [number of awakenings and movements per hour of quiet sleep (QS) and active sleep] and QS characteristics (median duration of QS cycles and percentage of QS over total sleep time). Twenty-six infants had an impaired neurological outcome. There were no statistical differences between the two groups regarding clinical characteristics. Compared to preterm neonates with normal neurological outcome, those with impaired outcomes had a lower spontaneous arousability; i.e. 0.7 (0.51) times fewer awakenings and movements per hour of QS and 0.9 (0.81) times fewer per hour of active sleep than infants with normal outcomes (P = 0.05). The differences in QS characteristics did not reach statistical significance. These findings suggest that, in preterm infants, perinatal neurological injuries could be associated with an abnormal sleep architecture characterized by altered spontaneous arousability.
Subjects
Arousal , Premature Infant Diseases/physiopathology , Premature Infant , Nervous System Diseases/physiopathology , Sleep Arousal Disorders/physiopathology , Sleep , Wakefulness , Adult , Arousal/physiology , Cerebral Palsy/complications , Cerebral Palsy/congenital , Cerebral Palsy/physiopathology , Preschool Child , Female , Gestational Age , Humans , Infant , Newborn Infant , Premature Infant/physiology , Premature Infant Diseases/etiology , Intellectual Disability/complications , Intellectual Disability/physiopathology , Male , Maternal Age , Movement , Nervous System Diseases/complications , Nervous System Diseases/congenital , Polysomnography , Retrospective Studies , Sensation Disorders/complications , Sensation Disorders/congenital , Sensation Disorders/physiopathology , Sleep/physiology , Sleep Arousal Disorders/complications , Sleep Arousal Disorders/congenital , Wakefulness/physiology
ABSTRACT
Whole-genome sequencing has become an essential tool for real-time genomic surveillance of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) worldwide. The handling of raw next-generation sequencing (NGS) data is a major challenge for sequencing laboratories. We developed an easy-to-use web-based application (EPISEQ SARS-CoV-2) to analyse SARS-CoV-2 NGS data generated on common sequencing platforms using a variety of commercially available reagents. This application performs in one click a quality check, a reference-based genome assembly, and the analysis of the generated consensus sequence as to coverage of the reference genome, mutation screening and variant identification according to the up-to-date Nextstrain clade and Pango lineage. In this study, we validated the EPISEQ SARS-CoV-2 pipeline against a reference pipeline and compared the performance of NGS data generated by different sequencing protocols using EPISEQ SARS-CoV-2. We showed a strong agreement in SARS-CoV-2 clade and lineage identification (>99%) and in spike mutation detection (>99%) between EPISEQ SARS-CoV-2 and the reference pipeline. The comparison of several sequencing approaches using EPISEQ SARS-CoV-2 revealed 100% concordance in clade and lineage classification. It also uncovered reagent-related sequencing issues with a potential impact on SARS-CoV-2 mutation reporting. Altogether, EPISEQ SARS-CoV-2 allows an easy, rapid and reliable analysis of raw NGS data to support the sequencing efforts of laboratories with limited bioinformatics capacity and those willing to accelerate genomic surveillance of SARS-CoV-2.
Subjects
COVID-19 , SARS-CoV-2 , COVID-19/diagnosis , Viral Genome , High-Throughput Nucleotide Sequencing/methods , Humans , Mutation , SARS-CoV-2/genetics
ABSTRACT
OBJECTIVE: To illustrate the advantages of structural equation models in biomedical research using the complex example of cystic fibrosis. MATERIAL AND METHODS: 595 blood samples from 312 patients were tested. The model studied the effects of age, BMI and clinical condition on seven major latent variables: pulmonary function, lipid oxidation status, vitamins A and E, glutathione, carotenoids, two essential fatty acids and arachidonic acid. RESULTS: The model confirmed previous associations: positive (fatty acids, arachidonic acid, carotenoids and vitamins with pulmonary function and with lipid oxidation) and negative (glutathione with pulmonary function). It also verified the decrease in fatty acids during bronchial exacerbation and the increase in fatty acids and lipid oxidation after antibiotic treatment. Above all, the model revealed new positive associations between lipid oxidation and carotenoid levels and between lipid oxidation and vitamin A and E levels. CONCLUSIONS: Structural equations dealt easily with the great number of outcome variables of the example. They deserve a central place in biomedical issues involving too many correlated factors to help physicians and statisticians conceive biological models that best represent reality.
Subjects
Cystic Fibrosis/metabolism , Cystic Fibrosis/physiopathology , Fatty Acids/metabolism , Biological Models , Adolescent , Adult , Antioxidants/metabolism , Arachidonic Acid/metabolism , Child , Preschool Child , Cystic Fibrosis/pathology , Humans , Infant , Middle Aged , Oxidation-Reduction , Respiratory Function Tests , Solubility , Vitamins/metabolism
ABSTRACT
Recent years saw a growing interest in predicting antibiotic resistance from whole-genome sequencing data, with promising results obtained for Staphylococcus aureus and Mycobacterium tuberculosis. In this work, we gathered 6,574 sequencing read datasets of M. tuberculosis public genomes with associated antibiotic resistance profiles for both first and second-line antibiotics. We performed a systematic evaluation of TBProfiler and Mykrobe, two widely recognized software tools for predicting resistance in M. tuberculosis. The size of the dataset allowed us to obtain confident estimates of their overall predictive performance, to assess precisely the individual predictive power of the markers they rely on, and to study in addition how these tools behave across the major M. tuberculosis lineages. While this study confirmed the overall good performance of these tools, it revealed that an important fraction of the catalog of mutations they embed is of limited predictive power. It also revealed that these tools offer different sensitivity/specificity trade-offs, which is mainly due to the different sets of mutations they embed but also to their underlying genotyping pipelines. More importantly, it showed that their level of predictive performance varies greatly across lineages for some antibiotics, therefore suggesting that the predictions made by these tools should be deemed more or less confident depending on the lineage inferred and the predictive performance of the marker(s) actually detected. Finally, we evaluated the relevance of machine learning approaches operating from the set of markers detected by these tools and show that they present an attractive alternative strategy, reaching better performance for several drugs while significantly reducing the number of candidate mutations to consider.
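The per-lineage comparison described above comes down to computing sensitivity and specificity separately within each group of isolates. A minimal Python sketch follows; the function name, the record layout, and the toy records (including the lineage labels) are invented for illustration and are not data from the study.

```python
from collections import defaultdict


def sens_spec_by_group(records):
    """Sensitivity and specificity per group.

    records: iterable of (group, predicted_resistant, truly_resistant).
    Returns {group: (sensitivity, specificity)}; None when undefined.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for group, predicted, truth in records:
        if truth:
            key = "tp" if predicted else "fn"
        else:
            key = "fp" if predicted else "tn"
        counts[group][key] += 1
    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        sensitivity = c["tp"] / positives if positives else None
        specificity = c["tn"] / negatives if negatives else None
        result[group] = (sensitivity, specificity)
    return result


# Invented toy records: (lineage label, prediction, phenotype).
toy_records = [
    ("lineage A", True, True),
    ("lineage A", False, True),
    ("lineage A", False, False),
    ("lineage B", True, True),
    ("lineage B", True, False),
    ("lineage B", False, False),
]
toy_metrics = sens_spec_by_group(toy_records)
```

The same function can be applied with a detected marker as the grouping key to examine the predictive power of individual mutations.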
ABSTRACT
Promotion time models have been recently adapted to the context of infectious diseases to take into account discrete and multiple exposures. However, the Poisson distribution of the number of pathogens transmitted at each exposure was a very strong assumption and did not allow for inter-individual heterogeneity. The Bernoulli, the negative binomial, and the compound Poisson distributions were proposed as alternatives to the Poisson distribution for the promotion time model with time-changing exposure. All were derived within the frailty model framework. All these distributions have a point mass at zero to take into account non-infected people. The Bernoulli distribution, which corresponds to the two-component cure rate model, was extended to multiple exposures. Contrary to the negative binomial and the compound Poisson distributions, the Bernoulli distribution did not make it possible to connect the number of pathogens transmitted to the delay between transmission and infection detection. Moreover, the negative binomial and the compound Poisson distributions make it possible to account for inter-individual heterogeneity. The delay to surgical site infection was an example of single exposure. The probability of infection was very low; thus, the estimates of the effect of selected risk factors on that probability obtained with the Bernoulli and Poisson distributions were very close. The delay to nosocomial urinary tract infection was a multiple exposure example. The probabilities of pathogen transmission during catheter placement and catheter presence were estimated. Inter-individual heterogeneity was very high, and the fit was better with the compound Poisson and the negative binomial distributions. The proposed models also proved to be mechanistic. The negative binomial and the compound Poisson distributions were useful alternatives to account for inter-individual heterogeneity.
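For a single exposure, the standard promotion time construction gives the population survival function as the probability generating function of the number N of transmitted pathogens evaluated at 1 - F(t), where F is the distribution function of the delay associated with one pathogen. The Python sketch below writes this out for three of the distributions named above. It is an illustration under stated assumptions rather than the authors' model: the multiple-exposure and compound Poisson cases are omitted, and the negative binomial is parameterized by its mean `theta` and a dispersion `eta`, which is one common convention.

```python
import math


def pop_survival_poisson(F, theta):
    """N ~ Poisson(theta): S(t) = exp(-theta * F(t))."""
    return math.exp(-theta * F)


def pop_survival_bernoulli(F, p):
    """N ~ Bernoulli(p): S(t) = 1 - p * F(t), the two-component cure model."""
    return 1.0 - p * F


def pop_survival_negbin(F, theta, eta):
    """N ~ negative binomial, mean theta, dispersion eta > 0:
    S(t) = (1 + eta * theta * F(t)) ** (-1 / eta)."""
    return (1.0 + eta * theta * F) ** (-1.0 / eta)


# Non-infected ("cured") fraction is the survival at F = 1, i.e. P(N = 0).
cure_poisson = pop_survival_poisson(1.0, 2.0)
cure_bernoulli = pop_survival_bernoulli(1.0, 0.3)
cure_negbin = pop_survival_negbin(1.0, 2.0, 1.0)
```

With the same mean, the negative binomial places more mass at zero than the Poisson, and it tends to the Poisson form as the dispersion goes to zero, which is how the extra parameter captures inter-individual heterogeneity.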
Subjects
Communicable Diseases/epidemiology , Biological Models , Statistical Models , Cross Infection/epidemiology , Humans , Kaplan-Meier Estimate , Surgical Wound Infection/epidemiology , Urinary Tract Infections/epidemiology
ABSTRACT
For the last century, the in vitro diagnostic process in microbiology has mainly relied on the growth of bacteria on the surface of a solid agar medium. Nevertheless, few studies have focused on the dynamics of microcolony growth on the agar surface before 8 to 10 h of incubation. In this article, chromatic confocal microscopy has been applied to characterize the early development of a bacterial colony. This technology relies on a differential focusing depth of the white light. It allows one to fully measure the three-dimensional shape of microcolonies more quickly than classical confocal microscopy but with the same spatial resolution. With the device placed in an incubator, the method was able to individually track colonies growing on an agar plate, and to follow the evolution of their surface or volume. Using an appropriate statistical modeling framework, for a given microorganism, the doubling time has been estimated for each individual colony, as well as its variability between colonies, both within and between agar plates. A proof of concept conducted on four bacterial strains of four distinct species demonstrated the feasibility and the interest of the approach. It showed in particular that doubling times derived from early three-dimensional measurements on microcolonies differed from classical measurements in micro-dilutions based on optical diffusion. Such a precise characterization of the three-dimensional shape of microcolonies in their late-lag to early-exponential phase could be beneficial in terms of in vitro diagnostics. Indeed, real-time monitoring of the biomass available in a colony could make it possible to run well-established microbial identification workflows like, for instance, MALDI-TOF mass spectrometry, as soon as a sufficient quantity of material is available, thereby reducing the time needed to provide a diagnostic.
Moreover, as done for pre-identification of macro-colonies, morphological indicators such as three-dimensional growth profiles derived from microcolonies could be used to perform a first pre-identification step, but in a shorter time.
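Under the assumption of exponential growth, a colony's doubling time can be estimated from tracked volumes as the inverse of the slope of log2(volume) against time. The Python sketch below shows this per-colony calculation with ordinary least squares; the function name and the toy measurements are hypothetical, and the abstract's statistical framework, which also estimates variability within and between plates, is not reproduced here.

```python
import math


def doubling_time(times, volumes):
    """Doubling time from a least-squares fit of log2(volume) versus time.

    The result is in the same unit as `times`. Assumes exponential growth.
    """
    log_volumes = [math.log2(v) for v in volumes]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(log_volumes) / n
    slope = sum(
        (t - mean_t) * (y - mean_y) for t, y in zip(times, log_volumes)
    ) / sum((t - mean_t) ** 2 for t in times)
    return 1.0 / slope


# Toy colony whose volume doubles every 30 minutes (arbitrary volume units).
toy_times_min = [0, 30, 60, 90]
toy_volumes = [1.0, 2.0, 4.0, 8.0]
toy_doubling = doubling_time(toy_times_min, toy_volumes)
```

Applying the function to each tracked colony yields a set of doubling times whose spread gives a first, informal view of between-colony variability.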
Subjects
Bacteria/growth & development , Culture Media/chemistry , Computer-Assisted Image Processing/methods , Three-Dimensional Imaging , Agar , Bacteriological Techniques/methods
ABSTRACT
The Yakovlev parametric cure rate model was used to study the age at which HIV-1 infection can be detected in non-breast-fed infants and the independent predictors of transmission. Blood samples from 145 infants who were HIV-1-negative at birth and born to HIV-1-positive untreated mothers were tested until 15 months. The age at actual detection and at potential detection assuming daily tests was studied. The former was described using the Yakovlev model, and the cumulative probabilities of detection along time were calculated. Comparison of observed and predicted delays to positivity revealed the best representation of the age at which HIV becomes detectable among 8 Yakovlev models. Cumulative positive tests were as follows: 3 at 7 days, 10 at 15 days and 1 month, 17 at 3 months, and 18 at 15 months. The log-logistic model was the best-fitting one. The probability of onset of HIV-1 detectability was maximal at day 4. The mean and median ages at which HIV becomes detectable were 12 and 6 days, respectively. Maternal CD4+ cell count was associated with the risk of contamination (hazard ratio of low vs. high count 2.44; 95% confidence interval: 1.15-6.67). The model may explain HIV viremia dynamics and define the optimal antiretroviral regimens before randomized trial confirmation.