Results 1 - 17 of 17
1.
mSystems ; 8(4): e0005823, 2023 08 31.
Article in English | MEDLINE | ID: mdl-37314210

ABSTRACT

Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%-90% of all E. coli genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including "hypothetical proteins," was accurately predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1s = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876-0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.

IMPORTANCE: Having the ability to predict the protein-encoding gene content of a genome is important for assessing genome quality, binning genomes from shotgun metagenomic assemblies, and assessing risk due to the presence of antimicrobial resistance and other virulence genes. In this study, we built a set of binary classifiers for predicting the presence or absence of variable genes occurring in 10%-90% of all publicly available E. coli genomes. Overall, the results show that a large portion of the E. coli variable gene content can be predicted with high accuracy, including genes with functions relating to horizontal gene transfer. This study offers a strategy for predicting gene content using limited input sequence data.
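
The per-genome macro F1 score reported in this abstract can be illustrated with a minimal sketch. It assumes one common definition (the unweighted mean of the F1 scores for the "present" and "absent" classes over all protein families of a single genome); the abstract does not spell out the exact computation, and the function names and toy data are hypothetical.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores (assumed definition)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical genome: true and predicted presence (1) or absence (0)
# of six variable protein families.
truth = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]
score = macro_f1(truth, predicted)
```

Averaging this score over all genomes in a test set would give a per-genome average of the kind quoted above.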


Subjects
Anti-Infective Agents, Escherichia coli Proteins, Escherichia coli/genetics, Bacterial Genome/genetics, Plasmids, Escherichia coli Proteins/genetics
2.
Nucleic Acids Res ; 51(D1): D678-D689, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36350631

ABSTRACT

The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Center (BRC) program to assist researchers with analyzing the growing body of genome sequence and other omics-related data. In this report, we describe the merger of the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD) and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs to form the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) https://www.bv-brc.org/. The combined BV-BRC leverages the functionality of the bacterial and viral resources to provide a unified data model, enhanced web-based visualization and analysis tools, bioinformatics services, and a powerful suite of command line tools that benefit the bacterial and viral research communities.


Subjects
Genomics, Software, Viruses, Humans, Bacteria/genetics, Computational Biology, Genetic Databases, Human Influenza, Viruses/genetics
3.
PLoS One ; 17(12): e0279280, 2022.
Article in English | MEDLINE | ID: mdl-36525447

ABSTRACT

Plasmids are important genetic elements that facilitate horizontal gene transfer between bacteria and contribute to the spread of virulence and antimicrobial resistance. Most bacterial genome sequences in the public archives exist in draft form with many contigs, making it difficult to determine if a contig is of chromosomal or plasmid origin. Using a training set of contigs comprising 10,584 chromosomes and 10,654 plasmids from the PATRIC database, we evaluated several machine learning models including random forest, logistic regression, XGBoost, and a neural network for their ability to classify chromosomal and plasmid sequences using nucleotide k-mers as features. Based on the methods tested, a neural network model that used nucleotide 6-mers as features and was trained on randomly selected chromosomal and plasmid subsequences 5 kb in length achieved the best performance, outperforming existing out-of-the-box methods, with an average accuracy of 89.38% ± 2.16% over a 10-fold cross validation. The model accuracy can be improved to 92.08% by using a voting strategy when classifying holdout sequences. In both plasmids and chromosomes, subsequences encoding functions involved in horizontal gene transfer (including hypothetical proteins, transporters, phage, mobile elements, and CRISPR elements) were most likely to be misclassified by the model. This study provides a straightforward approach for identifying plasmid-encoding sequences in short read assemblies without the need for sequence alignment-based tools.
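
The feature extraction and voting idea described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract mentions randomly selected 5 kb subsequences and an unspecified voting strategy, whereas this sketch uses non-overlapping windows and a simple majority vote. `classify_window` is a hypothetical stand-in for any trained model.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=6):
    """Fixed-order vector of k-mer frequencies over the A/C/G/T alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in vocab)  # k-mers with other symbols are ignored
    return [counts[m] / total if total else 0.0 for m in vocab]


def windows(seq, size=5000):
    """Non-overlapping windows (a simplification of random subsampling)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def classify_contig(seq, classify_window, size=5000):
    """Classify each window, then vote. `classify_window` is hypothetical."""
    votes = [classify_window(kmer_frequencies(w)) for w in windows(seq, size)]
    return majority_vote(votes) if votes else None
```

Voting over several windows lets a long contig be labeled even when some of its windows, such as those carrying mobile elements, are individually misclassified.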


Subjects
Bacterial Chromosomes, Bacterial Genome, Plasmids/genetics, Bacterial Chromosomes/genetics, Bacteria/genetics, Machine Learning, Nucleotides
4.
Microbiol Spectr ; 10(6): e0264122, 2022 12 21.
Article in English | MEDLINE | ID: mdl-36377945

ABSTRACT

High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging due to genome sequence plasticity and the difficulty in assembling complete sequences. In this study, we developed a classifier, using random forest, to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. In order to adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000 nucleotide fragments and further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds with minimum performance of 0.939 area under the receiver operating characteristic curve (AUC). This classifier, implemented as SourceFinder, has been made available as an online web service to help the community with predicting the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).

IMPORTANCE: Extra-chromosomal genes encoding antimicrobial resistance, metal resistance, and virulence provide selective advantages for bacterial survival under stress conditions and pose serious threats to human and animal health. These accessory genes can impact the composition of microbiomes by providing selective advantages to their hosts. Accurately identifying extra-chromosomal elements in genome sequence data is critical for understanding gene dissemination trajectories and taking preventative measures. Therefore, in this study, we developed a random forest classifier for identifying the source of bacterial chromosomal, plasmid, and bacteriophage sequences.


Subjects
Bacteriophages, Bacterial Genome, Humans, Bacteriophages/genetics, Plasmids/genetics, Bacterial Chromosomes/genetics, Machine Learning
5.
mSystems ; 7(2): e0118021, 2022 04 26.
Article in English | MEDLINE | ID: mdl-35382558

ABSTRACT

Plasmids play a major role in facilitating the spread of antimicrobial resistance between bacteria. Understanding the host range and dissemination trajectories of plasmids is critical for surveillance and prevention of antimicrobial resistance. Identification of plasmid host ranges could be improved using automated pattern detection methods compared to homology-based methods due to the diversity and genetic plasticity of plasmids. In this study, we developed a method for predicting the host range of plasmids using machine learning, specifically random forests. We trained the models with 8,519 plasmids from 359 different bacterial species per taxonomic level; the models achieved Matthews correlation coefficients of 0.662 and 0.867 at the species and order levels, respectively. Our results suggest that despite the diverse nature and genetic plasticity of plasmids, our random forest model can accurately distinguish between plasmid hosts. This tool is available online through the Center for Genomic Epidemiology (https://cge.cbs.dtu.dk/services/PlasmidHostFinder/).

IMPORTANCE: Antimicrobial resistance is a global health threat to humans and animals, causing high mortality and morbidity while effectively ending decades of success in fighting against bacterial infections. Plasmids confer extra genetic capabilities to the host organisms through accessory genes that can encode antimicrobial resistance and virulence. In addition to lateral inheritance, plasmids can be transferred horizontally between bacterial taxa. Therefore, detection of the host range of plasmids is crucial for understanding and predicting the dissemination trajectories of extrachromosomal genes and bacterial evolution as well as taking effective countermeasures against antimicrobial resistance.
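
For readers unfamiliar with the Matthews correlation coefficient quoted in this abstract, a minimal sketch of its binary form follows. The study itself may well have used a multi-class or per-taxon variant, which the abstract does not specify, so this is only the textbook two-class formula with hypothetical counts.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts; 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for a single host taxon (one-vs-rest).
mcc = matthews_corrcoef(tp=6, tn=3, fp=1, fn=2)
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it more informative than accuracy when host taxa are unevenly represented.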


Subjects
Anti-Infective Agents, Random Forest, Animals, Humans, Plasmids, Bacteria/genetics, Genomics
6.
Am J Pathol ; 192(2): 320-331, 2022 02.
Article in English | MEDLINE | ID: mdl-34774517

ABSTRACT

Genetic variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) have repeatedly altered the course of the coronavirus disease 2019 (COVID-19) pandemic. Delta variants are now the focus of intense international attention because they are causing widespread COVID-19 globally and are associated with vaccine breakthrough cases. We sequenced 16,965 SARS-CoV-2 genomes from samples acquired March 15, 2021, through September 20, 2021, in the Houston Methodist hospital system. This sample represents 91% of all Methodist system COVID-19 patients during the study period. Delta variants increased rapidly from late April onward to cause 99.9% of all COVID-19 cases and spread throughout the Houston metroplex. Compared with all other variants combined, Delta caused a significantly higher rate of vaccine breakthrough cases (23.7% for Delta compared with 6.6% for all other variants combined). Importantly, significantly fewer fully vaccinated individuals required hospitalization. Vaccine breakthrough cases caused by Delta had a low median PCR cycle threshold value (a proxy for high virus load). This value was similar to the median cycle threshold value for unvaccinated patients with COVID-19 caused by Delta variants, suggesting that fully vaccinated individuals can transmit SARS-CoV-2 to others. Patients infected with Alpha and Delta variants had several significant differences. The integrated analysis indicates that vaccines used in the United States are highly effective in decreasing severe COVID-19, hospitalizations, and deaths.


Subjects
COVID-19/virology, SARS-CoV-2, Adult, COVID-19 Vaccines, Female, Humans, Male, Middle Aged, Texas
7.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34379107

ABSTRACT

Antimicrobial resistance (AMR) is a major global health threat that affects millions of people each year. Funding agencies worldwide and the global research community have expended considerable capital and effort tracking the evolution and spread of AMR by isolating and sequencing bacterial strains and performing antimicrobial susceptibility testing (AST). For the last several years, we have been capturing these efforts by curating data from the literature and data resources and building a set of assembled bacterial genome sequences that are paired with laboratory-derived AST data. This collection currently contains AST data for over 67 000 genomes encompassing approximately 40 genera and over 100 species. In this paper, we describe the characteristics of this collection, highlighting areas where sampling is comparatively deep or shallow, and showing areas where attention is needed from the research community to improve sampling and tracking efforts. In addition to supporting efforts to track the evolution and spread of AMR, the collection serves as a useful starting point for building machine learning models for predicting AMR phenotypes. We demonstrate this by describing two machine learning models that are built from the entire dataset to show where the predictive power is comparatively high or low. This AMR metadata collection is freely available and maintained on the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) FTP site ftp://ftp.bvbrc.org/RELEASE_NOTES/PATRIC_genomes_AMR.txt.


Subjects
Computational Biology/methods, Genetic Databases, Microbial Drug Resistance, Genomics/methods, Microbial Sensitivity Tests, Artificial Intelligence, Bacteria/drug effects, Bacteria/genetics, Bacterial Genome, Humans, Laboratories, Machine Learning, Phenotype
8.
Am J Pathol ; 191(10): 1754-1773, 2021 10.
Article in English | MEDLINE | ID: mdl-34303698

ABSTRACT

Certain genetic variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of substantial concern because they may be more transmissible or detrimentally alter the pandemic course and disease features in individual patients. SARS-CoV-2 genome sequences from 12,476 patients in the Houston Methodist health care system diagnosed from January 1 through May 31, 2021 are reported here. Prevalence of the B.1.1.7 (Alpha) variant increased rapidly and caused 63% to 90% of new cases in the latter half of May. Eleven B.1.1.7 genomes had an E484K replacement in spike protein, a change also identified in other SARS-CoV-2 lineages. Compared with non-B.1.1.7-infected patients, individuals with B.1.1.7 had a significantly lower cycle threshold (a proxy for higher virus load) and significantly higher hospitalization rate. Other variants [eg, B.1.429 and B.1.427 (Epsilon), P.1 (Gamma), P.2 (Zeta), and R.1] also increased rapidly, although the magnitude was less than that in B.1.1.7. Twenty-two patients infected with B.1.617.1 (Kappa) or B.1.617.2 (Delta) variants had a high rate of hospitalization. Breakthrough cases (n = 207) in fully vaccinated patients were caused by a heterogeneous array of virus genotypes, including many not currently designated variants of interest or concern. In the aggregate, this study delineates the trajectory of SARS-CoV-2 variants circulating in a major metropolitan area, documents B.1.1.7 as the major cause of new cases in Houston, TX, and heralds the arrival of B.1.617 variants in the metroplex.


Subjects
COVID-19/epidemiology, Viral Genome, Mutation, SARS-CoV-2/genetics, COVID-19/genetics, COVID-19/transmission, COVID-19/virology, Female, Humans, Male, Middle Aged, SARS-CoV-2/isolation & purification, Texas/epidemiology
10.
mBio ; 11(6)2020 10 30.
Article in English | MEDLINE | ID: mdl-33127862

ABSTRACT

We sequenced the genomes of 5,085 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains causing two coronavirus disease 2019 (COVID-19) disease waves in metropolitan Houston, TX, an ethnically diverse region with 7 million residents. The genomes were from viruses recovered in the earliest recognized phase of the pandemic in Houston and from viruses recovered in an ongoing massive second wave of infections. The virus was originally introduced into Houston many times independently. Virtually all strains in the second wave have a Gly614 amino acid replacement in the spike protein, a polymorphism that has been linked to increased transmission and infectivity. Patients infected with the Gly614 variant strains had significantly higher virus loads in the nasopharynx on initial diagnosis. We found little evidence of a significant relationship between virus genotype and altered virulence, stressing the linkage between disease severity, underlying medical conditions, and host genetics. Some regions of the spike protein (the primary target of global vaccine efforts) are replete with amino acid replacements, perhaps indicating the action of selection. We exploited the genomic data to generate defined single amino acid replacements in the receptor binding domain of spike protein that, importantly, produced decreased recognition by the neutralizing monoclonal antibody CR3022. Our report represents the first analysis of the molecular architecture of SARS-CoV-2 in two infection waves in a major metropolitan region. The findings will help us to understand the origin, composition, and trajectory of future infection waves and the potential effect of the host immune response and therapeutic maneuvers on SARS-CoV-2 evolution.

IMPORTANCE: There is concern about second and subsequent waves of COVID-19 caused by the SARS-CoV-2 coronavirus occurring in communities globally that had an initial disease wave. Metropolitan Houston, TX, with a population of 7 million, is experiencing a massive second disease wave that began in late May 2020. To understand SARS-CoV-2 molecular population genomic architecture and evolution and the relationship between virus genotypes and patient features, we sequenced the genomes of 5,085 SARS-CoV-2 strains from these two waves. Our report provides the first molecular characterization of SARS-CoV-2 strains causing two distinct COVID-19 disease waves.


Subjects
Betacoronavirus/genetics, Coronavirus Infections/virology, Viral Pneumonia/virology, Coronavirus Spike Glycoprotein/chemistry, Coronavirus Spike Glycoprotein/genetics, Amino Acid Sequence, Amino Acid Substitution, Neutralizing Antibodies/immunology, Base Sequence, Betacoronavirus/immunology, COVID-19, COVID-19 Testing, Clinical Laboratory Techniques, Coronavirus Infections/diagnosis, Coronavirus Infections/epidemiology, Coronavirus Infections/immunology, Coronavirus RNA-Dependent RNA Polymerase, Viral Genome, Genotype, Humans, Machine Learning, Molecular Models, Molecular Diagnostic Techniques, Pandemics, Phylogeny, Viral Pneumonia/epidemiology, Viral Pneumonia/immunology, RNA-Dependent RNA Polymerase/chemistry, RNA-Dependent RNA Polymerase/genetics, SARS-CoV-2, Protein Sequence Analysis, Coronavirus Spike Glycoprotein/immunology, Texas/epidemiology, Viral Nonstructural Proteins/chemistry, Viral Nonstructural Proteins/genetics
11.
PLoS Comput Biol ; 16(10): e1008319, 2020 10.
Article in English | MEDLINE | ID: mdl-33075053

ABSTRACT

A growing number of studies are using machine learning models to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. Although these studies are showing promise, the models are typically trained using features derived from comprehensive sets of AMR genes or whole genome sequences and may not be suitable for use when genomes are incomplete. In this study, we explore the possibility of predicting AMR phenotypes using incomplete genome sequence data. Models were built from small sets of randomly-selected core genes after removing the AMR genes. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80-0.89 with as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11-0.23 and major error rates ranging from 0.10-0.20. Models built from core genes have predictive power in cases where the primary AMR mechanisms result from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes, we show that F1 scores and error rates are stable and have little variance between replicates. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes.
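
The very major and major error rates quoted in this abstract can be made concrete with a short sketch. It assumes the conventional AST definitions, which the abstract does not restate: a very major error is a resistant isolate predicted susceptible (rate taken over resistant isolates), and a major error is a susceptible isolate predicted resistant (rate taken over susceptible isolates). The labels and data are hypothetical.

```python
def error_rates(truth, predicted):
    """Return (very_major_error_rate, major_error_rate) for 'R'/'S' labels."""
    for_resistant = [p for t, p in zip(truth, predicted) if t == "R"]
    for_susceptible = [p for t, p in zip(truth, predicted) if t == "S"]
    # Very major error: truly resistant, predicted susceptible.
    vme = (sum(p == "S" for p in for_resistant) / len(for_resistant)
           if for_resistant else 0.0)
    # Major error: truly susceptible, predicted resistant.
    me = (sum(p == "R" for p in for_susceptible) / len(for_susceptible)
          if for_susceptible else 0.0)
    return vme, me


# Hypothetical laboratory phenotypes and model predictions.
lab = ["R", "R", "R", "R", "S", "S", "S", "S", "S"]
model = ["R", "S", "R", "R", "S", "R", "S", "S", "S"]
vme_rate, me_rate = error_rates(lab, model)
```

Reporting the two rates separately matters clinically because calling a resistant isolate susceptible is usually the more harmful mistake.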


Subjects
Conserved Sequence/genetics, Bacterial Drug Resistance/genetics, Bacterial Genome/genetics, Genomics/methods, Machine Learning, Algorithms, Anti-Bacterial Agents/pharmacology, Bacteria/drug effects, Bacteria/genetics, Phenotype
12.
medRxiv ; 2020 Sep 29.
Article in English | MEDLINE | ID: mdl-33024977

ABSTRACT

We sequenced the genomes of 5,085 SARS-CoV-2 strains causing two COVID-19 disease waves in metropolitan Houston, Texas, an ethnically diverse region with seven million residents. The genomes were from viruses recovered in the earliest recognized phase of the pandemic in Houston, and an ongoing massive second wave of infections. The virus was originally introduced into Houston many times independently. Virtually all strains in the second wave have a Gly614 amino acid replacement in the spike protein, a polymorphism that has been linked to increased transmission and infectivity. Patients infected with the Gly614 variant strains had significantly higher virus loads in the nasopharynx on initial diagnosis. We found little evidence of a significant relationship between virus genotypes and altered virulence, stressing the linkage between disease severity, underlying medical conditions, and host genetics. Some regions of the spike protein - the primary target of global vaccine efforts - are replete with amino acid replacements, perhaps indicating the action of selection. We exploited the genomic data to generate defined single amino acid replacements in the receptor binding domain of spike protein that, importantly, produced decreased recognition by the neutralizing monoclonal antibody CR3022. Our study is the first analysis of the molecular architecture of SARS-CoV-2 in two infection waves in a major metropolitan region. The findings will help us to understand the origin, composition, and trajectory of future infection waves, and the potential effect of the host immune response and therapeutic maneuvers on SARS-CoV-2 evolution.

13.
mBio ; 11(4)2020 08 25.
Article in English | MEDLINE | ID: mdl-32843552

ABSTRACT

Variation in the genome of Pseudomonas aeruginosa, an important pathogen, can have dramatic impacts on the bacterium's ability to cause disease. We therefore asked whether it was possible to predict the virulence of P. aeruginosa isolates based on their genomic content. We applied a machine learning approach to a genetically and phenotypically diverse collection of 115 clinical P. aeruginosa isolates using genomic information and corresponding virulence phenotypes in a mouse model of bacteremia. We defined the accessory genome of these isolates through the presence or absence of accessory genomic elements (AGEs), sequences present in some strains but not others. Machine learning models trained using AGEs were predictive of virulence, with a mean nested cross-validation accuracy of 75% using the random forest algorithm. However, individual AGEs did not have a large influence on the algorithm's performance, suggesting instead that virulence predictions are derived from a diffuse genomic signature. These results were validated with an independent test set of 25 P. aeruginosa isolates whose virulence was predicted with 72% accuracy. Machine learning models trained using core genome single-nucleotide variants and whole-genome k-mers also predicted virulence. Our findings are a proof of concept for the use of bacterial genomes to predict pathogenicity in P. aeruginosa and highlight the potential of this approach for predicting patient outcomes.

IMPORTANCE: Pseudomonas aeruginosa is a clinically important Gram-negative opportunistic pathogen. P. aeruginosa shows a large degree of genomic heterogeneity both through variation in sequences found throughout the species (core genome) and through the presence or absence of sequences in different isolates (accessory genome). P. aeruginosa isolates also differ markedly in their ability to cause disease. In this study, we used machine learning to predict the virulence level of P. aeruginosa isolates in a mouse bacteremia model based on genomic content. We show that both the accessory and core genomes are predictive of virulence. This study provides a machine learning framework to investigate relationships between bacterial genomes and complex phenotypes such as virulence.


Subjects
Bacterial Genome, Machine Learning, Pseudomonas aeruginosa/genetics, Pseudomonas aeruginosa/pathogenicity, Virulence, Algorithms, Animals, Bacteremia/microbiology, Female, Genomics, Mice, Inbred BALB C Mice, Phenotype, Proof of Concept Study, Pseudomonas Infections/microbiology, Virulence/genetics
14.
Nucleic Acids Res ; 48(D1): D606-D612, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31667520

ABSTRACT

The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org). PATRIC supports bioinformatic analyses of all bacteria with a special emphasis on pathogens, offering a rich comparative analysis environment that provides users with access to over 250 000 uniformly annotated and publicly available genomes with curated metadata. PATRIC offers web-based visualization and comparative analysis tools, a private workspace in which users can analyze their own data in the context of the public collections, services that streamline complex bioinformatic workflows and command-line tools for bulk data analysis. Over the past several years, as genomic and other omics-related experiments have become more cost-effective and widespread, we have observed considerable growth in the usage of and demand for easy-to-use, publicly available bioinformatic tools and services. Here we report the recent updates to the PATRIC resource, including new web-based comparative analysis tools, eight new services and the release of a command-line interface to access, query and analyze data.


Subjects
Bacteria/genetics, Computational Biology/methods, Genetic Databases, Algorithms, Animals, Caenorhabditis elegans/genetics, Chickens/genetics, Drosophila melanogaster/genetics, Host-Pathogen Interactions/genetics, Humans, Internet, Macaca mulatta/genetics, Metagenomics, Mice, National Institute of Allergy and Infectious Diseases (U.S.), Phenotype, Phylogeny, Rats, Swine/genetics, United States, Zebrafish/genetics
15.
J Clin Microbiol ; 57(2)2019 02.
Article in English | MEDLINE | ID: mdl-30333126

ABSTRACT

Nontyphoidal Salmonella species are the leading bacterial cause of foodborne disease in the United States. Whole-genome sequences and paired antimicrobial susceptibility data are available for Salmonella strains because of surveillance efforts from public health agencies. In this study, a collection of 5,278 nontyphoidal Salmonella genomes, collected over 15 years in the United States, was used to generate extreme gradient boosting (XGBoost)-based machine learning models for predicting MICs for 15 antibiotics. The MIC prediction models had an overall average accuracy of 95% within ±1 2-fold dilution step (confidence interval, 95% to 95%), an average very major error rate of 2.7% (confidence interval, 2.4% to 3.0%), and an average major error rate of 0.1% (confidence interval, 0.1% to 0.2%). The model predicted MICs with no a priori information about the underlying gene content or resistance phenotypes of the strains. By selecting diverse genomes for the training sets, we show that highly accurate MIC prediction models can be generated with fewer than 500 genomes. We also show that our approach for predicting MICs is stable over time, despite annual fluctuations in antimicrobial resistance gene content in the sampled genomes. Finally, using feature selection, we explore the important genomic regions identified by the models for predicting MICs. To date, this is one of the largest MIC modeling studies to be published. Our strategy for developing whole-genome sequence-based models for surveillance and clinical diagnostics can be readily applied to other important human pathogens.


Subjects
Bacterial Drug Resistance, Genotyping Techniques/methods, Machine Learning, Microbial Sensitivity Tests/methods, Salmonella Infections/microbiology, Salmonella/drug effects, Salmonella/genetics, Foodborne Diseases/microbiology, Bacterial Genome, Humans, Salmonella/isolation & purification, United States
16.
Sci Rep ; 8(1): 421, 2018 01 11.
Article in English | MEDLINE | ID: mdl-29323230

ABSTRACT

Antimicrobial resistant infections are a serious public health threat worldwide. Whole genome sequencing approaches to rapidly identify pathogens and predict antibiotic resistance phenotypes are becoming more feasible and may offer a way to reduce clinical test turnaround times compared to conventional culture-based methods, and in turn, improve patient outcomes. In this study, we use whole genome sequence data from 1668 clinical isolates of Klebsiella pneumoniae to develop an XGBoost-based machine learning model that accurately predicts minimum inhibitory concentrations (MICs) for 20 antibiotics. The overall accuracy of the model, within ±1 two-fold dilution factor, is 92%. Individual accuracies are ≥90% for 15/20 antibiotics. We show that the MICs predicted by the model correlate with known antimicrobial resistance genes. Importantly, the genome-wide approach described in this study offers a way to predict MICs for isolates without knowledge of the underlying gene content. This study shows that machine learning can be used to build a complete in silico MIC prediction panel for K. pneumoniae and provides a framework for building MIC prediction models for other pathogenic bacteria.
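
The "within ±1 two-fold dilution" accuracy criterion used in this abstract can be sketched as below. The sketch assumes MICs are compared on a log2 scale and that a prediction counts as correct when it differs from the laboratory value by at most one doubling or halving; the function names and MIC values are hypothetical.

```python
import math


def within_one_dilution(true_mic, predicted_mic, tol=1e-9):
    """True if the prediction is at most one two-fold dilution step away."""
    return abs(math.log2(predicted_mic) - math.log2(true_mic)) <= 1.0 + tol


def dilution_accuracy(pairs):
    """Fraction of (true_mic, predicted_mic) pairs within one dilution step."""
    hits = sum(within_one_dilution(t, p) for t, p in pairs)
    return hits / len(pairs)


# Hypothetical (laboratory MIC, predicted MIC) pairs in micrograms per mL.
mic_pairs = [(4, 8), (4, 4), (0.5, 2), (16, 8)]
accuracy = dilution_accuracy(mic_pairs)
```

The one-step tolerance mirrors the accepted variability of broth microdilution testing, where repeat measurements of the same isolate commonly differ by a single dilution.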


Subjects
Anti-Bacterial Agents/pharmacology, Klebsiella Infections/microbiology, Klebsiella pneumoniae/genetics, Whole Genome Sequencing/methods, Computer Simulation, Bacterial DNA/genetics, Multiple Bacterial Drug Resistance, Humans, Klebsiella pneumoniae/drug effects, Machine Learning, Microbial Sensitivity Tests, Theoretical Models
17.
Toxins (Basel) ; 7(10): 4035-53, 2015 Oct 09.
Article in English | MEDLINE | ID: mdl-26473921

ABSTRACT

Horizontal gene transfer (HGT) is a fast-track mechanism that allows genetically unrelated organisms to exchange genes for rapid environmental adaptation. We developed a new phyletic distribution-based software, HGT-Finder, which implements a novel bioinformatics algorithm to calculate a horizontal transfer index and a probability value for each query gene. Applying this new tool to the Aspergillus fumigatus, Aspergillus flavus, and Aspergillus nidulans genomes, we found 273, 542, and 715 transferred genes (HTGs), respectively. HTGs have shorter length, higher guanine-cytosine (GC) content, and relaxed selection pressure. Metabolic process and secondary metabolism functions are significantly enriched in HTGs. Gene clustering analysis showed that 61%, 41% and 74% of HTGs in the three genomes form physically linked gene clusters (HTGCs). Overlapping manually curated, secondary metabolite gene clusters (SMGCs) with HTGCs found that 9 of the 33 A. fumigatus SMGCs and 31 of the 65 A. nidulans SMGCs share genes with HTGCs, and that HTGs are significantly enriched in SMGCs. Our genome-wide analysis thus presented very strong evidence to support the hypothesis that HGT has played a very critical role in the evolution of SMGCs. The program is freely available at http://cys.bios.niu.edu/HGTFinder/HGTFinder.tar.gz.


Subjects
Aspergillus/genetics, Computational Biology/methods, Horizontal Gene Transfer, Fungal Genome, Software, Biological Adaptation/genetics, Algorithms, Molecular Evolution, Fungal Genes, Multigene Family, Phylogeny, Species Specificity