Results 1 - 20 of 92
1.
G3 (Bethesda) ; 2024 Jul 12.
Article in English | MEDLINE | ID: mdl-38996058

ABSTRACT

The genetic effective size (Ne) is arguably one of the most important characteristics of a population as it impacts the rate of loss of genetic diversity. Methods that estimate Ne are important in population and conservation genetic studies as they quantify the risk of a population being inbred or lacking genetic diversity. Yet there are very few methods that can estimate Ne from data from a single population and without extensive information about the genetics of the population, such as a linkage map or a reference genome of the species of interest. We present ONeSAMP 3.0, an algorithm for estimating Ne from single nucleotide polymorphism (SNP) data collected from a single population sample using Approximate Bayesian Computation and local linear regression. We demonstrate the utility of this approach using simulated Wright-Fisher populations, and empirical data from five endangered Channel Island fox (Urocyon littoralis) populations to evaluate the performance of ONeSAMP 3.0 compared to a commonly used Ne estimator. Our results show that ONeSAMP 3.0 is broadly applicable to natural populations and is flexible enough that future versions could easily include summary statistics appropriate for a suite of biological and sampling conditions. ONeSAMP 3.0 is publicly available under the GNU license at https://github.com/AaronHong1024/ONeSAMP_3.

2.
bioRxiv ; 2024 May 30.
Article in English | MEDLINE | ID: mdl-38854039

ABSTRACT

Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed the document array profiles, which use O(rd) words of space, where r is the number of maximal equal-letter runs in the Burrows-Wheeler transform and d is the number of distinct genomes. The linear dependence on d is limiting, since real taxonomies can easily contain 10,000s of leaves or more. We propose a method called cliff compression that reduces this size by a large factor, over 250x when indexing the SILVA 16S rRNA gene database. This method uses Θ(r log d) words of space in expectation under a random model we propose here. We implemented these ideas in an open source tool called Cliffy that performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11-18%. Clade abundances are also more accurately predicted by Cliffy compared to Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries. 2012 ACM Subject Classification: Applied computing → Computational genomics.
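
To make the parameter r in the abstract above concrete, the following minimal Python sketch builds a Burrows-Wheeler transform by sorting rotations and counts its maximal equal-letter runs. It is not taken from Cliffy: the function names and the `$` end-of-text sentinel are assumptions made for illustration, and sorting rotations is only practical for tiny inputs.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Assumes '$' does not occur in `text` and sorts before every other
    character (true for ASCII letters).
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal letters in s (the quantity r)."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

On repetitive collections r tends to grow much more slowly than the text length, which is why space bounds stated in terms of r are attractive.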

3.
Bioinformatics ; 40(Supplement_1): i39-i47, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940175

ABSTRACT

MOTIVATION: The World Health Organization estimates that there were over 10 million cases of tuberculosis (TB) worldwide in 2019, resulting in over 1.4 million deaths, with a worrisome increasing trend yearly. The disease is caused by Mycobacterium tuberculosis (MTB) through airborne transmission. Treatment of TB is estimated to be 85% successful; however, this drops to 57% if MTB exhibits multiple antimicrobial resistance (AMR), for which fewer treatment options are available. RESULTS: We develop a robust machine-learning classifier using both linear and nonlinear models (i.e. LASSO logistic regression (LR) and random forests (RF)) to predict the phenotypic resistance of Mycobacterium tuberculosis (MTB) for a broad range of antibiotic drugs. We use data from the CRyPTIC consortium to train our classifier, which consists of whole genome sequencing and antibiotic susceptibility testing (AST) phenotypic data for 13 different antibiotics. To train our model, we assemble the sequence data into genomic contigs, identify all unique 31-mers in the set of contigs, and build a feature matrix M, where M[i, j] is equal to the number of times the ith 31-mer occurs in the jth genome. Due to the size of this feature matrix (over 350 million unique 31-mers), we build and use a sparse matrix representation. Our method, which we refer to as MTB++, leverages compact data structures and iterative methods to allow for the screening of all the 31-mers in the development of both LASSO LR and RF. MTB++ is able to achieve high discrimination (F-1 >80%) for the first-line antibiotics. Moreover, MTB++ had the highest F-1 score in all but three classes and was the most comprehensive since it had an F-1 score >75% in all but four (rare) antibiotic drugs. We use our feature selection to contextualize the 31-mers that are used for the prediction of phenotypic resistance, leading to some insights about sequence similarity to genes in MEGARes. Lastly, we give an estimate of the amount of data that is needed in order to provide accurate predictions. AVAILABILITY: The models and source code are publicly available on GitHub at https://github.com/M-Serajian/MTB-Pipeline.
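
The feature matrix described above, where M[i, j] counts occurrences of the ith k-mer in the jth genome, can be sketched in a few lines of Python. This is an illustrative toy, not the MTB++ implementation: the function names and the dictionary-of-coordinates sparse layout are assumptions, and a real pipeline would likely also handle reverse complements and use a compact k-mer counter.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every length-k substring of `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_sparse_matrix(genomes: list[str], k: int = 31):
    """Return (kmer_index, entries) for the k-mer count matrix.

    kmer_index maps each distinct k-mer to its row number i, and
    entries[(i, j)] holds the count of k-mer i in genome j. Zero
    counts are simply not stored, which keeps the matrix sparse.
    """
    kmer_index: dict[str, int] = {}
    entries: dict[tuple[int, int], int] = {}
    for j, genome in enumerate(genomes):
        for kmer, count in kmer_counts(genome, k).items():
            # Assign the next free row to a k-mer seen for the first time.
            i = kmer_index.setdefault(kmer, len(kmer_index))
            entries[(i, j)] = count
    return kmer_index, entries


if __name__ == "__main__":
    index, matrix = build_sparse_matrix(["ACGTACG", "CGTT"], k=3)
    print(len(index), "distinct k-mers;", len(matrix), "non-zero entries")
```

Storing only non-zero coordinates matters at the scale the abstract mentions, since most k-mers are absent from most genomes.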


Subject(s)
Machine Learning , Mycobacterium tuberculosis , Mycobacterium tuberculosis/genetics , Mycobacterium tuberculosis/drug effects , Drug Resistance, Bacterial/genetics , Microbial Sensitivity Tests , Anti-Bacterial Agents/pharmacology , Whole Genome Sequencing/methods , Genome, Bacterial , Humans
4.
J Hazard Mater ; 473: 134694, 2024 Jul 15.
Article in English | MEDLINE | ID: mdl-38788585

ABSTRACT

Wildlife is known to serve as carriers and sources of antimicrobial resistance (AMR). Due to their unrestricted movements and behaviors, they can spread antimicrobial resistant bacteria among livestock, humans, and the environment, thereby accelerating the dissemination of AMR. Extended-spectrum β-lactamase (ESBL)-producing Enterobacteriaceae are one of the major concerns threatening human and animal health, yet transmission mechanisms at the wildlife-livestock interface are not well understood. Here, we investigated the mechanisms of ESBL-producing bacteria spreading across various hosts, including cattle, feral swine, and coyotes in the same habitat range, as well as from environmental samples over a two-year period. We report a notable prevalence and clonal dissemination of ESBL-producing E. coli in feral swine and coyotes, suggesting their persistence and adaptation within wildlife hosts. In addition, in silico studies showed that horizontal gene transfer, mediated by conjugative plasmids and insertion sequence elements, may play a key role in spreading the ESBL genes among these bacteria. Furthermore, the shared gut resistome of cattle and feral swine suggests the dissemination of antibiotic resistance genes at the wildlife-livestock interface. Taken together, our results suggest that feral swine may serve as a reservoir of ESBL-producing E. coli.


Subject(s)
Animals, Wild , Disease Reservoirs , Escherichia coli , beta-Lactamases , Animals , Escherichia coli/genetics , Escherichia coli/drug effects , Escherichia coli/enzymology , beta-Lactamases/genetics , beta-Lactamases/metabolism , Animals, Wild/microbiology , Swine , Disease Reservoirs/microbiology , Cattle , Gene Transfer, Horizontal , Livestock/microbiology , Drug Resistance, Bacterial , Escherichia coli Infections/microbiology , Escherichia coli Infections/veterinary
5.
bioRxiv ; 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38559026

ABSTRACT

Portable genomic sequencers such as Oxford Nanopore's MinION enable real-time applications in both clinical and environmental health, e.g., detection of bacterial outbreaks. However, there is a bottleneck in the downstream analytics when bioinformatics pipelines are unavailable, e.g., when cloud processing is unreachable due to absence of Internet connection, or only low-end computing devices can be carried on site. For instance, metagenomics classifiers usually require a large amount of memory or specific operating systems/libraries. In this work, we present a platform-friendly software for portable metagenomic analysis of Nanopore data, the Oligomer-based Classifier of Taxonomic Operational and Pan-genome Units via Singletons (OCTOPUS). OCTOPUS is written in Java, reimplements several features of the popular Kraken2 and KrakenUniq software, with original components for improving metagenomics classification on incomplete/sampled reference databases (e.g., selection of bacteria of public health priority), making it ideal for running on smartphones or tablets. We indexed both OCTOPUS and Kraken2 on a bacterial database with ~4,000 reference genomes, then simulated a positive (bacterial genomes from the same species, but different genomes) and two negative (viral, mammalian) Nanopore test sets. On the bacterial test set OCTOPUS yielded sensitivity and precision comparable to Kraken2 (94.4% and 99.8% versus 94.5% and 99.1%, respectively). On non-bacterial sequences (mammals and viral), OCTOPUS dramatically decreased (4- to 16-fold) the false positive rate when compared to Kraken2 (2.1% and 0.7% versus 8.2% and 11.2%, respectively). We also developed customized databases including viruses, and the World Health Organization's set of bacteria of concern for drug resistance, tested with real Nanopore data on an Android smartphone. OCTOPUS is publicly available at https://github.com/DataIntellSystLab/OCTOPUS and https://github.com/Ruiz-HCI-Lab/OctopusMobile.

6.
Algorithms Mol Biol ; 19(1): 15, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38600518

ABSTRACT

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm.
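
The "one random access per character" cost mentioned above comes from backward search, which narrows a suffix-array interval once per pattern character. The sketch below is a textbook character-based FM-index count in Python, not the PFP-FM code: the names are assumptions, the suffix array is built naively, and the occurrence table is stored uncompressed, so it only suits small examples.

```python
def build_fm_index(text: str):
    """Build a tiny FM-index; assumes '$' is absent from `text`."""
    s = text + "$"
    # Naive suffix array construction, acceptable only for short inputs.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in suffix_array)
    alphabet = sorted(set(s))
    # first[c]: number of characters in s that are strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += s.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, len(s)


def count(pattern: str, fm_index) -> int:
    """Count occurrences of `pattern` using backward search."""
    first, occ, n = fm_index
    lo, hi = 0, n  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # One occ lookup per interval end for every pattern character.
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    fm = build_fm_index("banana")
    print(count("ana", fm))
```

A word-based index performs the same interval narrowing over phrases rather than single characters, which is where the reduction in random accesses would come from.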

8.
Res Sq ; 2023 Oct 30.
Article in English | MEDLINE | ID: mdl-37961504

ABSTRACT

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing (which takes parameters that let us tune the average length of the phrases) instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory. The source code for PFP-FM is available at https://github.com/marco-oliva/afm.

9.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37688560

ABSTRACT

MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as µ-PBWT, and evaluate it on datasets of 1000 Genomes Project and UK Biobank data. Our experiments demonstrate that the µ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, µ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. µ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
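
As a rough illustration of the structure being run-length encoded above, the following Python sketch computes the PBWT positional prefix arrays with a stable partition per site, in O(hw) time, and counts the runs in each permuted column. It is a simplified sketch and not the µ-PBWT code: the function names are assumptions, and it handles only biallelic panels encoded as 0/1.

```python
def pbwt_prefix_arrays(haplotypes: list[list[int]]) -> list[list[int]]:
    """Positional prefix arrays for a biallelic (0/1) haplotype panel.

    arrays[k] lists haplotype indices sorted by their reversed prefix
    ending just before site k; arrays[0] is the input order.
    """
    h = len(haplotypes)
    w = len(haplotypes[0]) if h else 0
    order = list(range(h))
    arrays = [order[:]]
    for k in range(w):
        # Stable partition: haplotypes with allele 0 first, then allele 1.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order[:])
    return arrays


def column_runs(haplotypes: list[list[int]]) -> list[int]:
    """Number of equal-allele runs in each PBWT column."""
    arrays = pbwt_prefix_arrays(haplotypes)
    width = len(arrays) - 1
    runs = []
    for k in range(width):
        # Column k of the PBWT is site k read in the order arrays[k].
        column = [haplotypes[i][k] for i in arrays[k]]
        runs.append(sum(1 for j, a in enumerate(column)
                        if j == 0 or a != column[j - 1]))
    return runs


if __name__ == "__main__":
    panel = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
    print(pbwt_prefix_arrays(panel))
    print(column_runs(panel))
```

When neighbouring haplotypes in the sorted order share recent ancestry, the permuted columns tend to contain few runs, which is the property a run-length encoded PBWT would exploit.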


Subject(s)
Biological Specimen Banks , Haplotypes , Whole Genome Sequencing , United Kingdom
10.
Genome Biol ; 24(1): 122, 2023 05 18.
Article in English | MEDLINE | ID: mdl-37202771

ABSTRACT

Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
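
The abstract above credits minimizers for the smaller index. As a generic illustration of the idea, the Python sketch below selects, in every window of w consecutive k-mers, the lexicographically smallest one. This is the plain textbook scheme, not SPUMONI 2's own minimizer digest: the function name, the lexicographic ordering and the leftmost tie-break are assumptions.

```python
def minimizers(sequence: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return the (position, k-mer) pairs chosen as window minimizers.

    Each window spans w consecutive k-mers; its minimizer is the
    lexicographically smallest k-mer, ties broken by leftmost position.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add(best)
    # Adjacent windows often share a minimizer, so few positions survive.
    return sorted((i, kmers[i]) for i in selected)


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=3))
```

Because overlapping windows frequently pick the same k-mer, only a fraction of positions is kept, which is the general reason minimizer sampling shrinks an index.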


Subject(s)
Algorithms , Genomics , Metagenomics , Databases, Factual , Sequence Analysis, DNA
11.
Genome Res ; 33(7): 1069-1077, 2023 07.
Article in English | MEDLINE | ID: mdl-37258301

ABSTRACT

Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.


Subject(s)
Algorithms , Bacteria , Sequence Analysis , Bacteria/genetics
12.
Front Microbiol ; 14: 1060891, 2023.
Article in English | MEDLINE | ID: mdl-36960290

ABSTRACT

Characterization of antibiotic resistance genes (ARGs) from high-throughput sequencing data of metagenomics and cultured bacterial samples is a challenging task, with the need to account for both computational (e.g., string algorithms) and biological (e.g., gene transfers, rearrangements) aspects. Curated ARG databases exist together with assorted ARG classification approaches (e.g., database alignment, machine learning). Besides ARGs that naturally occur in bacterial strains or are acquired through mobile elements, there are chromosomal genes that can render a bacterium resistant to antibiotics through point mutations, i.e., ARG variants (ARGVs). While ARG repositories also collect ARGVs, there are only a few tools that are able to identify ARGVs from metagenomics and high throughput sequencing data, with a number of limitations (e.g., pre-assembly, a posteriori verification of mutations, or specification of species). In this work we present the k-mer, i.e., strings of fixed length k, ARGV analyzer - KARGVA - an open-source, multi-platform tool that provides: (i) an ad hoc, large ARGV database derived from multiple sources; (ii) input capability for various types of high-throughput sequencing data; (iii) a three-way, hash-based, k-mer search setup to process data efficiently, linking k-mers to ARGVs, k-mers to point mutations, and ARGVs to k-mers, respectively; (iv) a statistical filter on sequence classification to reduce type I and II errors. On semi-synthetic data, KARGVA provides very high accuracy even in the presence of high sequencing errors or mutations (99.2 and 86.6% accuracy within 1 and 5% base change rates, respectively), and genome rearrangements (98.2% accuracy), with robust performance on ad hoc false positive sets. On data from the worldwide MetaSUB consortium, comprising 3,700+ metagenomics experiments, KARGVA identifies more ARGVs than Resistance Gene Identifier (4.8x) and PointFinder (6.8x), yet all predictions are below the expected false positive estimates. The prevalence of ARGVs is correlated to ARGs but ecological characteristics do not explain ARGV variance well. KARGVA is publicly available at https://github.com/DataIntellSystLab/KARGVA under MIT license.

13.
bioRxiv ; 2023 Jan 20.
Article in English | MEDLINE | ID: mdl-36712109

ABSTRACT

Prefix-free parsing is useful for a wide variety of purposes including building the BWT, constructing the suffix array, and supporting compressed suffix tree operations. This linear-time algorithm uses a rolling hash to break an input string into substrings, where the resulting set of unique substrings has the property that none of the substrings' suffixes (of more than a certain length) is a proper prefix of any of the other substrings' suffixes. Hence, the name prefix-free parsing. This set of unique substrings is referred to as the dictionary. The parse is the ordered list of dictionary strings that defines the input string. Prior empirical results demonstrated the size of the parse is more burdensome than the size of the dictionary for large, repetitive inputs. Hence, the question arises as to how the size of the parse can scale satisfactorily with the input. Here, we describe our algorithm, recursive prefix-free parsing, which accomplishes this by computing the prefix-free parse of the parse produced by prefix-free parsing an input string. Although conceptually simple, building the BWT from the parse-of-the-parse and the dictionaries is significantly more challenging. We solve and implement this problem. Our experimental results show that recursive prefix-free parsing is extremely effective in reducing the memory needed to build the run-length encoded BWT of the input. Our implementation is open source and available at https://github.com/marco-oliva/r-pfbwt.
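
The dictionary and parse described above can be illustrated with a single level of prefix-free parsing in Python. This is a simplified sketch under stated assumptions, not the r-pfbwt implementation: the `#` and `$` sentinels, the window length `w`, the modulus `p`, the toy hash (recomputed per window instead of rolled) and all function names are chosen for illustration only.

```python
def window_hash(window: str) -> int:
    """Toy polynomial hash; a real implementation would roll it in O(1)."""
    value = 0
    for ch in window:
        value = (value * 256 + ord(ch)) % 1_000_000_007
    return value


def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Split `text` into overlapping phrases; return (dictionary, parse).

    A phrase ends at every length-w window whose hash is 0 modulo p,
    and the next phrase starts with that same window, so consecutive
    phrases overlap by exactly w characters.
    """
    s = "#" + text + "$" * w  # assumed sentinels, absent from `text`
    phrases, start = [], 0
    last = len(s) - w
    for i in range(1, last + 1):
        if i == last or window_hash(s[i:i + w]) % p == 0:
            phrases.append(s[start:i + w])
            start = i
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary: list[str], parse: list[int], w: int) -> str:
    """Rebuild the sentinel-padded text from the dictionary and parse."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]  # drop the w overlapping characters
    return out


if __name__ == "__main__":
    dictionary, parse = prefix_free_parse("GATTACAGATTACAGATTACA")
    print(dictionary, parse)
```

The recursive variant described in the abstract would apply the same parsing step again to the parse itself, treated as a string over phrase ranks; that second level is not shown here.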

14.
Nucleic Acids Res ; 51(D1): D744-D752, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36382407

ABSTRACT

Antimicrobial resistance (AMR) is considered a critical threat to public health, and genomic/metagenomic investigations featuring high-throughput analysis of sequence data are increasingly common and important. We previously introduced MEGARes, a comprehensive AMR database with an acyclic hierarchical annotation structure that facilitates high-throughput computational analysis, as well as AMR++, a customized bioinformatic pipeline specifically designed to use MEGARes in high-throughput analysis for characterizing AMR genes (ARGs) in metagenomic sequence data. Here, we present MEGARes v3.0, a comprehensive database of published ARG sequences for antimicrobial drugs, biocides, and metals, and AMR++ v3.0, an update to our customized bioinformatic pipeline for high-throughput analysis of metagenomic data (available at MEGLab.org). Database annotations have been expanded to include information regarding specific genomic locations for single-nucleotide polymorphisms (SNPs) and insertions and/or deletions (indels) when required by specific ARGs for resistance expression, and the updated AMR++ pipeline uses this information to check for presence of resistance-conferring genetic variants in metagenomic sequenced reads. This new information encompasses 337 ARGs, whose resistance-conferring variants could not previously be confirmed in such a manner. In MEGARes 3.0, the nodes of the acyclic hierarchical ontology include 4 antimicrobial compound types, 59 resistance classes, 233 mechanisms and 1448 gene groups that classify the 8733 accessions.


Subject(s)
Anti-Bacterial Agents , Anti-Infective Agents , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Software , High-Throughput Nucleotide Sequencing
15.
Front Genet ; 13: 1024577, 2022.
Article in English | MEDLINE | ID: mdl-36568361

ABSTRACT

Horizontal gene transfer mediated by conjugation is considered an important evolutionary mechanism of bacteria. It allows organisms to quickly evolve new phenotypic properties including antimicrobial resistance (AMR) and virulence. The frequency of conjugation-mediated cargo gene exchange has not yet been comprehensively studied within and between bacterial taxa. We developed a frequency-based network of genus-genus conjugation features and candidate cargo genes from whole-genome sequence data of over 180,000 bacterial genomes, representing 1,345 genera. Using our method, which we refer to as ggMOB, we revealed that over half of the bacterial genomes contained one or more known conjugation features that matched exactly to at least one other genome. Moreover, the proportion of genomes containing these conjugation features varied substantially by genus and conjugation feature. These results and the genus-level network structure can be viewed interactively in the ggMOB interface, which allows for user-defined filtering of conjugation features and candidate cargo genes. Using the network data, we observed that the ratio of AMR gene representation in conjugative versus non-conjugative genomes exceeded 5:1, confirming that conjugation is a critical force for AMR spread across genera. Finally, we demonstrated that clustering genomes by conjugation profile sometimes correlated well with classical phylogenetic structuring; but that in some cases the clustering was highly discordant, suggesting that the importance of the accessory genome in driving bacterial evolution may be highly variable across both time and taxonomy. These results can advance scientific understanding of bacterial evolution, and can be used as a starting point for probing genus-genus gene exchange within complex microbial communities that include unculturable bacteria. ggMOB is publicly available under the GNU licence at https://ruiz-hci-lab.github.io/ggMOB/.

16.
Microbiome ; 10(1): 185, 2022 11 02.
Article in English | MEDLINE | ID: mdl-36324140

ABSTRACT

BACKGROUND: Metagenomic data can be used to profile high-importance genes within microbiomes. However, current metagenomic workflows produce data that suffer from low sensitivity and an inability to accurately reconstruct partial or full genomes, particularly those in low abundance. These limitations preclude colocalization analysis, i.e., characterizing the genomic context of genes and functions within a metagenomic sample. Genomic context is especially crucial for functions associated with horizontal gene transfer (HGT) via mobile genetic elements (MGEs), for example antimicrobial resistance (AMR). To overcome this current limitation of metagenomics, we present a method for comprehensive and accurate reconstruction of antimicrobial resistance genes (ARGs) and MGEs from metagenomic DNA, termed target-enriched long-read sequencing (TELSeq). RESULTS: Using technical replicates of diverse sample types, we compared TELSeq performance to that of non-enriched PacBio and short-read Illumina sequencing. TELSeq achieved much higher ARG recovery (>1,000-fold) and sensitivity than the other methods across diverse metagenomes, revealing an extensive resistome profile comprising many low-abundance ARGs, including some with public health importance. Using the long reads generated by TELSeq, we identified numerous MGEs and cargo genes flanking the low-abundance ARGs, indicating that these ARGs could be transferred across bacterial taxa via HGT. CONCLUSIONS: TELSeq can provide a nuanced view of the genomic context of microbial resistomes and thus has wide-ranging applications in public, animal, and human health, as well as environmental surveillance and monitoring of AMR. Thus, this technique represents a fundamental advancement for microbiome research and application. Video abstract.


Subject(s)
Anti-Bacterial Agents , Metagenome , Animals , Humans , Metagenome/genetics , Anti-Bacterial Agents/pharmacology , Genes, Bacterial , Drug Resistance, Bacterial/genetics , Metagenomics/methods
17.
Front Bioeng Biotechnol ; 10: 1016408, 2022.
Article in English | MEDLINE | ID: mdl-36324897

ABSTRACT

Nanopore technology enables portable, real-time sequencing of microbial populations from clinical and ecological samples. An emerging healthcare application for Nanopore includes point-of-care, timely identification of antibiotic resistance genes (ARGs) to help develop targeted treatments of bacterial infections, and monitoring resistant outbreaks in the environment. While several computational tools exist for classifying ARGs from sequencing data, to date (2022) none have been developed for mobile devices. We present here KARGAMobile, a mobile app for portable, real-time, easily interpretable analysis of ARGs from Nanopore sequencing. KARGAMobile is a port of an existing ARG identification tool named KARGA; it retains the same algorithmic structure, but it is optimized for mobile devices. Specifically, KARGAMobile employs a compressed ARG reference database and different internal data structures to save RAM usage. The KARGAMobile app features a friendly graphical user interface that guides users through file browsing, loading, parameter setup, and process execution. More importantly, the output files are post-processed to create visual, printable and shareable reports, helping users to interpret the ARG findings. The difference in classification performance between KARGAMobile and KARGA is minimal (96.2% vs. 96.9% f-measure on semi-synthetic datasets of 1 million reads with known resistance ground truth). Using real Nanopore experiments, KARGAMobile processes on average 1 GB of data every 23-48 min (targeted sequencing - metagenomics), with peak RAM usage below 500 MB, independently of input file sizes, and an average temperature of 49°C after 1 h of continuous data processing. KARGAMobile is written in Java and is available at https://github.com/Ruiz-HCI-Lab/KargaMobile under the MIT license.

19.
Artif Intell Med ; 130: 102326, 2022 08.
Article in English | MEDLINE | ID: mdl-35809965

ABSTRACT

Whole genome sequencing (WGS) is quickly becoming the customary means for identification of antimicrobial resistance (AMR) due to its ability to obtain high resolution information about the genes and mechanisms that are causing resistance and driving pathogen mobility. By contrast, traditional phenotypic (antibiogram) testing cannot easily elucidate such information. Yet development of AMR prediction tools from genotype-phenotype data can be biased, since sampling is non-randomized. Sample provenience, period of collection, and species representation can confound the association of genetic traits with AMR. Thus, prediction models can perform poorly on new data with sampling distribution shifts. In this work, under an explicit set of causal assumptions, we evaluate the effectiveness of propensity-based rebalancing and confounding adjustment on antibiotic resistance prediction using genotype-phenotype AMR data from the Pathosystems Resource Integration Center (PATRIC). We select bacterial genotypes (encoded as k-mer signatures, i.e., DNA fragments of length k), country, year, species, and AMR phenotypes for the tetracycline drug class, preparing test data with recent genomes coming from a single country. We test boosted logistic regression (BLR) and random forests (RF) with/without bias-handling. On 10,936 instances, we find evidence of species, location and year imbalance with respect to the AMR phenotype. The crude versus bias-adjusted change in effect of genetic signatures on AMR varies but only moderately (selecting the top 20,000 out of 40+ million k-mers). The area under the receiver operating characteristic (AUROC) of the RF (0.95) is comparable to that of BLR (0.94) on both out-of-bag samples from bootstrap and the external test (n = 1085), where AUROCs do not decrease. We observe a 1%-5% gain in AUROC with bias-handling compared to the sole use of genetic signatures. In conclusion, we recommend using causally-informed prediction methods for modeling real-world AMR data; however, traditional adjustment or propensity-based methods may not provide advantage in all use cases and further methodological development should be sought.


Subject(s)
Anti-Bacterial Agents , Genome, Bacterial , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Genotype , Microbial Sensitivity Tests , Whole Genome Sequencing/methods
20.
Infect Dis Ther ; 11(5): 1869-1882, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35908268

ABSTRACT

INTRODUCTION: Urinary tract infections (UTIs) are common infections for which initial antibiotic treatment decisions are empirically based, often without antibiotic susceptibility testing to evaluate resistance, increasing the risk of inappropriate therapy. We hypothesized that models based on electronic health records (EHR) could assist in the identification of patients at higher risk for antibiotic-resistant UTIs and help guide the selection of antimicrobials in hospital and clinic settings. METHODS: EHR from multiple centers in North-Central Florida, including patient demographics, previous diagnoses, prescriptions, and antibiotic susceptibility tests, were obtained for 9990 patients diagnosed with a UTI during 2011-2019. Decision trees, boosted logistic regression (BLR), and random forest models were developed to predict resistance to common antibiotics used for UTI management [sulfamethoxazole-trimethoprim (SXT), nitrofurantoin (NIT), ciprofloxacin (CIP)] and multidrug resistance (MDR). RESULTS: There were 6307 (63.1%) individuals with a UTI caused by a resistant microorganism. Overall, the population was majority female, white, non-Hispanic, and older aged (mean = 60.7 years). The BLR models yielded the highest discriminative ability, as measured by the out-of-bag area under the receiver-operating curve (AUROC), for the resistance outcomes [AUROC = 0.58 (SXT), 0.62 (NIT), 0.64 (CIP), and 0.66 (MDR)]. Variables in the best performing model were sex, history of UTIs, catheterization, renal disease, dementia, hemiplegia/paraplegia, and hypertension. CONCLUSIONS: The discriminative ability of the prediction models was moderate. Nonetheless, these models based solely on EHR demonstrate utility for the identification of patients at higher risk for resistant infections. These models, in turn, may help guide clinical decision-making on the ordering of urine cultures and decisions regarding empiric therapy for these patients.
