ABSTRACT
The COVID-19 pandemic is marked by the successive emergence of new SARS-CoV-2 variants, lineages, and sublineages that outcompete earlier strains, largely due to factors such as increased transmissibility and immune escape. We propose DeepAutoCoV, an unsupervised deep learning anomaly detection system, to predict future dominant lineages (FDLs). We define FDLs as viral (sub)lineages that will constitute >10% of all the viral sequences added in a given week to GISAID, a public database supporting viral genetic sequence sharing. DeepAutoCoV is trained and validated on global and country-specific data sets assembled from over 16 million Spike protein sequences sampled over a period of ~4 years. DeepAutoCoV successfully flags FDLs at very low frequencies (0.01%-3%), with median lead times of 4-17 weeks, and predicts FDLs between ~5 and ~25 times better than a baseline approach. For example, the B.1.617.2 vaccine reference strain was flagged as an FDL when its frequency was only 0.01%, more than a year before it was considered for an updated COVID-19 vaccine. Furthermore, DeepAutoCoV outputs interpretable results by pinpointing specific mutations potentially linked to increased fitness, and it may provide significant insights for optimizing 'pre-emptive' public health intervention strategies.
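The FDL definition above (a lineage exceeding 10% of the sequences added in a given week) can be illustrated with a minimal sketch. This is not DeepAutoCoV itself, only the labelling rule; the function name, the `(week, lineage)` record layout, and the toy counts are assumptions made for illustration.

```python
from collections import Counter, defaultdict

FDL_THRESHOLD = 0.10  # from the abstract: >10% of sequences added in a week


def weekly_dominant_lineages(records, threshold=FDL_THRESHOLD):
    """Map each week to the set of lineages exceeding `threshold` that week.

    `records` is an iterable of (week, lineage) pairs, one per sequence.
    The record layout is hypothetical, chosen only for this sketch.
    """
    per_week = defaultdict(Counter)
    for week, lineage in records:
        per_week[week][lineage] += 1

    dominant = {}
    for week, counts in per_week.items():
        total = sum(counts.values())
        dominant[week] = {lin for lin, n in counts.items() if n / total > threshold}
    return dominant


# Toy, made-up submissions: one (week, lineage) pair per sequence.
toy = (
    [("2021-W10", "B.1.1.7")] * 95 + [("2021-W10", "B.1.617.2")] * 5
    + [("2021-W20", "B.1.1.7")] * 60 + [("2021-W20", "B.1.617.2")] * 40
)
result = weekly_dominant_lineages(toy)
```

In the toy data, B.1.617.2 sits at 5% in the first week (below the threshold) and at 40% in the later week, so it only counts as dominant in the later one. An early-warning system would aim to flag it while it is still rare.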
Subject(s)
COVID-19 , Deep Learning , SARS-CoV-2 , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification , COVID-19/virology , COVID-19/epidemiology , Humans , Spike Glycoprotein, Coronavirus/genetics , Forecasting/methods , Pandemics
ABSTRACT
MOTIVATION: The World Health Organization estimates that there were over 10 million cases of tuberculosis (TB) worldwide in 2019, resulting in over 1.4 million deaths, with a worrisome increasing trend yearly. The disease is caused by Mycobacterium tuberculosis (MTB) through airborne transmission. Treatment of TB is estimated to be 85% successful; however, this drops to 57% if MTB exhibits multiple antimicrobial resistance (AMR), for which fewer treatment options are available. RESULTS: We develop a robust machine-learning classifier using both linear and nonlinear models (i.e. LASSO logistic regression (LR) and random forests (RF)) to predict the phenotypic resistance of MTB to a broad range of antibiotic drugs. We train our classifier on data from the CRyPTIC consortium, which consist of whole genome sequencing and antibiotic susceptibility testing (AST) phenotypic data for 13 different antibiotics. To train our model, we assemble the sequence data into genomic contigs, identify all unique 31-mers in the set of contigs, and build a feature matrix M, where M[i, j] is equal to the number of times the ith 31-mer occurs in the jth genome. Due to the size of this feature matrix (over 350 million unique 31-mers), we build and use a sparse matrix representation. Our method, which we refer to as MTB++, leverages compact data structures and iterative methods to allow for the screening of all the 31-mers in the development of both LASSO LR and RF. MTB++ achieves high discrimination (F-1 > 80%) for the first-line antibiotics. Moreover, MTB++ had the highest F-1 score in all but three classes and was the most comprehensive, since it had an F-1 score > 75% in all but four (rare) antibiotic drugs. We use our feature selection to contextualize the 31-mers that are used for the prediction of phenotypic resistance, leading to some insights about sequence similarity to genes in MEGARes. Lastly, we give an estimate of the amount of data that is needed in order to provide accurate predictions. AVAILABILITY: The models and source code are publicly available on GitHub at https://github.com/M-Serajian/MTB-Pipeline.
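The feature matrix described above, with M[i, j] counting occurrences of the ith k-mer in the jth genome, can be sketched in a few lines. This is a minimal illustration and not the MTB++ implementation: it stores the sparse matrix as a plain dict of dicts rather than the compact data structures the authors mention, it ignores reverse complements, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(contigs, k):
    """Count every k-mer across the contigs of one genome."""
    counts = Counter()
    for contig in contigs:
        for start in range(len(contig) - k + 1):
            counts[contig[start:start + k]] += 1
    return counts


def build_sparse_matrix(genomes, k=31):
    """Return (kmer_index, M) where M[i][j] is the count of k-mer i in genome j.

    Only non-zero entries are stored, so absent k-mers cost no memory.
    """
    kmer_index = {}            # k-mer string -> row index i
    matrix = defaultdict(dict)  # i -> {j: count}
    for j, contigs in enumerate(genomes):
        for kmer, n in kmer_counts(contigs, k).items():
            i = kmer_index.setdefault(kmer, len(kmer_index))
            matrix[i][j] = n
    return kmer_index, dict(matrix)


# Toy example with k=3 instead of 31 so the rows stay readable.
genomes = [["ACGTAC"], ["ACGACG", "TTT"]]
index, M = build_sparse_matrix(genomes, k=3)
```

Here `M[index["ACG"]]` holds one entry per genome containing ACG, while a k-mer seen in a single genome occupies a single cell. With hundreds of millions of rows, that is the property that makes a sparse layout necessary.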
Subject(s)
Machine Learning , Mycobacterium tuberculosis , Mycobacterium tuberculosis/genetics , Mycobacterium tuberculosis/drug effects , Drug Resistance, Bacterial/genetics , Microbial Sensitivity Tests , Anti-Bacterial Agents/pharmacology , Whole Genome Sequencing/methods , Genome, Bacterial , Humans
ABSTRACT
In the midst of an outbreak or sustained epidemic, reliable prediction of transmission risks and patterns of spread is critical to inform public health programs. Projections of transmission growth or decline among specific risk groups can aid in optimizing interventions, particularly when resources are limited. Phylogenetic trees have been widely used in the detection of transmission chains and high-risk populations. Moreover, tree topology and the incorporation of population parameters (phylodynamics) can be useful in reconstructing the evolutionary dynamics of an epidemic across space and time among individuals. We now demonstrate the utility of phylodynamic trees for transmission modeling and forecasting, developing a phylogeny-based deep learning system, referred to as DeepDynaForecast. Our approach leverages a primal-dual graph learning structure with shortcut multi-layer aggregation, which is suited for the early identification and prediction of transmission dynamics in emerging high-risk groups. We demonstrate the accuracy of DeepDynaForecast using simulated outbreak data and the utility of the learned model using empirical, large-scale data from the human immunodeficiency virus epidemic in Florida between 2012 and 2020. Our framework is available as open-source software (MIT license) at github.com/lab-smile/DeepDynaForcast.
Subject(s)
Computational Biology , Deep Learning , Epidemics , Phylogeny , Humans , Epidemics/statistics & numerical data , Computational Biology/methods , HIV Infections/transmission , HIV Infections/epidemiology , Software , Florida/epidemiology , Algorithms , Computer Simulation , Disease Outbreaks/statistics & numerical data
ABSTRACT
Antimicrobial resistance (AMR) is considered a critical threat to public health, and genomic/metagenomic investigations featuring high-throughput analysis of sequence data are increasingly common and important. We previously introduced MEGARes, a comprehensive AMR database with an acyclic hierarchical annotation structure that facilitates high-throughput computational analysis, as well as AMR++, a customized bioinformatic pipeline specifically designed to use MEGARes in high-throughput analysis for characterizing AMR genes (ARGs) in metagenomic sequence data. Here, we present MEGARes v3.0, a comprehensive database of published ARG sequences for antimicrobial drugs, biocides, and metals, and AMR++ v3.0, an update to our customized bioinformatic pipeline for high-throughput analysis of metagenomic data (available at MEGLab.org). Database annotations have been expanded to include information regarding specific genomic locations for single-nucleotide polymorphisms (SNPs) and insertions and/or deletions (indels) when required by specific ARGs for resistance expression, and the updated AMR++ pipeline uses this information to check for presence of resistance-conferring genetic variants in metagenomic sequenced reads. This new information encompasses 337 ARGs, whose resistance-conferring variants could not previously be confirmed in such a manner. In MEGARes 3.0, the nodes of the acyclic hierarchical ontology include 4 antimicrobial compound types, 59 resistance classes, 233 mechanisms and 1448 gene groups that classify the 8733 accessions.
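The acyclic hierarchy described above (compound type, resistance class, mechanism, gene group, accession) lends itself to simple roll-up counts. The sketch below assumes a pipe-delimited header of the form `accession|type|class|mechanism|group`; that layout and the example headers are assumptions for illustration and should be checked against the actual database files.

```python
from collections import Counter

LEVELS = ("accession", "type", "class", "mechanism", "group")


def parse_header(header):
    """Split an assumed 'accession|type|class|mechanism|group' header.

    Any extra trailing fields (for example a variant-confirmation flag)
    are ignored by this sketch.
    """
    fields = header.split("|")
    if len(fields) < len(LEVELS):
        raise ValueError(f"expected at least {len(LEVELS)} fields: {header!r}")
    return dict(zip(LEVELS, fields))


def count_by_level(headers, level):
    """Count accessions under each node at one level of the hierarchy."""
    return Counter(parse_header(h)[level] for h in headers)


# Hypothetical headers, invented for this example.
headers = [
    "ACC_1|Drugs|Aminoglycosides|Mechanism_A|GROUP1",
    "ACC_2|Drugs|Aminoglycosides|Mechanism_A|GROUP2",
    "ACC_3|Drugs|betalactams|Mechanism_B|GROUP3|ExtraFlag",
    "ACC_4|Metals|Copper_resistance|Mechanism_C|GROUP4",
]
by_type = count_by_level(headers, "type")
by_class = count_by_level(headers, "class")
```

Because every accession has exactly one parent at each level, counts at a coarse level (type) are always the sum of the counts of the finer nodes beneath it. This is what allows read counts to be aggregated up the tree without double counting.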
Subject(s)
Anti-Bacterial Agents , Anti-Infective Agents , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Software , High-Throughput Nucleotide Sequencing
ABSTRACT
Antimicrobial resistance (AMR) is a growing threat to public health and farming at large. In clinical and veterinary practice, timely characterization of the antibiotic susceptibility profile of bacterial infections is a crucial step in optimizing treatment. High-throughput sequencing is a promising option for clinical point-of-care and ecological surveillance, opening the opportunity to develop genotyping-based AMR determination as a possibly faster alternative to phenotypic testing. In the present work, we compare the performance of state-of-the-art methods for detection of AMR using high-throughput sequencing data from clinical settings. We consider five computational approaches based on alignment (AMRPlusPlus), deep learning (DeepARG), k-mer genomic signatures (KARGA, ResFinder) or hidden Markov models (Meta-MARC). We use an extensive collection of 585 isolates with available AMR profiles determined by phenotypic tests across nine antibiotic classes. We show that the prediction landscape of AMR classifiers is highly heterogeneous, with balanced accuracy varying from 0.40 to 0.92. Although some algorithms (ResFinder, KARGA and AMRPlusPlus) exhibit overall better balanced accuracy than others, the high per-AMR-class variance and related findings suggest that: (1) all algorithms might be subject to sampling bias both in data repositories used for training and in experimental/clinical settings; and (2) a portion of clinical samples might contain uncharacterized AMR genes to which the algorithms, mostly trained on known AMR genes, fail to generalize. These results lead us to formulate practical advice for software configuration and application, and to give suggestions for future study designs to further develop AMR prediction tools from proof-of-concept to bedside.
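Balanced accuracy, the metric reported above, is the mean of sensitivity and specificity, which keeps a classifier from scoring well simply by predicting the majority class. A minimal sketch follows; the function name and the toy labels are illustrative assumptions, not data from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = resistant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy phenotypes (truth) versus genotype-based calls for ten isolates.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)

# A classifier that always predicts "susceptible" gets 0.5, not 0.6.
always_negative = balanced_accuracy(y_true, [0] * 10)
```

A value near 0.5 therefore indicates chance-level behaviour regardless of class imbalance, which is the context needed to read a 0.40 to 0.92 range.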
Subject(s)
Anti-Bacterial Agents , Drug Resistance, Bacterial , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial/genetics , Employment , High-Throughput Nucleotide Sequencing , Microbial Sensitivity Tests
ABSTRACT
AIM: To develop an automated computable phenotype (CP) algorithm for identifying diabetes cases in children and adolescents using electronic health records (EHRs) from the UF Health System. MATERIALS AND METHODS: The CP algorithm was iteratively derived based on structured data from EHRs (UF Health System 2012-2020). We randomly selected 536 presumed cases among individuals aged <18 years who had (1) glycated haemoglobin levels ≥ 6.5%; or (2) fasting glucose levels ≥126 mg/dL; or (3) random plasma glucose levels ≥200 mg/dL; or (4) a diabetes-related diagnosis code from an inpatient or outpatient encounter; or (5) prescribed, administered, or dispensed diabetes-related medication. Four reviewers independently reviewed the patient charts to determine diabetes status and type. RESULTS: Presumed cases without type 1 (T1D) or type 2 diabetes (T2D) diagnosis codes were categorized as non-diabetes/other types of diabetes. The rest were categorized as T1D if the most recent diagnosis was T1D, or otherwise categorized as T2D if the most recent diagnosis was T2D. Next, we applied a list of diagnoses and procedures that can determine diabetes type (e.g., steroid use suggests induced diabetes) to correct misclassifications from Step 1. Among the 536 reviewed cases, 159 and 64 had T1D and T2D, respectively. The sensitivity, specificity, and positive predictive values of the CP algorithm were 94%, 98% and 96%, respectively, for T1D and 95%, 95% and 73% for T2D. CONCLUSION: We developed a highly accurate EHR-based CP for diabetes in youth based on EHR data from UF Health. Consistent with prior studies, T2D was more difficult to identify using these methods.
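The five screening criteria used above to select presumed cases can be written as a single predicate. This is a sketch of the inclusion rule only, not the authors' computable phenotype (which adds the type-classification and correction steps); the `Patient` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    """Hypothetical flattened EHR record; field names are illustrative."""
    age: int
    hba1c_pct: Optional[float] = None
    fasting_glucose_mg_dl: Optional[float] = None
    random_glucose_mg_dl: Optional[float] = None
    has_diabetes_dx_code: bool = False
    has_diabetes_medication: bool = False


def at_least(value, cutoff):
    """True only when a measurement exists and meets the cutoff."""
    return value is not None and value >= cutoff


def is_presumed_case(p):
    """Apply the abstract's criteria (1)-(5) to individuals aged <18 years."""
    if p.age >= 18:
        return False
    return (
        at_least(p.hba1c_pct, 6.5)                  # (1) glycated haemoglobin
        or at_least(p.fasting_glucose_mg_dl, 126)   # (2) fasting glucose
        or at_least(p.random_glucose_mg_dl, 200)    # (3) random plasma glucose
        or p.has_diabetes_dx_code                   # (4) diagnosis code
        or p.has_diabetes_medication                # (5) medication
    )


flagged = is_presumed_case(Patient(age=12, hba1c_pct=6.5))
not_flagged = is_presumed_case(Patient(age=12, fasting_glucose_mg_dl=110))
adult = is_presumed_case(Patient(age=30, hba1c_pct=9.0))
```

The criteria are joined by "or", so the rule is deliberately broad. That breadth is why chart review and the later classification steps are needed to separate true T1D and T2D cases from the other presumed cases.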
ABSTRACT
Substance use disorder (SUD), a common comorbidity among people with HIV (PWH), adversely affects HIV clinical outcomes and HIV-related comorbidities. However, less is known about the incidence of different chronic conditions, changes in overall comorbidity burden, and health care utilization by SUD status and patterns among PWH in Florida, an area disproportionately affected by the HIV epidemic. We used electronic health records (EHR) from a large southeastern US consortium, the OneFlorida+ clinical research data network. We identified a cohort of PWH with 3+ years of EHRs after the first visit with an HIV diagnosis. International Classification of Diseases (ICD) codes were used to identify SUD and comorbidity conditions listed in the Charlson comorbidity index (CCI). A total of 42,271 PWH were included (mean age 44.5, 52% Black, 45% female). The prevalence of SUD among PWH was 45.1%. Having a SUD diagnosis among PWH was associated with a higher incidence of most of the conditions listed in the CCI and a faster increase in CCI score over time (rate ratio = 1.45, 95% CI 1.42, 1.49). SUD in PWH was associated with a higher mean number of any care visits (21.7 vs. 14.8) and more frequent emergency department (ED, 3.5 vs. 2.0) and inpatient (8.5 vs. 24.5) visits compared to those without SUD. SUD among PWH was associated with a higher comorbidity burden and more frequent ED and inpatient visits than PWH without a diagnosis of SUD. The high SUD prevalence and comorbidity burden call for improved SUD screening, treatment, and integrated care among PWH.
Subject(s)
Comorbidity , HIV Infections , Patient Acceptance of Health Care , Substance-Related Disorders , Humans , Female , Florida/epidemiology , Male , HIV Infections/epidemiology , Adult , Substance-Related Disorders/epidemiology , Middle Aged , Patient Acceptance of Health Care/statistics & numerical data , Prevalence , Incidence , Electronic Health Records , Cost of Illness
ABSTRACT
HIV-related stigma is a key contributor to poor HIV-related health outcomes. The purpose of this study is to explore implementing a stigma measure in routine HIV care, focusing on the 10-item Medical Monitoring Project measure as a proposed measure. Healthcare providers engaged in HIV-related care in Florida were recruited. Participants completed an interview about their perceptions of measures to assess stigma during clinical care. The analysis followed a directed content approach. Fifteen participants completed the interviews (87% female, 47% non-Hispanic White, 40% case managers). Most providers thought that talking about stigma would be helpful (89%). Three major themes emerged from the analysis: acceptability, subscales of interest, and utility. Regarding acceptability, participants mentioned that assessing stigma could encourage patient-centered care and serve as a conversation starter, but some mentioned not having enough time. Participants thought that the disclosure concerns and negative self-image subscales were most relevant. Some worried they would not have resources for patients or that some issues were beyond their influence. Participants were generally supportive of routinely addressing HIV-related stigma in clinical care, but were concerned that resources, especially to address concerns about disclosure and negative self-image, were not available.
Subject(s)
HIV Infections , Humans , Female , Male , Florida , Social Stigma , Anxiety , Disclosure
ABSTRACT
Long-acting injectable (LAI) antiretroviral therapy (ART) is available to people with HIV (PWH), but it is unknown which PWH prefer this option. Using the Andersen Behavioral Model, this study identifies characteristics of PWH with greater preference for LAI ART. Cross-sectional data from the Florida Cohort, which enrolled adult PWH from community-based clinics, included information on predisposing (demographics), enabling (transportation, income), and need (ART adherence <90%) factors. ART preference was assessed via a single question (prefer pills, quarterly LAI, or no preference). Confounder-adjusted multinomial logistic regressions compared those who preferred pills to the other preference options, with covariates identified using directed acyclic graphs. Overall, 314 participants responded (40% non-Hispanic Black, 62% assigned male, 63% aged 50+). Most (63%) preferred the hypothetical LAI, 23% preferred pills, and 14% had no preference. PWH with access to a car (aRRR 1.97, 95% CI 1.05-3.71), higher income (aRRR 2.55, 95% CI 1.04-6.25), and suboptimal ART adherence (aRRR 7.41, 95% CI 1.52-36.23) were more likely to prefer the LAI, while those who reported having no social network were less likely to prefer the LAI (aRRR 0.32, 95% CI 0.11-0.88). Overall LAI interest was high, with greater preference associated with enabling and need factors.
Subject(s)
Anti-HIV Agents , HIV Infections , Medication Adherence , Patient Preference , Humans , Male , Female , HIV Infections/drug therapy , Florida , Middle Aged , Cross-Sectional Studies , Adult , Medication Adherence/statistics & numerical data , Anti-HIV Agents/therapeutic use , Anti-HIV Agents/administration & dosage , Injections , Delayed-Action Preparations/therapeutic use
ABSTRACT
BACKGROUND: Racial/ethnic disparities in the HIV care continuum have been well documented in the US, with especially striking inequalities in viral suppression rates between White and Black persons with HIV (PWH). The South is considered an epicenter of the HIV epidemic in the US, with the largest population of PWH living in Florida. It is unclear whether any disparities in viral suppression or immune reconstitution (a clinical outcome highly correlated with overall prognosis) have changed over time or are homogeneous geographically. In this analysis, we 1) investigate longitudinal trends in viral suppression and immune reconstitution among PWH in Florida, 2) examine the impact of socio-ecological factors on the association between race/ethnicity and clinical outcomes, and 3) explore spatial and temporal variations in disparities in clinical outcomes. METHODS: Data were obtained from the Florida Department of Health for 42,369 PWH enrolled in the Ryan White program during 2008-2020. We linked the data to county-level socio-ecological variables available from County Health Rankings. GEE models were fit to assess the effect of race/ethnicity on immune reconstitution and viral suppression longitudinally. Poisson Bayesian hierarchical models were fit to analyze geographic variations in racial/ethnic disparities while adjusting for socio-ecological factors. RESULTS: Proportions of PWH who experienced viral suppression and immune reconstitution rose by 60% and 45%, respectively, from 2008 to 2020. Odds of immune reconstitution and viral suppression were significantly higher among White [odds ratio = 2.34, 95% credible interval = 2.14-2.56; 1.95 (1.85-2.05)] and Hispanic [1.70 (1.54-1.87); 2.18 (2.07-2.31)] PWH, compared with Black PWH. These findings remained unchanged after accounting for socio-ecological factors. Rural and urban counties in north-central Florida saw the largest racial/ethnic disparities. CONCLUSIONS: There is persistent, spatially heterogeneous, racial/ethnic disparity in HIV clinical outcomes in Florida. This disparity could not be explained by socio-ecological factors, suggesting that further research on modifiable factors that can improve HIV outcomes among Black and Hispanic PWH in Florida is needed.
Subject(s)
Ethnicity , HIV Infections , Humans , Bayes Theorem , Florida/epidemiology , Healthcare Disparities , Hispanic or Latino , HIV Infections/epidemiology , White , Black or African American
ABSTRACT
SUMMARY: TARDiS is a novel phylogenetic tool for optimal genetic subsampling. It optimizes both genetic diversity and temporal distribution through a genetic algorithm. AVAILABILITY AND IMPLEMENTATION: TARDiS, along with example datasets and a user manual, is available at https://github.com/smarini/tardis-phylogenetics.
Subject(s)
Genome, Viral , Software , Phylogeny , Genetic Variation
ABSTRACT
BACKGROUND: Prognostic models of hospital-induced delirium that include potential predisposing and precipitating factors may be used to identify vulnerable patients and inform the implementation of tailored preventive interventions. It is recommended that, in prediction model development studies, candidate predictors are selected on the basis of existing knowledge, including knowledge from clinical practice. The purpose of this article is to describe the process of identifying and operationalizing candidate predictors of hospital-induced delirium for application in a prediction model development study using a practice-based approach. METHODS: This study is part of a larger, retrospective cohort study that is developing prognostic models of hospital-induced delirium for medical-surgical older adult patients using structured data from administrative and electronic health records. First, we conducted a review of the literature to identify clinical concepts that had been used as candidate predictors in prognostic model development-and-validation studies of hospital-induced delirium. Then, we consulted a multidisciplinary task force of nine members who independently judged whether each clinical concept was associated with hospital-induced delirium. Finally, we mapped the clinical concepts to the administrative and electronic health records and operationalized our candidate predictors. RESULTS: In the review of 34 studies, we identified 504 unique clinical concepts. Two-thirds of the clinical concepts (337/504) were used as candidate predictors only once. The most common clinical concepts included age (31/34), sex (29/34), and alcohol use (22/34). Nearly all of the clinical concepts (484/504, 96%) were judged to be associated with the development of hospital-induced delirium by at least two members of the task force. All of the task force members agreed that 47 (9%) of the 504 clinical concepts were associated with hospital-induced delirium. CONCLUSIONS: Heterogeneity among candidate predictors of hospital-induced delirium in the literature suggests a still-evolving list of factors that contribute to the development of this complex phenomenon. We demonstrated a practice-based approach to variable selection for our model development study of hospital-induced delirium. Expert judgement of variables enabled us to categorize the variables based on the amount of agreement among the experts and to plan for the development of different models, including an expert model and a data-driven model.
Subject(s)
Advisory Committees , Delirium , Humans , Aged , Retrospective Studies , Alcohol Drinking , Hospitals , Delirium/diagnosis
ABSTRACT
BACKGROUND: The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Delta variant has caused a dramatic resurgence in infections in the United States, raising questions regarding potential transmissibility among vaccinated individuals. METHODS: Between October 2020 and July 2021, we sequenced 4439 SARS-CoV-2 full genomes, 23% of all known infections in Alachua County, Florida, including 109 vaccine breakthrough cases. Univariate and multivariate regression analyses were conducted to evaluate associations between viral RNA burden and patient characteristics. Contact tracing and phylogenetic analysis were used to investigate direct transmissions involving vaccinated individuals. RESULTS: The majority of breakthrough sequences with lineage assignment were classified as Delta variants (74.6%) and occurred, on average, about 3 months (104 ± 57.5 days) after full vaccination, coinciding with the exponential spread of the Delta variant within the county (June-July 2021). Six Delta variant transmission pairs between fully vaccinated individuals were identified through contact tracing, 3 of which were confirmed by phylogenetic analysis. Delta breakthroughs exhibited broad viral RNA copy number values during acute infection (interquartile range, 1.2-8.64 Log copies/mL), on average 38% lower than matched unvaccinated patients (3.29-10.81 Log copies/mL, P < .00001). Nevertheless, 49% to 50% of all breakthroughs, and 56% to 60% of Delta-infected breakthroughs, exhibited viral RNA levels above the transmissibility threshold (4 Log copies/mL) irrespective of time after vaccination. CONCLUSIONS: Delta infection transmissibility and general viral RNA quantification patterns in vaccinated individuals suggest limited levels of sterilizing immunity that need to be considered by public health policies. In particular, ongoing evaluation of vaccine boosters should specifically address whether extra vaccine doses curb the contribution of breakthrough infections to epidemic spread.
Subject(s)
COVID-19 , Viral Vaccines , Humans , SARS-CoV-2/genetics , RNA, Viral/genetics , Phylogeny , Florida/epidemiology , COVID-19/epidemiology , COVID-19/prevention & control , Vaccination
ABSTRACT
HIV care engagement is a dynamic process. We employed group-based trajectory modeling to examine longitudinal patterns in care engagement among people who were newly diagnosed with HIV and enrolled in the Ryan White program in Florida (n = 9,755) between 2010 and 2015. Five trajectories were identified (47.9% "in care" with 1-2 care visits per 6 months, 18.0% "frequent care" with 3 or more care visits per 6 months, 11.0% "re-engage", 11.0% "gradual dropout", 12.6% "early dropout") based on the number of care attendances (including outpatient/case management visits and viral load or CD4 tests) in each six-month period during the first five years after diagnosis. Relative to "in care", people in the "frequent care" trajectory were more likely to be Hispanic/Latino and older at HIV diagnosis, whereas people in the three suboptimal care retention trajectories were more likely to be younger. Area deprivation index, rurality, and county health rankings were also strongly associated with care trajectories. Individual- and community-level factors associated with the three suboptimal care retention trajectories, if confirmed to be causative and actionable, could be prioritized to improve HIV care engagement.
Subject(s)
HIV Infections , Retention in Care , Case Management , Florida/epidemiology , HIV Infections/diagnosis , HIV Infections/drug therapy , HIV Infections/epidemiology , Humans , Viral Load
ABSTRACT
BACKGROUND: Persons living with human immunodeficiency virus (HIV) with resistance to antiretroviral therapy are vulnerable to adverse HIV-related health outcomes and can contribute to transmission of HIV drug resistance (HIVDR) when not virally suppressed. The degree to which HIVDR contributes to disease burden in Florida, the US state with the highest HIV incidence, is unknown. METHODS: We explored sociodemographic, ecological, and spatiotemporal associations of HIVDR. HIV-1 sequences (n = 34 447) collected during 2012-2017 were obtained from the Florida Department of Health. HIVDR was categorized by resistance class, including resistance to nucleoside reverse-transcriptase, nonnucleoside reverse-transcriptase, protease, and integrase inhibitors. Multidrug resistance and transmitted drug resistance were also evaluated. Multivariable fixed-effects logistic regression models were fitted to associate individual- and county-level sociodemographic and ecological health indicators with HIVDR. RESULTS: The HIVDR prevalence was 19.2% (nucleoside reverse-transcriptase inhibitor resistance), 29.7% (nonnucleoside reverse-transcriptase inhibitor resistance), 6.6% (protease inhibitor resistance), 23.5% (transmitted drug resistance), 13.2% (multidrug resistance), and 8.2% (integrase strand transfer inhibitor resistance), with significant variation by Florida county. Individuals who were older, Black, or acquired HIV through mother-to-child transmission had significantly higher odds of HIVDR. HIVDR was linked to counties with lower socioeconomic status, higher rates of unemployment, and poor mental health. CONCLUSIONS: Our findings indicate that HIVDR prevalence is higher in Florida than aggregate North American estimates, with significant geographic and socioecological heterogeneity.
Subject(s)
Anti-HIV Agents , Drug Resistance, Viral , HIV Infections , HIV-1 , Anti-HIV Agents/therapeutic use , DNA-Directed RNA Polymerases , Florida/epidemiology , HIV Infections/drug therapy , HIV Infections/epidemiology , HIV-1/drug effects , HIV-1/genetics , Humans , Infectious Disease Transmission, Vertical , Mutation , Nucleosides/therapeutic use , Retrospective Studies , Reverse Transcriptase Inhibitors/therapeutic use , Sociodemographic Factors , Spatio-Temporal Analysis
ABSTRACT
BACKGROUND: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and poised to replace currently employed heuristics. RESULTS: We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob . CONCLUSIONS: The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.
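To see why GC content matters for motif enrichment, a deliberately simplified calculation helps. The sketch below is not the exact Markovian formula that motif_prob implements: it assumes independent bases (an order-0 background) and uses a plain Poisson tail that ignores motif self-overlap, which is exactly the kind of approximation the abstract contrasts with the exact method. All function names are hypothetical.

```python
import math


def motif_probability(motif, gc_content):
    """Probability of the motif at one position under independent bases."""
    p_gc = gc_content / 2          # assumes P(G) == P(C)
    p_at = (1 - gc_content) / 2    # assumes P(A) == P(T)
    prob = 1.0
    for base in motif.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unsupported base: {base!r}")
    return prob


def expected_count(motif, genome_length, gc_content):
    """Expected occurrences on one strand of a genome of the given length."""
    positions = max(genome_length - len(motif) + 1, 0)
    return positions * motif_probability(motif, gc_content)


def poisson_upper_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean); a rough enrichment p value."""
    if observed <= 0:
        return 1.0
    term = math.exp(-mean)   # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= mean / k     # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# The same GC-rich motif is far more likely in a GC-rich genome.
low_gc = expected_count("GCGCGC", 5_000_000, 0.35)
high_gc = expected_count("GCGCGC", 5_000_000, 0.65)
```

Because the expected count of the same motif shifts with base composition, a raw count that looks enriched in one organism may be unremarkable in another. Comparing p values rather than counts accounts for this.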
Subject(s)
Algorithms , Software
ABSTRACT
MOTIVATION: Oxford Nanopore Technologies (ONT) add miniaturization and real-time operation to high-throughput sequencing. All available software for ONT data analytics runs on cloud/clusters or personal computers. Instead, a linchpin to true portability is software that works on mobile devices independently of internet connections. Smartphones' and tablets' chipsets, memory and operating systems differ from those of desktop computers, but software can be recompiled. We sought to understand how portable current ONT analysis methods are. RESULTS: Several tools, from base-calling to genome assembly, were ported and benchmarked on an Android smartphone. Out of 23 programs, 11 succeeded. Recompilation failures included lack of standard headers and unsupported instruction sets. Only DSK, BCALM2 and Kraken were able to process files up to 16 GB, with linearly scaling CPU times. However, peak CPU temperatures were high. In conclusion, the portability scenario is not favorable. Given the fast market growth, attention of developers to ARM chipsets and Android/iOS is warranted, as well as initiatives to implement mobile-specific libraries. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at: https://github.com/marco-oliva/portable-nanopore-analytics.
Subject(s)
Nanopores , Benchmarking , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software
ABSTRACT
Despite improvements in antiretroviral therapy, human immunodeficiency virus type 1 (HIV-1)-associated neurocognitive disorders (HAND) remain prevalent in subjects undergoing therapy. HAND significantly affects individuals' quality of life, as well as adherence to therapy, and, despite the increasing understanding of neuropathogenesis, no definitive diagnostic or prognostic marker has been identified. We investigated transcriptomic profiles in frontal cortex tissues of simian immunodeficiency virus (SIV)-infected rhesus macaques sacrificed at different stages of infection. Gene expression was compared among SIV-infected animals (n = 11), with or without CD8+ lymphocyte depletion, based on detectable (n = 6) or non-detectable (n = 5) presence of the virus in frontal cortex tissues. Significant enrichment in activation of monocyte and macrophage cellular pathways was found in animals with detectable brain infection, independently of CD8+ lymphocyte depletion. In addition, transcripts of four poly (ADP-ribose) polymerases (PARPs) were up-regulated in the frontal cortex, which was confirmed by real-time polymerase chain reaction. Our results shed light on the involvement of PARPs in SIV infection of the brain and their role in SIV-associated neurodegenerative processes. PARPs may be an effective novel therapeutic target for HIV-related neuropathology, and their inhibition a possible treatment strategy.
Subject(s)
Cognition Disorders/virology , Frontal Lobe/metabolism , Frontal Lobe/virology , Poly(ADP-ribose) Polymerases/metabolism , Simian Acquired Immunodeficiency Syndrome/metabolism , Animals , Cognition Disorders/metabolism , Macaca mulatta , Male , Simian Acquired Immunodeficiency Syndrome/virology
ABSTRACT
Learning causal effects from observational data, e.g. estimating the effect of a treatment on survival by data-mining electronic health records (EHRs), can be biased due to unmeasured confounders, mediators, and colliders. When the causal dependencies among features/covariates are expressed in the form of a directed acyclic graph, do-calculus makes it possible to identify, under certain assumptions, one or more adjustment sets that eliminate the bias on a given causal query. However, prior knowledge of the causal structure might be only partial; algorithms for causal structure discovery often provide ambiguous solutions, and their computational complexity becomes practically intractable when the feature sets grow large. We hypothesize that the true causal effect of a query on an outcome can be approximated by an ensemble of lower-complexity estimators, namely bagged random causal networks. A bagged random causal network is an ensemble of subnetworks constructed by sampling the feature subspaces (with the query, the outcome, and a random number of other features), drawing conditional dependencies among the features, and inferring the corresponding adjustment sets. The causal effect can then be estimated by any regression function of the outcome on the query paired with the adjustment sets. Through simulations and a real-world clinical dataset (class III malocclusion data), we show that the bagged estimator is, in most cases, consistent with the true causal effect if the structure is known, has a good variance/bias trade-off when the structure is unknown (estimated using heuristics), has lower computational complexity than learning a full network, and outperforms boosted regression. In conclusion, the bagged random causal network is well-suited to estimate query-target causal effects from observational studies on EHR and other high-dimensional biomedical databases.
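The ensemble idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that all candidate features are measured before treatment, replaces the structure-drawing step with a simple correlation heuristic (a sampled feature enters the adjustment set when it correlates with both the query and the outcome), and uses ordinary least squares as the regression function. The function names, the threshold, and the simulated data are hypothetical.

```python
import numpy as np


def regression_effect(y, x, adjust):
    """Coefficient of the query x in a least-squares fit of y on [1, x, adjust]."""
    design = np.column_stack([np.ones(len(x)), x, adjust])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def bagged_causal_effect(y, x, features, n_networks=300, corr_threshold=0.1, seed=0):
    """Average the query effect over randomly sampled feature subspaces.

    Assumption (not from the paper): inside each subspace, a feature is
    treated as a common cause, and therefore adjusted for, when its absolute
    correlation with both x and y exceeds corr_threshold.
    """
    rng = np.random.default_rng(seed)
    n_features = features.shape[1]
    estimates = []
    for _ in range(n_networks):
        # Sample a random number of features for this subnetwork.
        size = rng.integers(1, n_features + 1)
        subset = rng.choice(n_features, size=size, replace=False)
        # Heuristic stand-in for drawing dependencies and inferring an adjustment set.
        adjust_idx = [
            j for j in subset
            if abs(np.corrcoef(features[:, j], x)[0, 1]) > corr_threshold
            and abs(np.corrcoef(features[:, j], y)[0, 1]) > corr_threshold
        ]
        adjust = features[:, np.array(adjust_idx, dtype=int)]
        estimates.append(regression_effect(y, x, adjust))
    return float(np.mean(estimates))


# Simulated data (hypothetical): one confounder, four irrelevant features,
# and a true effect of x on y equal to 1.0.
sim = np.random.default_rng(42)
n = 2000
confounder = sim.normal(size=n)
features = np.column_stack([confounder, sim.normal(size=(n, 4))])
x = confounder + sim.normal(size=n)
y = 1.0 * x + 2.0 * confounder + sim.normal(size=n)

naive = regression_effect(y, x, np.empty((n, 0)))   # no adjustment
full = regression_effect(y, x, features)            # adjust for every feature
bagged = bagged_causal_effect(y, x, features)
print(f"naive={naive:.2f} full={full:.2f} bagged={bagged:.2f}")
```

In this toy setting, subnetworks that happen to omit the confounder remain biased, so the bagged average only partly removes the confounding. The sketch shows the mechanics of sampling subspaces and averaging, not the estimator's reported performance.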
Subject(s)
Algorithms , Bias , Causality
ABSTRACT
An individual's health and conditions are associated with a complex interplay between the individual's genetics and his or her exposures to both internal and external environments. Much attention has been placed on characterizing the genome in the past; nevertheless, genetics accounts for only about 10% of an individual's health conditions, while the remainder appears to be determined by environmental factors and gene-environment interactions. To comprehensively understand the causes of diseases and prevent them, environmental exposures, especially the external exposome, need to be systematically explored. However, the heterogeneity of external exposome data sources (e.g., the same exposure variable named differently in different data sources or, vice versa, two variables with the same or a similar name that in reality measure different exposures) increases the difficulty of analyzing and understanding the associations between environmental exposures and health outcomes. To solve this issue, the development of semantic standards using an ontology-driven approach is essential because ontologies can (1) provide an unambiguous and consistent understanding of the variables in heterogeneous data sources, and (2) explicitly express and model the context of the variables and the relationships between those variables. We conducted a review of existing ontologies for the external exposome and found only four relevant ontologies. Further, the four existing ontologies are limited: they (1) often ignored the spatiotemporal characteristics of external exposome data, and (2) were developed in isolation from other conceptual frameworks (e.g., the socioecological model and the social determinants of health). Moving forward, the combination of multi-domain and multi-scale data (i.e., genome, phenome and exposome at different granularity) and different conceptual frameworks is the basis of health outcomes research in the future.
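The two naming problems described in this abstract can be made concrete with a small Python sketch of ontology-style harmonization. Everything in it is hypothetical: the concept identifiers, the source names, and the variable names are invented for illustration and do not come from any of the reviewed ontologies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A shared, unambiguous definition that source variables map onto."""
    concept_id: str   # hypothetical identifier, not from a real ontology
    label: str
    unit: str


PM25 = Concept("EXPO:0001", "Ambient PM2.5 concentration", "ug/m3")
HOUSEHOLD_INCOME = Concept("EXPO:0002", "Median household income", "USD/year")
PER_CAPITA_INCOME = Concept("EXPO:0003", "Per-capita income", "USD/year")

# Keys are (data source, variable name); both parts are needed because the
# same name can denote different exposures in different sources.
MAPPING = {
    ("source_a", "pm25"): PM25,
    ("source_b", "fine_particulate"): PM25,       # different name, same exposure
    ("source_a", "income"): HOUSEHOLD_INCOME,
    ("source_b", "income"): PER_CAPITA_INCOME,    # same name, different exposure
}


def harmonize(source: str, variable: str) -> Concept:
    """Resolve a source-specific variable to its shared concept."""
    return MAPPING[(source, variable)]


same = harmonize("source_a", "pm25") == harmonize("source_b", "fine_particulate")
different = harmonize("source_a", "income") != harmonize("source_b", "income")
print(same, different)
```

A full ontology would also encode relationships between concepts and their spatiotemporal context, which a flat lookup table like this one cannot express.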