Results 1 - 20 of 38
1.
Bioinformatics ; 37(6): 767-774, 2021 05 05.
Article in English | MEDLINE | ID: mdl-33051654

ABSTRACT

MOTIVATION: Circadian rhythms are approximately 24-h endogenous cycles that control many biological functions. To identify these rhythms, biological samples are taken over circadian time and analyzed using a single omics type, such as transcriptomics or proteomics. By comparing data from these single omics approaches, it has been shown that transcriptional rhythms are not necessarily conserved at the protein level, implying extensive circadian post-transcriptional regulation. However, as proteomics methods are known to be noisier than transcriptomic methods, this suggests that previously identified arrhythmic proteins with rhythmic transcripts could have been missed due to noise and may not be due to post-transcriptional regulation. RESULTS: To determine if one can use information from less-noisy transcriptomic data to inform rhythms in more-noisy proteomic data, and thus more accurately identify rhythms in the proteome, we have created the Multi-Omics Selection with Amplitude Independent Criteria (MOSAIC) application. MOSAIC combines model selection and joint modeling of multiple omics types to recover significant circadian and non-circadian trends. Using both synthetic data and proteomic data from Neurospora crassa, we showed that MOSAIC accurately recovers circadian rhythms at higher rates in not only the proteome but the transcriptome as well, outperforming existing methods for rhythm identification. In addition, by quantifying non-circadian trends in addition to circadian trends in data, our methodology allowed for the recognition of the diversity of circadian regulation as compared to non-circadian regulation. AVAILABILITY AND IMPLEMENTATION: MOSAIC's full interface is available at https://github.com/delosh653/MOSAIC. An R package for this functionality, mosaic.find, can be downloaded at https://CRAN.R-project.org/package=mosaic.find. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neurospora crassa , Proteomics , Circadian Rhythm/genetics , Neurospora crassa/genetics , Proteome , Transcriptome
2.
Bioinformatics ; 36(3): 773-781, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31384918

ABSTRACT

MOTIVATION: Time courses utilizing genome scale data are a common approach to identifying the biological pathways that are controlled by the circadian clock, an important regulator of organismal fitness. However, the methods used to detect circadian oscillations in these datasets are not able to accommodate changes in the amplitude of the oscillations over time, leading to an underestimation of the impact of the clock on biological systems. RESULTS: We have created a program to efficaciously identify oscillations in large-scale datasets, called the Extended Circadian Harmonic Oscillator application, or ECHO. ECHO utilizes an extended solution of the fixed amplitude oscillator that incorporates the amplitude change coefficient. Employing synthetic datasets, we determined that ECHO outperforms existing methods in detecting rhythms with decreasing oscillation amplitudes and in recovering phase shift. Rhythms with changing amplitudes identified from published biological datasets revealed distinct functions from those oscillations that were harmonic, suggesting purposeful biologic regulation to create this subtype of circadian rhythms. AVAILABILITY AND IMPLEMENTATION: ECHO's full interface is available at https://github.com/delosh653/ECHO. An R package for this functionality, echo.find, can be downloaded at https://CRAN.R-project.org/package=echo.find. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
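
As an illustration only, the idea of extending a fixed-amplitude oscillator with an amplitude change coefficient can be sketched as a cosine with an exponential envelope. The functional form, the parameter names (`amplitude`, `gamma`, `omega`, `phase`, `baseline`), and the `gamma / 2` scaling are assumptions made for this sketch; they are not taken from the abstract and may differ from the model implemented in ECHO or echo.find.

```python
import math


def extended_oscillator(t, amplitude, gamma, omega, phase, baseline):
    """Cosine whose envelope changes over time (illustrative form only).

    gamma is a hypothetical amplitude change coefficient:
    gamma > 0 damps the oscillation, gamma < 0 lets it grow,
    and gamma == 0 reduces to a fixed-amplitude cosine.
    """
    envelope = amplitude * math.exp(-gamma * t / 2.0)
    return envelope * math.cos(omega * t + phase) + baseline


# Assumed sampling design: a 24-h period sampled every 2 h over 48 h.
omega_24h = 2.0 * math.pi / 24.0
times = [2.0 * i for i in range(25)]

# Same rhythm with and without amplitude change.
harmonic = [extended_oscillator(t, 1.0, 0.0, omega_24h, 0.0, 0.0) for t in times]
damped = [extended_oscillator(t, 1.0, 0.05, omega_24h, 0.0, 0.0) for t in times]
```

With `gamma = 0` the peak one full period later is unchanged, whereas the damped series peaks lower after 24 h; a fixed-amplitude model has no parameter that can absorb that difference.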


Subjects
Circadian Clocks , Circadian Rhythm
3.
Methods ; 179: 101-110, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32446958

ABSTRACT

We propose a machine learning driven approach to derive insights from observational healthcare data to improve public health outcomes. Our goal is to simultaneously identify patient subpopulations with differing health risks and to find those risk factors within each subpopulation. We develop two supervised mixture of experts models: a Supervised Gaussian Mixture model (SGMM) for general features and a Supervised Bernoulli Mixture model (SBMM) tailored to binary features. We demonstrate the two approaches on an analysis of high cost drivers of Medicaid expenditures for inpatient stays. We focus on the three diagnostic categories that accounted for the highest percentage of inpatient expenditures in New York State (NYS) in 2016. When compared with state-of-the-art learning methods (random forests, boosting, neural networks), our approaches provide comparable prediction performance while also extracting insightful subpopulation structure and risk factors. For problems with binary features the proposed SBMM provides as good or better performance than alternative methods while offering insightful explanations. Our results indicate the promise of such approaches for extracting population health insights from electronic health care records.


Subjects
Information Storage and Retrieval/methods , Medical Informatics/methods , Population Health/statistics & numerical data , Supervised Machine Learning , Electronic Health Records/statistics & numerical data , Humans , Normal Distribution
4.
Entropy (Basel) ; 23(9), 2021 Sep 04.
Article in English | MEDLINE | ID: mdl-34573790

ABSTRACT

Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically-generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple in- and out-patient visits of patients, making it a time-series dataset which is often influenced by protected attributes like age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must "fairly" represent diverse minority subgroups such that the conclusions drawn on synthetic data are correct and the results can be generalized to real data. In this article, we develop two fairness metrics for synthetic data and examine all subgroups defined by protected attributes to analyze the bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels and thus, fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models to create more equitable synthetic healthcare datasets.

5.
J Chem Inf Model ; 53(12): 3352-66, 2013 Dec 23.
Article in English | MEDLINE | ID: mdl-24261543

ABSTRACT

Computational methods that can identify CYP-mediated sites of metabolism (SOMs) of drug-like compounds have become required tools for early stage lead optimization. In recent years, methods that combine CYP binding site features with CYP/ligand binding information have been sought in order to increase the prediction accuracy of such hybrid models over those that use only one representation. Two challenges that any hybrid ligand/structure-based method must overcome are (1) identification of the best binding pose for a specific ligand with a given CYP and (2) appropriately incorporating the results of docking with ligand reactivity. To address these challenges we have created Docking-Regioselectivity-Predictor (DR-Predictor)--a method that incorporates flexible docking-derived information with specialized electronic reactivity and multiple-instance-learning methods to predict CYP-mediated SOMs. In this study, the hybrid ligand-structure-based DR-Predictor method was tested on substrate sets for CYP 1A2 and CYP 2A6. For these data, the DR-Predictor model was found to identify the experimentally observed SOM within the top two predicted rank-positions for 86% of the 261 1A2 substrates and 83% of the 100 2A6 substrates. Given the accuracy and extendibility of the DR-Predictor method, we anticipate that it will further facilitate the prediction of CYP metabolism liabilities and aid in in-silico ADMET assessment of novel structures.


Subjects
Artificial Intelligence , Aryl Hydrocarbon Hydroxylases/chemistry , Cytochrome P-450 CYP1A2/chemistry , Molecular Docking Simulation , Small Molecule Libraries/chemistry , Aryl Hydrocarbon Hydroxylases/metabolism , Biotransformation , Catalytic Domain , Cytochrome P-450 CYP1A2/metabolism , Cytochrome P-450 CYP2A6 , Humans , Hydrogen Bonding , Hydrophobic and Hydrophilic Interactions , Ligands , Protein Binding , Small Molecule Libraries/metabolism , Structure-Activity Relationship , Substrate Specificity , Thermodynamics
6.
IEEE J Biomed Health Inform ; 27(2): 1084-1095, 2023 02.
Article in English | MEDLINE | ID: mdl-36355718

ABSTRACT

Randomized clinical trial (RCT) studies are the gold standard for scientific evidence on treatment benefits to patients. RCT outcomes may not be generalizable to clinical practice if the trial population is not representative of the patients for which the treatment is intended. Specifically, enrollment plans may not adequately include groups of patients with protected attributes, such as gender, race, or ethnicity. Inequities in RCTs are a major concern for funding agencies such as the National Institutes of Health (NIH) and for policy makers. We address this challenge by proposing a goal-programming approach, explicitly integrating measurable enrollment goals, to design equitable enrollment plans for RCTs. We evaluate our model in both single and multisite settings using the enrollment criteria and study population from the Systolic Blood Pressure Intervention Trial (SPRINT) study. Our model can successfully generate equitable enrollment plans that satisfy multiple goals such as sample representativeness and minimum total financial cost. Our model can detect deviations from a target plan during the enrollment process and update the plan to reduce deviations in the remaining process. Finally, through appropriate site selection in the planning stage, the model can demonstrate the possibility of enrolling a nationally representative study population if geographic constraints exist in multisite recruitment (e.g., clinical centers in a particular region). Our model can be used to prospectively produce and retrospectively evaluate how equitable enrollment plans are based on subjects' protected attributes, and it allows researchers to provide justifications on validity of scientific analysis and evaluation of subgroup disparities.


Subjects
Goals , Research Design , Humans
7.
AMIA Annu Symp Proc ; 2023: 530-539, 2023.
Article in English | MEDLINE | ID: mdl-38222411

ABSTRACT

Randomized Clinical Trials (RCTs) measure an intervention's efficacy, but they may not be generalizable to a desired target population if the RCT is not equitable. Thus, representativeness of RCTs has become a national priority. Synthetic Controls (SCs) that incorporate observational data into RCTs have shown great potential to produce more efficient studies, but their equity is rarely considered. Here, we examine how to improve treatment effect estimation and equity of a trial by augmenting "on-trial" concurrent controls with SCs to form a Hybrid Control Arm (HCA). We introduce FRESCA, a framework to evaluate HCA construction methods using RCT simulations. FRESCA shows that doing propensity and equity adjustment when constructing the HCA leads to accurate population treatment effect estimates while meeting equity goals with potentially fewer "on-trial" patients. This work represents the first investigation of equity in HCA design that provides definitions, metrics, compelling questions, and resources for future work.

8.
PLoS One ; 18(11): e0290692, 2023.
Article in English | MEDLINE | ID: mdl-37972008

ABSTRACT

Disparities in healthcare access and utilization associated with demographic and socioeconomic status hinder advancement of health equity. Thus, we designed a novel equity-focused approach to quantify variations of healthcare access/utilization from the expectation in national target populations. We additionally applied survey-weighted logistic regression models to identify factors associated with usage of a particular type of health care. To facilitate generation of analysis datasets, we built a National Health and Nutrition Examination Survey (NHANES) knowledge graph to help automate source-level dynamic analyses across different survey years and subjects' characteristics. We performed a cross-sectional subgroup disparity analysis of 2013-2018 NHANES on U.S. adults for receipt of diabetes treatments and vaccines against Hepatitis A (HAV), Hepatitis B (HBV), and Human Papillomavirus (HPV). Results show that in populations with hemoglobin A1c level ≥6%, patients with non-private insurance were less likely to receive newer and more beneficial antidiabetic medications; being Asian further exacerbated these disparities. For widely used drugs such as insulin, Asians experienced insignificant disparities in odds of prescription compared to White patients but received highly inadequate treatments with regard to their distribution in the U.S. diabetic population. Vaccination rates were associated with some demographic/socioeconomic factors but not others, and to different degrees for different diseases. For instance, while equity scores increase with rising education levels for HBV, they decrease with rising wealth levels for HPV. Among women vaccinated against HPV, minorities and poor communities usually received Cervarix while non-Hispanic White and higher-income groups received the more comprehensive Gardasil vaccine. Our study identified and quantified the impact of determinants of healthcare utilization for antidiabetic medications and vaccinations. Our new methods for semantics-aware disparity analysis of NHANES data could be readily generalized to other public health goals to support more rapid identification of disparities and development of policies, thus advancing health equity.


Subjects
Hepatitis A , Papillomavirus Infections , Adult , Humans , Female , United States , Nutrition Surveys , Cross-Sectional Studies , Papillomavirus Infections/prevention & control , Socioeconomic Factors , Health Services Accessibility , Healthcare Disparities , Hypoglycemic Agents , Demography
9.
Front Nutr ; 10: 1196520, 2023.
Article in English | MEDLINE | ID: mdl-37305078

ABSTRACT

Introduction and aims: Dietary Rational Gene Targeting (DRGT) is a therapeutic dietary strategy that uses healthy dietary agents to modulate the expression of disease-causing genes back toward the normal. Here we use the DRGT approach to (1) identify human studies assessing gene expression after ingestion of healthy dietary agents with an emphasis on whole foods, and (2) use this data to construct an online dietary guide app prototype toward eventually aiding patients, healthcare providers, community and researchers in treating and preventing numerous health conditions. Methods: We used the keywords "human", "gene expression" and separately, 51 different dietary agents with reported health benefits to search GEO, PubMed, Google Scholar, Clinical trials, Cochrane library, and EMBL-EBI databases for related studies. Studies meeting qualifying criteria were assessed for gene modulations. The R-Shiny platform was utilized to construct an interactive app called "Eat4Genes". Results: Fifty-one human ingestion studies (37 whole food related) and 96 key risk genes were identified. Human gene expression studies were found for 18 of 41 searched whole foods or extracts. App construction included the option to select either specific conditions/diseases or genes followed by food guide suggestions, key target genes, data sources and links, dietary suggestion rankings, bar chart or bubble chart visualization, optional full report, and nutrient categories. We also present user scenarios from physician and researcher perspectives. Conclusion: In conclusion, an interactive dietary guide app prototype has been constructed as a first step towards eventually translating our DRGT strategy into an innovative, low-cost, healthy, and readily translatable public resource to improve health.

10.
J Biomed Semantics ; 14(1): 8, 2023 07 18.
Article in English | MEDLINE | ID: mdl-37464259

ABSTRACT

BACKGROUND: Clinical decision support systems have been widely deployed to guide healthcare decisions on patient diagnosis, treatment choices, and patient management through evidence-based recommendations. These recommendations are typically derived from clinical practice guidelines created by clinical specialties or healthcare organizations. Although there have been many different technical approaches to encoding guideline recommendations into decision support systems, much of the previous work has not focused on enabling system generated recommendations through the formalization of changes in a guideline, the provenance of a recommendation, and applicability of the evidence. Prior work indicates that healthcare providers may not find that guideline-derived recommendations always meet their needs for reasons such as lack of relevance, transparency, time pressure, and applicability to their clinical practice. RESULTS: We introduce several semantic techniques that model diseases based on clinical practice guidelines, provenance of the guidelines, and the study cohorts they are based on to enhance the capabilities of clinical decision support systems. We have explored ways to enable clinical decision support systems with semantic technologies that can represent and link to details in related items from the scientific literature and quickly adapt to changing information from the guidelines, identifying gaps, and supporting personalized explanations. Previous semantics-driven clinical decision systems have limited support in all these aspects, and we present the ontologies and semantic web-based software tools in three distinct areas that are unified using a standard set of ontologies and a custom-built knowledge graph framework: (i) guideline modeling to characterize diseases, (ii) guideline provenance to attach evidence to treatment decisions from authoritative sources, and (iii) study cohort modeling to identify relevant research publications for complicated patients. CONCLUSIONS: We have enhanced existing, evidence-based knowledge by developing ontologies and software that enable clinicians to conveniently access updates to and provenance of guidelines, as well as gather additional information from research studies applicable to their patients' unique circumstances. Our software solutions leverage many well-used existing biomedical ontologies and build upon decades of knowledge representation and reasoning work, leading to explainable results.


Subjects
Biological Ontologies , Clinical Decision Support Systems , Humans , Software , Knowledge Bases , Publications
11.
RNA ; 16(5): 865-78, 2010 May.
Article in English | MEDLINE | ID: mdl-20360393

ABSTRACT

The use of free energy-based algorithms to compute RNA secondary structures produces, in general, large numbers of foldings. Recent research has addressed the problem of grouping structures into a small number of clusters and computing a representative folding for each cluster. At the heart of this problem is the need to compute a quantity that measures the difference between pairs of foldings. We introduce a new concept, the relaxed base-pair (RBP) score, designed to give a more biologically realistic measure of the difference between structures than the base-pair (BP) metric, which simply counts the number of base pairs in one structure but not the other. The degree of relaxation is determined by a single relaxation parameter, t. When t = 0 (no relaxation), our method is the same as the BP metric. At the other extreme, a very large value of t will give a distance of 0 for identical structures and 1 for structures that differ. Scores can be recomputed with different values of t, at virtually no extra computation cost, to yield satisfactory results. Our results indicate that relaxed measures give more stable and more meaningful clusters than the BP metric. We also use the RBP score to compute representative foldings for each cluster.
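
The base-pair (BP) metric that the RBP score relaxes can be sketched as the size of the symmetric difference between two sets of base pairs. The dot-bracket input format and the function names below are assumptions chosen for illustration; the relaxation step controlled by t is not reproduced, because the abstract does not define it.

```python
def parse_dot_bracket(structure):
    """Return the set of (i, j) base pairs encoded in a dot-bracket string."""
    stack, pairs = [], set()
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), j))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def bp_distance(structure_a, structure_b):
    """Count base pairs that occur in exactly one of the two foldings."""
    return len(parse_dot_bracket(structure_a) ^ parse_dot_bracket(structure_b))


# Two toy foldings of the same 6-nucleotide sequence.
folding_1 = "((..))"
folding_2 = "(....)"
distance = bp_distance(folding_1, folding_2)
```

Here the two foldings share the outer pair and differ only in the inner pair, so the BP distance is 1. Two helices shifted by a single position would share no pairs at all under this metric, which is the kind of behavior a relaxed score is meant to soften.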


Subjects
Base Pairing , Nucleic Acid Conformation , RNA/chemistry , Algorithms , Cluster Analysis , Computational Biology , Haloarcula/chemistry , Haloarcula/genetics , Humans , Methanobacteriaceae/chemistry , Methanobacteriaceae/genetics , Molecular Models , Phylogeny , RNA/genetics , RNA Stability , Archaeal RNA/chemistry , Archaeal RNA/genetics , Messenger RNA/chemistry , Messenger RNA/genetics , 5S Ribosomal RNA/chemistry , 5S Ribosomal RNA/genetics , Stochastic Processes , Thermodynamics
12.
J Chem Inf Model ; 52(6): 1637-59, 2012 Jun 25.
Article in English | MEDLINE | ID: mdl-22524152

ABSTRACT

RS-Predictor is a tool for creating pathway-independent, isozyme-specific, site of metabolism (SOM) prediction models using any set of known cytochrome P450 (CYP) substrates and metabolites. Until now, the RS-Predictor method was only trained and validated on CYP 3A4 data, but in the present study, we report on the versatility of the RS-Predictor modeling paradigm by creating and testing regioselectivity models for substrates of the nine most important CYP isozymes. Through curation of source literature, we have assembled 680 substrates distributed among CYPs 1A2, 2A6, 2B6, 2C19, 2C8, 2C9, 2D6, 2E1, and 3A4, the largest publicly accessible collection of P450 ligands and metabolites released to date. A comprehensive investigation into the importance of different descriptor classes for identifying the regioselectivity mediated by each isozyme is made through the generation of multiple independent RS-Predictor models for each set of isozyme substrates. Two of these models include a density functional theory (DFT) reactivity descriptor derived from SMARTCyp. Optimal combinations of RS-Predictor and SMARTCyp are shown to have stronger performance than either method alone, while also exceeding the accuracy of the commercial regioselectivity prediction methods distributed by Optibrium and Schrödinger, correctly identifying a large proportion of the metabolites in each substrate set within the top two rank-positions: 1A2 (83.0%), 2A6 (85.7%), 2B6 (82.1%), 2C19 (86.2%), 2C8 (83.8%), 2C9 (84.5%), 2D6 (85.9%), 2E1 (82.8%), 3A4 (82.3%), and merged (86.0%). Comprehensive datamining of each substrate set and careful statistical analyses of the predictions made by the different models revealed new insights into molecular features that control metabolic regioselectivity and enable accurate prospective prediction of likely SOMs.


Subjects
Cytochrome P-450 Enzyme System/metabolism , Isoenzymes/metabolism , Substrate Specificity
13.
G3 (Bethesda) ; 12(9), 2022 08 25.
Article in English | MEDLINE | ID: mdl-35876788

ABSTRACT

Circadian rhythms broadly regulate physiological functions by tuning oscillations in the levels of mRNAs and proteins to the 24-h day/night cycle. Globally assessing which mRNAs and proteins are timed by the clock necessitates accurate recognition of oscillations in RNA and protein data, particularly in large omics data sets. Tools that employ fixed-amplitude models have previously been used to positive effect. However, the recognition of amplitude change in circadian oscillations required a new generation of analytical software to enhance the identification of these oscillations. To address this gap, we created the Pipeline for Amplitude Integration of Circadian Exploration suite. Here, we demonstrate the Pipeline for Amplitude Integration of Circadian Exploration suite's increased utility to detect circadian trends through the joint modeling of the Mus musculus macrophage transcriptome and proteome. Our enhanced detection confirmed extensive circadian posttranscriptional regulation in macrophages but highlighted that some of the reported discrepancy between mRNA and protein oscillations was due to noise in data. We further applied the Pipeline for Amplitude Integration of Circadian Exploration suite to investigate the circadian timing of noncoding RNAs, documenting extensive circadian timing of long noncoding RNAs and small nuclear RNAs, which control the recognition of mRNA in the spliceosome complex. By tracking oscillating spliceosome complex proteins using the PAICE suite, we noted that the clock broadly regulates the spliceosome, particularly the major spliceosome complex. As most of the above-noted rhythms had damped amplitude changes in their oscillations, this work highlights the importance of the PAICE suite in the thorough enumeration of oscillations in omics-scale datasets.


Subjects
Circadian Clocks , Spliceosomes , Animals , Circadian Clocks/genetics , Circadian Rhythm/genetics , Gene Expression Regulation , Macrophages/metabolism , Mice , Messenger RNA/genetics , Messenger RNA/metabolism , Untranslated RNA , Spliceosomes/genetics , Spliceosomes/metabolism
14.
BMC Genomics ; 12 Suppl 2: S1, 2011.
Article in English | MEDLINE | ID: mdl-21988942

ABSTRACT

BACKGROUND: Strains of Mycobacterium tuberculosis complex (MTBC) can be classified into major lineages based on their genotype. Further subdivision of major lineages into sublineages requires multiple biomarkers along with methods to combine and analyze multiple sources of information in one unsupervised learning model. Typically, spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive units (MIRU) are used for TB genotyping and surveillance. Here, we examine the sublineage structure of MTBC strains with multiple biomarkers simultaneously, by employing a tensor clustering framework (TCF) on multiple-biomarker tensors. RESULTS: Simultaneous analysis of the spoligotype and MIRU type of strains using TCF on multiple-biomarker tensors leads to coherent sublineages of major lineages with clear and distinctive spoligotype and MIRU signatures. Comparison of tensor sublineages with SpolDB4 families either supports tensor sublineages, or suggests subdivision or merging of SpolDB4 families. High prediction accuracy of major lineage classification with supervised tensor learning on multiple-biomarker tensors validates our unsupervised analysis of sublineages on multiple-biomarker tensors. CONCLUSIONS: TCF on multiple-biomarker tensors achieves simultaneous analysis of multiple biomarkers and suggests a new putative sublineage structure for each major lineage. Analysis of multiple-biomarker tensors gives insight into the sublineage structure of MTBC at the genomic level.


Subjects
Biomarkers/analysis , Bacterial Genome , Interspersed Repetitive Sequences , Statistical Models , Mycobacterium tuberculosis/classification , Algorithms , Cluster Analysis , DNA Fingerprinting/methods , Genetic Loci , Minisatellite Repeats , Mycobacterium tuberculosis/genetics , Phylogeny , Genetic Polymorphism , Sequence Deletion
15.
J Chem Inf Model ; 51(7): 1667-89, 2011 Jul 25.
Article in English | MEDLINE | ID: mdl-21528931

ABSTRACT

This article describes RegioSelectivity-Predictor (RS-Predictor), a new in silico method for generating predictive models of P450-mediated metabolism for drug-like compounds. Within this method, potential sites of metabolism (SOMs) are represented as "metabolophores," a concept that describes the hierarchical combination of topological and quantum chemical descriptors needed to represent the reactivity of potential metabolic reaction sites. RS-Predictor modeling involves the use of metabolophore descriptors together with multiple-instance ranking (MIRank) to generate an optimized descriptor weight vector that encodes regioselectivity trends across all cases in a training set. The resulting pathway-independent (O-dealkylation vs N-oxidation vs Csp(3) hydroxylation, etc.), isozyme-specific regioselectivity model may be used to predict potential metabolic liabilities. In the present work, cross-validated RS-Predictor models were generated for a set of 394 substrates of CYP 3A4 as a proof-of-principle for the method. Rank aggregation was then employed to merge independently generated predictions for each substrate into a single consensus prediction. The resulting consensus RS-Predictor models were shown to reliably identify at least one observed site of metabolism in the top two rank-positions on 78% of the substrates. Comparisons between RS-Predictor and previously described regioselectivity prediction methods reveal new insights into how in silico metabolite prediction methods should be compared.
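
The evaluation criterion quoted above (at least one observed SOM within the top two rank-positions) can be sketched as a top-k hit rate. The data layout, the atom labels, and the function name are hypothetical and serve only to make the criterion concrete; this is not RS-Predictor code, and the toy numbers are not results from the paper.

```python
def top_k_hit_rate(ranked_sites, observed_sites, k=2):
    """Fraction of substrates with an observed site among the top-k predictions.

    ranked_sites:   dict mapping substrate -> list of site ids, best first
    observed_sites: dict mapping substrate -> set of observed site ids
    """
    if not ranked_sites:
        raise ValueError("no substrates to score")
    hits = 0
    for substrate, ranking in ranked_sites.items():
        # A substrate counts once, however many observed sites are recovered.
        if set(ranking[:k]) & observed_sites[substrate]:
            hits += 1
    return hits / len(ranked_sites)


# Hypothetical toy data: four substrates with made-up atom labels.
predictions = {
    "substrate_a": ["C7", "N1", "C3"],
    "substrate_b": ["C2", "C5", "O1"],
    "substrate_c": ["N4", "C1", "C9"],
    "substrate_d": ["C6", "C8", "C2"],
}
observed = {
    "substrate_a": {"N1"},
    "substrate_b": {"O1"},
    "substrate_c": {"N4", "C9"},
    "substrate_d": {"C6"},
}
rate = top_k_hit_rate(predictions, observed)
```

In this toy set three of the four substrates have an observed site in the top two positions, so the rate is 0.75. Because the criterion counts substrates and not sites, a substrate with several observed SOMs is no easier or harder to score than one with a single SOM once any one of them is ranked highly.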


Subjects
Cytochrome P-450 CYP3A , Molecular Models , Acetaminophen/chemistry , Acetaminophen/metabolism , Binding Sites , Cytochrome P-450 CYP3A/chemistry , Cytochrome P-450 CYP3A/metabolism , Molecular Structure , Stereoisomerism , Warfarin/chemistry , Warfarin/metabolism
16.
J Chem Inf Model ; 51(11): 2808-20, 2011 Nov 28.
Article in English | MEDLINE | ID: mdl-21999408

ABSTRACT

Least-squares fitting of the Hill equation to quantitative high-throughput screening (qHTS) assays results in frequent unsatisfactory fits. We learn and exploit prior knowledge to improve the Hill fitting in a nonlinear regression method called domain knowledge fitter (DK-fitter). This paper formulates and solves DK-fitter for 44 public qHTS data sets. This new Hill parameter estimation technique is validated using three unbiased approaches, including a novel method that involves generating simulated samples. This paper fosters the extraction of higher quality information from screens for improved potency evaluation.
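
For orientation, a common four-parameter form of the Hill equation and the ordinary least-squares objective can be sketched as follows. The parameterization (`bottom`, `top`, `ac50`, `hill_slope`) is one standard convention assumed here, not necessarily the one used in the paper, and the prior-knowledge terms of DK-fitter are not reproduced because the abstract does not specify them.

```python
def hill_response(concentration, bottom, top, ac50, hill_slope):
    """Four-parameter Hill curve; concentration and ac50 must be positive."""
    return bottom + (top - bottom) / (1.0 + (ac50 / concentration) ** hill_slope)


def sum_squared_residuals(params, concentrations, responses):
    """Plain least-squares objective for a candidate parameter tuple."""
    bottom, top, ac50, hill_slope = params
    return sum(
        (hill_response(c, bottom, top, ac50, hill_slope) - r) ** 2
        for c, r in zip(concentrations, responses)
    )


# Noise-free toy titration generated from known parameters.
concentrations = [0.01, 0.1, 1.0, 10.0, 100.0]
true_params = (0.0, 100.0, 1.0, 1.0)
responses = [hill_response(c, *true_params) for c in concentrations]

loss_at_truth = sum_squared_residuals(true_params, concentrations, responses)
loss_shifted = sum_squared_residuals((0.0, 100.0, 5.0, 1.0), concentrations, responses)
```

On clean data the objective is zero at the generating parameters and positive elsewhere. With noisy responses or a titration that does not reach both plateaus, several parameter combinations can give similar losses, which is one plausible reason why unconstrained fits are often unsatisfactory.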


Subjects
Computational Biology/methods , High-Throughput Screening Assays , Chemical Models , Computational Biology/statistics & numerical data , Drug Design , Enzyme Inhibitors/pharmacology , Pyruvate Kinase/antagonists & inhibitors , Pyruvate Kinase/metabolism , Quantitative Structure-Activity Relationship , Regression Analysis
17.
JAMIA Open ; 4(3): ooab077, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34568771

ABSTRACT

OBJECTIVE: We help identify subpopulations underrepresented in randomized clinical trials (RCTs) cohorts with respect to national, community-based or health system target populations by formulating population representativeness of RCTs as a machine learning (ML) fairness problem, deriving new representation metrics, and deploying them in easy-to-understand interactive visualization tools. MATERIALS AND METHODS: We represent RCT cohort enrollment as random binary classification fairness problems, and then show how ML fairness metrics based on enrollment fraction can be efficiently calculated using easily computed rates of subpopulations in RCT cohorts and target populations. We propose standardized versions of these metrics and deploy them in an interactive tool to analyze 3 RCTs with respect to type 2 diabetes and hypertension target populations in the National Health and Nutrition Examination Survey. RESULTS: We demonstrate how the proposed metrics and associated statistics enable users to rapidly examine representativeness of all subpopulations in the RCT defined by a set of categorical traits (eg, gender, race, ethnicity, smoking status, and blood pressure) with respect to target populations. DISCUSSION: The normalized metrics provide an intuitive standardized scale for evaluating representation across subgroups, which may have vastly different enrollment fractions and rates in RCT study cohorts. The metrics are beneficial complements to other approaches (eg, enrollment fractions) used to identify generalizability and health equity of RCTs. CONCLUSION: By quantifying the gaps between RCT and target populations, the proposed methods can support generalizability evaluation of existing RCT cohorts. The interactive visualization tool can be readily applied to identify underrepresented subgroups with respect to any desired source or target populations.
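
The underlying comparison of subgroup rates in an RCT cohort and in a target population can be sketched with a plain rate ratio. This ratio is a stand-in chosen for illustration; it is not one of the standardized fairness metrics proposed in the paper, and the subgroup labels and counts are invented.

```python
def subgroup_rates(counts):
    """Convert subgroup counts into shares of the whole group."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {group: n / total for group, n in counts.items()}


def rate_ratios(cohort_counts, target_counts):
    """Cohort share divided by target share for every target subgroup.

    Values below 1 indicate a subgroup that is less common in the
    cohort than in the target population.
    """
    cohort = subgroup_rates(cohort_counts)
    target = subgroup_rates(target_counts)
    return {
        group: cohort.get(group, 0.0) / share
        for group, share in target.items()
        if share > 0
    }


# Invented counts for a single categorical trait.
trial_counts = {"female": 300, "male": 700}
population_counts = {"female": 5000, "male": 5000}
ratios = rate_ratios(trial_counts, population_counts)
```

In this made-up example the trial enrolls women at 0.6 times their share of the target population. A raw ratio like this is hard to compare across subgroups of very different sizes, which is the motivation the abstract gives for standardized metrics.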

18.
AMIA Jt Summits Transl Sci Proc ; 2021: 555-564, 2021.
Article in English | MEDLINE | ID: mdl-34457171

ABSTRACT

In this exploratory study, we scrutinize a database of over one million tweets collected from March to July 2020 to illustrate public attitudes towards mask usage during the COVID-19 pandemic. We employ natural language processing, clustering and sentiment analysis techniques to organize tweets relating to mask-wearing into high-level themes, then relay narratives for each theme using automatic text summarization. In recent months, a body of literature has highlighted the robustness of trends in online activity as proxies for the sociological impact of COVID-19. We find that topic clustering based on mask-related Twitter data offers revealing insights into societal perceptions of COVID-19 and techniques for its prevention. We observe that the volume and polarity of mask-related tweets have greatly increased. Importantly, the analysis pipeline presented may be leveraged by the health community for qualitative assessment of public response to health intervention techniques in real time.


Subjects
COVID-19 , Social Media , Humans , Masks , Natural Language Processing , Pandemics , SARS-CoV-2
19.
BMC Bioinformatics ; 11 Suppl 3: S4, 2010 Apr 29.
Article in English | MEDLINE | ID: mdl-20438651

ABSTRACT

BACKGROUND: We present a novel conformal Bayesian network (CBN) to classify strains of Mycobacterium tuberculosis Complex (MTBC) into six major genetic lineages based on two high-throughput biomarkers: mycobacterial interspersed repetitive units (MIRU) and spacer oligonucleotide typing (spoligotyping). MTBC is the causative agent of tuberculosis (TB), which remains one of the leading causes of disease and morbidity worldwide. DNA fingerprinting methods such as MIRU and spoligotyping are key components in the control and tracking of modern TB. RESULTS: CBN is designed to exploit background knowledge about MTBC biomarkers. It can be trained on large historical TB databases of various subsets of MTBC biomarkers. Because not all biomarkers may be available during TB control efforts, CBN is designed to predict the major lineage of isolates genotyped by any combination of the PCR-based typing methods: spoligotyping and MIRU typing. CBN achieves high accuracy on three large MTBC collections consisting of over 34,737 isolates genotyped by different combinations of spoligotypes, 12 loci of MIRU, and 24 loci of MIRU. CBN captures distinct MIRU and spoligotype signatures associated with each lineage, explaining its excellent performance. Visualization of MIRU and spoligotype signatures yields insight into both how the model works and the genetic diversity of MTBC. CONCLUSIONS: CBN conforms to the available PCR-based biological markers and achieves high performance in identifying major lineages of MTBC. The method can be readily extended as new biomarkers are introduced for TB tracking and control. An online tool (http://www.cs.rpi.edu/~bennek/tbinsight/tblineage) makes the CBN model available for TB control and research efforts.


Subjects
Bayes Theorem , Biomarkers/analysis , Computational Biology/methods , DNA, Bacterial/genetics , Mycobacterium tuberculosis/classification , Mycobacterium tuberculosis/genetics , Algorithms , DNA Fingerprinting/methods , DNA, Intergenic/genetics , Databases, Genetic , Humans , Internet , Interspersed Repetitive Sequences/genetics , Nucleic Acid Amplification Techniques/methods , Software , Tuberculosis/microbiology
20.
IEEE J Biomed Health Inform ; 24(3): 916-925, 2020 03.
Article in English | MEDLINE | ID: mdl-31107669

ABSTRACT

We consider the problem in precision health of grouping people into subpopulations based on their degree of vulnerability to a risk factor. These subpopulations cannot be discovered with traditional clustering techniques because their quality is evaluated with a supervised metric: the ease of modeling a response variable for observations within them. Instead, we apply the more appropriate supervised cadre model (SCM). We extend the SCM formalism so that it may be applied to multivariate regression and binary classification problems, and we develop a way to use conditional entropy to assess the confidence in the process by which a subject is assigned their cadre. Using the SCM, we generalize the environment-wide association study (EWAS) to be able to model heterogeneity in population risk. In our EWAS, we consider more than 200 environmental exposure factors and find their association with diastolic blood pressure, systolic blood pressure, and hypertension. This requires adapting the SCM to be applicable to data generated by a complex survey design. After correcting for false positives, we found 25 exposure variables that had a significant association with at least one of our response variables. Eight of these were significant for a discovered subpopulation but not for the overall population. Some of these associations have been identified by previous researchers, whereas others appear to be novel. We examine discovered subpopulations in detail, finding that they are interpretable and suggestive of further research questions.


Subjects
Computational Biology/methods , Hypertension/epidemiology , Models, Statistical , Supervised Machine Learning , Big Data , Environment , Humans , Knowledge Discovery , Nutrition Surveys , Risk Factors