1.
J Am Med Inform Assoc ; 29(1): 97-108, 2021 12 28.
Article in English | MEDLINE | ID: mdl-34791282

ABSTRACT

OBJECTIVE: Clinical registries (structured databases of demographic, diagnosis, and treatment information) play vital roles in retrospective studies, operational planning, and assessment of patient eligibility for research, including clinical trials. Registry curation, a manual and time-intensive process, is always costly and often impossible for rare or underfunded diseases. Our goal was to evaluate the feasibility of natural language inference (NLI) as a scalable solution for registry curation. MATERIALS AND METHODS: We applied five state-of-the-art, pretrained, deep learning-based NLI models to clinical, laboratory, and pathology notes to infer information about 43 different breast oncology registry fields. Model inferences were evaluated against a manually curated, 7439-patient breast oncology research database. RESULTS: NLI models showed considerable variation in performance, both within and across fields. One model, ALBERT, outperformed the others (BART, RoBERTa, XLNet, and ELECTRA) on 22 out of 43 fields. A detailed error analysis revealed that incorrect inferences primarily arose through models' tendency to misinterpret historical findings, as well as confusion based on abbreviations and subtle term variants common in clinical text. DISCUSSION AND CONCLUSION: Traditional natural language processing methods require specially annotated training sets or the construction of a separate model for each registry field. In contrast, a single pretrained NLI model can curate dozens of different fields simultaneously. Surprisingly, NLI methods remain unexplored in the clinical domain outside the realm of shared tasks and benchmarks. Modern NLI models could increase the efficiency of registry curation, even when applied "out of the box" with no additional training.


Subjects
Natural Language Processing , Databases, Factual , Humans , Registries , Retrospective Studies
2.
Annu Rev Biomed Data Sci ; 4: 165-187, 2021 07 20.
Article in English | MEDLINE | ID: mdl-34465177

ABSTRACT

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry.


Subjects
Data Mining , Physicians , Electronic Health Records , Humans , Machine Learning , Time
3.
Crit Care Explor ; 3(3): e0355, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33655216

ABSTRACT

Acute hypoxemic respiratory failure is the major complication of coronavirus disease 2019, yet optimal respiratory support strategies are uncertain. We aimed to describe outcomes with high-flow oxygen delivered through nasal cannula and noninvasive positive pressure ventilation in coronavirus disease 2019 acute hypoxemic respiratory failure and identify individual factors associated with noninvasive respiratory support failure. DESIGN: Retrospective cohort study to describe rates of high-flow oxygen delivered through nasal cannula and/or noninvasive positive pressure ventilation success (live discharge without endotracheal intubation). Fine-Gray subdistribution hazard models were used to identify patient characteristics associated with high-flow oxygen delivered through nasal cannula and/or noninvasive positive pressure ventilation failure (endotracheal intubation and/or in-hospital mortality). SETTING: One large academic health system, including five hospitals (one quaternary referral center, a tertiary hospital, and three community hospitals), in New York City. PATIENTS: All hospitalized adults 18-100 years old with coronavirus disease 2019 admitted between March 1, 2020, and April 28, 2020. INTERVENTIONS: None. MEASUREMENTS AND MAIN RESULTS: A total of 331 and 747 patients received high-flow oxygen delivered through nasal cannula and noninvasive positive pressure ventilation as the highest level of noninvasive respiratory support, respectively; 154 (46.5%) in the high-flow oxygen delivered through nasal cannula cohort and 167 (22.4%) in the noninvasive positive pressure ventilation cohort were successfully discharged without requiring endotracheal intubation. 
In adjusted models, significantly increased risk of high-flow oxygen delivered through nasal cannula and noninvasive positive pressure ventilation failure was seen among patients with cardiovascular disease (subdistribution hazard ratio, 1.82; 95% CI, 1.17-2.83 and subdistribution hazard ratio, 1.40; 95% CI, 1.06-1.84, respectively). Conversely, a higher peripheral blood oxygen saturation to Fio2 ratio at high-flow oxygen delivered through nasal cannula and noninvasive positive pressure ventilation initiation was associated with reduced risk of failure (subdistribution hazard ratio, 0.32; 95% CI, 0.19-0.54, and subdistribution hazard ratio 0.34; 95% CI, 0.21-0.55, respectively). CONCLUSIONS: A significant proportion of patients receiving noninvasive respiratory modalities for coronavirus disease 2019 acute hypoxemic respiratory failure achieved successful hospital discharge without requiring endotracheal intubation, with lower success rates among those with comorbid cardiovascular disease or more severe hypoxemia. The role of high-flow oxygen delivered through nasal cannula and noninvasive positive pressure ventilation in coronavirus disease 2019-related acute hypoxemic respiratory failure warrants further consideration.
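
The success percentages reported above follow directly from the stated counts. A minimal arithmetic check, in which the function name and the `hfnc`/`nippv` variable names are illustrative shorthand rather than terms from the abstract:

```python
def success_rate(successes, cohort_size):
    """Return the success proportion as a percentage rounded to one decimal."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return round(100.0 * successes / cohort_size, 1)


# Counts taken from the abstract; variable names are illustrative shorthand.
hfnc_rate = success_rate(154, 331)   # high-flow nasal cannula cohort
nippv_rate = success_rate(167, 747)  # noninvasive positive pressure ventilation cohort
```

Both values agree with the 46.5% and 22.4% quoted in the abstract.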

4.
Genet Med ; 23(3): 576-580, 2021 03.
Article in English | MEDLINE | ID: mdl-33060835

ABSTRACT

PURPOSE: Rare genetic conditions like Down syndrome (DS) are historically understudied. Infection is a leading cause of mortality in DS, along with cardiac anomalies. Currently, it is unknown how the COVID-19 pandemic affects individuals with DS. Herein, we report an analysis of individuals with DS who were hospitalized with COVID-19 in New York, New York, USA. METHODS: In this retrospective, dual-center study of 7246 patients hospitalized with COVID-19, we analyzed all patients with DS admitted in the Mount Sinai Health System and Columbia University Irving Medical Center. We assessed hospitalization rates, clinical characteristics, and outcomes. RESULTS: We identified 12 patients with DS. Hospitalized individuals with DS are on average ten years younger than patients without DS. Patients with DS have more severe disease than controls, particularly an increased incidence of sepsis and mechanical ventilation. CONCLUSION: We demonstrate that individuals with DS who are hospitalized with COVID-19 are younger than their non-DS counterparts, and that they have more severe disease than age-matched controls. We conclude that particular care should be considered for both the prevention and treatment of COVID-19 in these patients.


Subjects
COVID-19/pathology , Down Syndrome , Adult , Comorbidity , Down Syndrome/complications , Female , Hospitalization , Humans , Male , Middle Aged , New York/epidemiology , Pandemics , Retrospective Studies
5.
J Med Internet Res ; 22(11): e24018, 2020 11 06.
Article in English | MEDLINE | ID: mdl-33027032

ABSTRACT

BACKGROUND: COVID-19 has infected millions of people worldwide and is responsible for several hundred thousand fatalities. The COVID-19 pandemic has necessitated thoughtful resource allocation and early identification of high-risk patients. However, effective methods to meet these needs are lacking. OBJECTIVE: The aims of this study were to analyze the electronic health records (EHRs) of patients who tested positive for COVID-19 and were admitted to hospitals in the Mount Sinai Health System in New York City; to develop machine learning models for making predictions about the hospital course of the patients over clinically meaningful time horizons based on patient characteristics at admission; and to assess the performance of these models at multiple hospitals and time points. METHODS: We used Extreme Gradient Boosting (XGBoost) and baseline comparator models to predict in-hospital mortality and critical events at time windows of 3, 5, 7, and 10 days from admission. Our study population included harmonized EHR data from five hospitals in New York City for 4098 COVID-19-positive patients admitted from March 15 to May 22, 2020. The models were first trained on patients from a single hospital (n=1514) before or on May 1, externally validated on patients from four other hospitals (n=2201) before or on May 1, and prospectively validated on all patients after May 1 (n=383). Finally, we established model interpretability to identify and rank variables that drive model predictions. RESULTS: Upon cross-validation, the XGBoost classifier outperformed baseline models, with an area under the receiver operating characteristic curve (AUC-ROC) for mortality of 0.89 at 3 days, 0.85 at 5 and 7 days, and 0.84 at 10 days. XGBoost also performed well for critical event prediction, with an AUC-ROC of 0.80 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. 
In external validation, XGBoost achieved an AUC-ROC of 0.88 at 3 days, 0.86 at 5 days, 0.86 at 7 days, and 0.84 at 10 days for mortality prediction. Similarly, the unimputed XGBoost model achieved an AUC-ROC of 0.78 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. Trends in performance on prospective validation sets were similar. At 7 days, acute kidney injury on admission, elevated LDH, tachypnea, and hyperglycemia were the strongest drivers of critical event prediction, while higher age, anion gap, and C-reactive protein were the strongest drivers of mortality prediction. CONCLUSIONS: We externally and prospectively trained and validated machine learning models for mortality and critical events for patients with COVID-19 at different time horizons. These models identified at-risk patients and uncovered underlying relationships that predicted outcomes.
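
For readers new to the AUC-ROC metric reported above, it can be read as the probability that a randomly chosen patient who had the event receives a higher predicted score than a randomly chosen patient who did not. The sketch below computes it by pairwise comparison in plain Python; the toy labels and scores are invented for illustration and are not data from the study, and this is not the authors' evaluation code.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = event within the time window, 0 = no event.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of patients, so it suits small illustrations; a rank-based implementation would be the usual choice for cohorts of several thousand.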


Subjects
Coronavirus Infections/diagnosis , Coronavirus Infections/mortality , Machine Learning/standards , Pneumonia, Viral/diagnosis , Pneumonia, Viral/mortality , Acute Kidney Injury/epidemiology , Adolescent , Adult , Aged , Aged, 80 and over , Betacoronavirus , COVID-19 , Cohort Studies , Electronic Health Records , Female , Hospital Mortality , Hospitalization/statistics & numerical data , Hospitals , Humans , Male , Middle Aged , New York City/epidemiology , Pandemics , Prognosis , ROC Curve , Risk Assessment/methods , Risk Assessment/standards , SARS-CoV-2 , Young Adult
6.
JMIR Med Inform ; 8(2): e16878, 2020 Feb 27.
Article in English | MEDLINE | ID: mdl-32130159

ABSTRACT

BACKGROUND: Acute and chronic low back pain (LBP) are different conditions with different treatments. However, they are coded in electronic health records with the same International Classification of Diseases, 10th revision (ICD-10) code (M54.5) and can be differentiated only by retrospective chart reviews. This prevents an efficient definition of data-driven guidelines for billing and therapy recommendations, such as return-to-work options. OBJECTIVE: The objective of this study was to evaluate the feasibility of automatically distinguishing acute LBP episodes by analyzing free-text clinical notes. METHODS: We used a dataset of 17,409 clinical notes from different primary care practices; of these, 891 documents were manually annotated as acute LBP and 2973 were generally associated with LBP via the recorded ICD-10 code. We compared different supervised and unsupervised strategies for automated identification: keyword search, topic modeling, logistic regression with bag of n-grams and manual features, and deep learning (a convolutional neural network-based architecture [ConvNet]). We trained the supervised models using either manual annotations or ICD-10 codes as positive labels. RESULTS: ConvNet trained using manual annotations obtained the best results with an area under the receiver operating characteristic curve of 0.98 and an F score of 0.70. ConvNet's results were also robust to reduction of the number of manually annotated documents. In the absence of manual annotations, topic models performed better than methods trained using ICD-10 codes, which were unsatisfactory for identifying LBP acuity. CONCLUSIONS: This study uses clinical notes to delineate a potential path toward systematic learning of therapeutic strategies, billing guidelines, and management options for acute LBP at the point of care.

7.
Sensors (Basel) ; 20(5), 2020 Mar 03.
Article in English | MEDLINE | ID: mdl-32138289

ABSTRACT

Sleep quality has been directly linked to cognitive function, quality of life, and a variety of serious diseases across many clinical domains. Standard methods for assessing sleep involve overnight studies in hospital settings, which are uncomfortable, expensive, not representative of real sleep, and difficult to conduct on a large scale. Recently, numerous commercial digital devices have been developed that record physiological data, such as movement, heart rate, and respiratory rate, which can act as a proxy for sleep quality in lieu of standard electroencephalogram recording equipment. The sleep-related output metrics from these devices include sleep staging and total sleep duration and are derived via proprietary algorithms that utilize a variety of these physiological recordings. Each device company makes different claims of accuracy and measures different features of sleep quality, and it is still unknown how well these devices correlate with one another and perform in a research setting. In this pilot study of 21 participants, we investigated whether sleep metric outputs from self-reported sleep metrics (SRSMs) and four sensors, specifically Fitbit Surge (a smart watch), Withings Aura (a sensor pad that is placed under a mattress), Hexoskin (a smart shirt), and Oura Ring (a smart ring), were related to known cognitive and psychological metrics, including the n-back test and Pittsburgh Sleep Quality Index (PSQI). We analyzed correlation between multiple device-related sleep metrics. Furthermore, we investigated relationships between these sleep metrics and cognitive scores across different timepoints and SRSM through univariate linear regressions. We found that correlations for sleep metrics between the devices across the sleep cycle were almost uniformly low, but still significant (P < 0.05). For cognitive scores, we found the Withings latency was statistically significant for afternoon and evening timepoints at P = 0.016 and P = 0.013. 
We did not find any significant associations between SRSMs and PSQI or cognitive scores. Additionally, Oura Ring's total sleep duration and efficiency in relation to the PSQI measure was statistically significant at P = 0.004 and P = 0.033, respectively. These findings can hopefully be used to guide future sensor-based sleep research.


Subjects
Environment , Sleep/physiology , Adult , Cognition , Female , Humans , Male , Pilot Projects , Self Report , Sleep Stages/physiology , Young Adult
8.
JMIR Res Protoc ; 9(1): e16362, 2020 Jan 08.
Article in English | MEDLINE | ID: mdl-31913135

ABSTRACT

BACKGROUND: N-of-1 trials promise to help individuals make more informed decisions about treatment selection through structured experiments that compare treatment effectiveness by alternating treatments and measuring their impacts in a single individual. We created a digital platform that automates the design, administration, and analysis of N-of-1 trials. Our first N-of-1 trial, the app-based Brain Boost Study, invited individuals to compare the impacts of two commonly consumed substances (caffeine and L-theanine) on their cognitive performance. OBJECTIVE: The purpose of this study is to evaluate critical factors that may impact the completion of N-of-1 trials to inform the design of future app-based N-of-1 trials. We will measure study completion rates for participants that begin the Brain Boost Study and assess their associations with study duration (5, 15, or 27 days) and notification level (light or moderate). METHODS: Participants will be randomized into three study durations and two notification levels. To sufficiently power the study, a minimum of 640 individuals must begin the study, and 97 individuals must complete the study. We will use a multiple logistic regression model to discern whether the study length and notification level are associated with the rate of study completion. For each group, we will also compare participant adherence and the proportion of trials that yield statistically meaningful results. RESULTS: We completed the beta testing of the N1 app on a convenience sample of users. The Brain Boost Study on the N1 app opened enrollment to the public in October 2019. More than 30 participants enrolled in the first month. CONCLUSIONS: To our knowledge, this will be the first study to rigorously evaluate critical factors associated with study completion in the context of app-based N-of-1 trials. TRIAL REGISTRATION: ClinicalTrials.gov NCT04056650; https://clinicaltrials.gov/ct2/show/NCT04056650. 
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): PRR1-10.2196/16362.
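
The protocol above crosses three study durations with two notification levels, giving six arms. One way such an allocation could be sketched is shown below; the shuffled round-robin scheme, the function name, and the seed are assumptions made for illustration, since the abstract does not describe the randomization procedure itself.

```python
import itertools
import random

DURATIONS_DAYS = (5, 15, 27)                 # study durations named in the abstract
NOTIFICATION_LEVELS = ("light", "moderate")  # notification levels named in the abstract


def assign_arms(participant_ids, seed=0):
    """Assign each participant to one (duration, notification) arm.

    Uses shuffled round-robin allocation so arm sizes differ by at most one.
    This is an illustrative scheme, not the study's documented procedure.
    """
    arms = list(itertools.product(DURATIONS_DAYS, NOTIFICATION_LEVELS))
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: arms[i % len(arms)] for i, pid in enumerate(shuffled)}


# 640 is the minimum number of starting participants stated in the abstract.
assignments = assign_arms(range(640))
```

With 640 participants and six arms, this scheme yields arms of 106 or 107 people each.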

10.
NPJ Digit Med ; 2: 31, 2019.
Article in English | MEDLINE | ID: mdl-31304378

ABSTRACT

Hip fractures are a leading cause of death and disability among older adults. Hip fractures are also the most commonly missed diagnosis on pelvic radiographs, and delayed diagnosis leads to higher cost and worse outcomes. Computer-aided diagnosis (CAD) algorithms have shown promise for helping radiologists detect fractures, but the image features underpinning their predictions are notoriously difficult to understand. In this study, we trained deep-learning models on 17,587 radiographs to classify fracture, 5 patient traits, and 14 hospital process variables. All 20 variables could be individually predicted from a radiograph, with the best performances on scanner model (AUC = 1.00), scanner brand (AUC = 0.98), and whether the order was marked "priority" (AUC = 0.79). Fracture was predicted moderately well from the image (AUC = 0.78) and better when combining image features with patient data (AUC = 0.86, DeLong paired AUC comparison, p = 2e-9) or patient data plus hospital process features (AUC = 0.91, p = 1e-21). Fracture prediction on a test set that balanced fracture risk across patient variables was significantly lower than a random test set (AUC = 0.67, DeLong unpaired AUC comparison, p = 0.003); and on a test set with fracture risk balanced across patient and hospital process variables, the model performed randomly (AUC = 0.52, 95% CI 0.46-0.58), indicating that these variables were the main source of the model's fracture predictions. A single model that directly combines image features, patient, and hospital process data outperforms a Naive Bayes ensemble of an image-only model prediction, patient, and hospital process data. If CAD algorithms are inexplicably leveraging patient and process variables in their predictions, it is unclear how radiologists should interpret their predictions in the context of other known patient data. Further research is needed to illuminate deep-learning decision processes so that computers and clinicians can effectively cooperate.

11.
Bioinformatics ; 35(21): 4515-4518, 2019 11 01.
Article in English | MEDLINE | ID: mdl-31214700

ABSTRACT

MOTIVATION: Electronic health records (EHRs) are quickly becoming omnipresent in healthcare, but interoperability issues and technical demands limit their use for biomedical and clinical research. Interactive and flexible software that interfaces directly with EHR data structured around a common data model (CDM) could accelerate more EHR-based research by making the data more accessible to researchers who lack computational expertise and/or domain knowledge. RESULTS: We present PatientExploreR, an extensible application built on the R/Shiny framework that interfaces with a relational database of EHR data in the Observational Medical Outcomes Partnership CDM format. PatientExploreR produces patient-level interactive and dynamic reports and facilitates visualization of clinical data without any programming required. It allows researchers to easily construct and export patient cohorts from the EHR for analysis with other software. This application could enable easier exploration of patient-level data for physicians and researchers. PatientExploreR can incorporate EHR data from any institution that employs the CDM for users with approved access. The software code is free and open source under the MIT license, enabling institutions to install and users to expand and modify the application for their own purposes. AVAILABILITY AND IMPLEMENTATION: PatientExploreR can be freely obtained from GitHub: https://github.com/BenGlicksberg/PatientExploreR. We provide instructions for how researchers with approved access to their institutional EHR can use this package. We also release an open sandbox server of synthesized patient data for users without EHR access to explore: http://patientexplorer.ucsf.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Electronic Health Records , Software , Computers , Databases, Factual , Humans , Observational Studies as Topic
12.
J Med Internet Res ; 21(4): e12641, 2019 04 01.
Article in English | MEDLINE | ID: mdl-30932871

ABSTRACT

BACKGROUND: Recent advances in molecular biology, sensors, and digital medicine have led to an explosion of products and services for high-resolution monitoring of individual health. The N-of-1 study has emerged as an important methodological tool for harnessing these new data sources, enabling researchers to compare the effectiveness of health interventions at the level of a single individual. OBJECTIVE: N-of-1 studies are susceptible to several design flaws. We developed a model that generates realistic data for N-of-1 studies to enable researchers to optimize study designs in advance. METHODS: Our stochastic time-series model simulates an N-of-1 study, incorporating all study-relevant effects, such as carryover and wash-in effects, as well as various sources of noise. The model can be used to produce realistic simulated data for a near-infinite number of N-of-1 study designs, treatment profiles, and patient characteristics. RESULTS: Using simulation, we demonstrate how the number of treatment blocks, ordering of treatments within blocks, duration of each treatment, and sampling frequency affect our ability to detect true differences in treatment efficacy. We provide a set of recommendations for study designs on the basis of treatment, outcomes, and instrument parameters, and make our simulation software publicly available for use by the precision medicine community. CONCLUSIONS: Simulation can facilitate rapid optimization of N-of-1 study designs and increase the likelihood of study success while minimizing participant burden.
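
To make the idea of carryover and wash-in concrete, the sketch below simulates a daily outcome in which the effect of the current treatment builds gradually while the previous treatment's effect decays. It is a deliberately simplified stand-in, not the authors' published model: the single time constant `tau`, the Gaussian noise term, and all parameter names and defaults are assumptions chosen for illustration.

```python
import random


def simulate_n_of_1(schedule, effects, baseline=50.0, tau=2.0, noise_sd=1.0, seed=0):
    """Simulate one daily outcome series for an alternating-treatment schedule.

    schedule : list of treatment labels, one per day (e.g. "A", "A", "B", ...)
    effects  : dict mapping each label to its steady-state effect on the outcome
    tau      : time constant in days (>= 1) governing wash-in of the current
               treatment and carryover from the previous one
    """
    if tau < 1.0:
        raise ValueError("tau must be at least 1 day")
    rng = random.Random(seed)
    current_effect = 0.0
    outcomes = []
    for treatment in schedule:
        target = effects[treatment]
        # Move a fraction 1/tau of the way toward the new treatment's effect,
        # so the old effect lingers (carryover) and the new one builds (wash-in).
        current_effect += (target - current_effect) / tau
        outcomes.append(baseline + current_effect + rng.gauss(0.0, noise_sd))
    return outcomes


# Hypothetical design: two blocks, each with three days of A then three days of B.
toy_schedule = (["A"] * 3 + ["B"] * 3) * 2
toy_series = simulate_n_of_1(toy_schedule, {"A": 4.0, "B": 0.0})
```

Varying the block length relative to `tau` in a simulation like this shows why short treatment periods can blur the measured difference between treatments.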


Subjects
Computer Simulation/standards , Precision Medicine/methods , Humans , Research Design
13.
Bioinformatics ; 34(15): 2614-2624, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29490008

ABSTRACT

Motivation: The biomedical community's collective understanding of how chemicals, genes and phenotypes interact is distributed across the text of over 24 million research articles. These interactions offer insights into the mechanisms behind higher order biochemical phenomena, such as drug-drug interactions and variations in drug response across individuals. To assist their curation at scale, we must understand what relationship types are possible and map unstructured natural language descriptions onto these structured classes. We used NCBI's PubTator annotations to identify instances of chemical, gene and disease names in Medline abstracts and applied the Stanford dependency parser to find connecting dependency paths between pairs of entities in single sentences. We combined a published ensemble biclustering algorithm (EBC) with hierarchical clustering to group the dependency paths into semantically-related categories, which we annotated with labels, or 'themes' ('inhibition' and 'activation', for example). We evaluated our theme assignments against six human-curated databases: DrugBank, Reactome, SIDER, the Therapeutic Target Database, OMIM and PharmGKB. Results: Clustering revealed 10 broad themes for chemical-gene relationships, 7 for chemical-disease, 10 for gene-disease and 9 for gene-gene. In most cases, enriched themes corresponded directly to known database relationships. Our final dataset, represented as a network, contained 37 491 thematically-labeled chemical-gene edges, 2 021 192 chemical-disease edges, 136 206 gene-disease edges and 41 418 gene-gene edges, each representing a single-sentence description of an interaction from somewhere in the literature. Availability and implementation: The complete network is available on Zenodo (https://zenodo.org/record/1035500). We have also provided the full set of dependency paths connecting biomedical entities in Medline abstracts, with associated sentences, for future use by the biomedical research community. 
Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Data Mining/methods , MEDLINE , Vocabulary, Controlled , Biological Variation, Population , Drug Interactions , Humans , Semantics
14.
J Am Med Inform Assoc ; 25(6): 679-685, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29329435

ABSTRACT

Objective: Distributional semantics algorithms, which learn vector space representations of words and phrases from large corpora, identify related terms based on contextual usage patterns. We hypothesize that distributional semantics can speed up lexicon expansion in a clinical domain, radiology, by unearthing synonyms from the corpus. Materials and Methods: We apply word2vec, a distributional semantics software package, to the text of radiology notes to identify synonyms for RadLex, a structured lexicon of radiology terms. We stratify performance by term category, term frequency, number of tokens in the term, vector magnitude, and the context window used in vector building. Results: Ranking candidates based on distributional similarity to a target term results in high curation efficiency: on a ranked list of 775 249 terms, >50% of synonyms occurred within the first 25 terms. Synonyms are easier to find if the target term is a phrase rather than a single word, if it occurs at least 100× in the corpus, and if its vector magnitude is between 4 and 5. Some RadLex categories, such as anatomical substances, are easier to identify synonyms for than others. Discussion: The unstructured text of clinical notes contains a wealth of information about human diseases and treatment patterns. However, searching and retrieving information from clinical notes often suffer due to variations in how similar concepts are described in the text. Biomedical lexicons address this challenge, but are expensive to produce and maintain. Distributional semantics algorithms can assist lexicon curation, saving researchers time and money.
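
Ranking candidate synonyms by distributional similarity, as described above, typically comes down to sorting terms by the cosine similarity of their vectors to the target term's vector. A minimal sketch follows; the three-dimensional toy vectors and the terms themselves are invented for illustration (real word2vec vectors have hundreds of dimensions), and the choice of cosine similarity is an assumption, since the abstract does not name the similarity measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target, vectors, top_k=25):
    """Return up to top_k (term, similarity) pairs, most similar to target first."""
    target_vec = vectors[target]
    scored = [
        (term, cosine_similarity(target_vec, vec))
        for term, vec in vectors.items()
        if term != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embedding; values are made up for illustration only.
toy_vectors = {
    "pulmonary": [0.9, 0.1, 0.0],
    "lung": [0.8, 0.2, 0.1],
    "hepatic": [0.1, 0.9, 0.0],
    "femur": [0.0, 0.1, 0.9],
}
ranked = rank_candidates("lung", toy_vectors)
```

A curator would then review only the top of the ranked list, which is where the abstract reports most synonyms appearing.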


Subjects
Data Mining/methods , Natural Language Processing , Radiology/classification , Semantics , Vocabulary, Controlled , Algorithms , Databases, Factual , Electronic Health Records , Humans , Radiology Information Systems , Software
15.
PLoS Comput Biol ; 11(7): e1004216, 2015 Jul.
Article in English | MEDLINE | ID: mdl-26219079

ABSTRACT

The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.


Subjects
Data Mining/methods , Drug Interactions , MEDLINE , Machine Learning , Natural Language Processing , Pharmacogenetics/methods , Algorithms , Pattern Recognition, Automated/methods , Vocabulary, Controlled
16.
J Am Med Inform Assoc ; 22(1): 121-31, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25336595

ABSTRACT

OBJECTIVE: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. MATERIALS: We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. RESULTS: There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. CONCLUSIONS: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice.


Subjects
Data Mining/methods , Databases as Topic , Electronic Health Records , Natural Language Processing , Artificial Intelligence , Drug Interactions , Drug-Related Side Effects and Adverse Reactions , Humans , Obesity
17.
Blood ; 124(14): 2298-305, 2014 Oct 02.
Article in English | MEDLINE | ID: mdl-25079360

ABSTRACT

The anticoagulant warfarin has >30 million prescriptions per year in the United States. Doses can vary 20-fold between patients, and incorrect dosing can result in serious adverse events. Variation in warfarin pharmacokinetic and pharmacodynamic genes, such as CYP2C9 and VKORC1, does not fully explain the dose variability in African Americans. To identify additional genetic contributors to warfarin dose, we exome sequenced 103 African Americans on stable doses of warfarin at extremes (≤ 35 and ≥ 49 mg/week). We found an association between lower warfarin dose and a population-specific regulatory variant, rs7856096 (P = 1.82 × 10^-8, minor allele frequency = 20.4%), in the folate homeostasis gene folylpolyglutamate synthase (FPGS). We replicated this association in an independent cohort of 372 African American subjects whose stable warfarin doses represented the full dosing spectrum (P = .046). In a combined cohort, adding rs7856096 to the International Warfarin Pharmacogenetic Consortium pharmacogenetic dosing algorithm resulted in a 5.8 mg/week (P = 3.93 × 10^-5) decrease in warfarin dose for each allele carried. The variant overlaps functional elements and was associated (P = .01) with FPGS gene expression in lymphoblastoid cell lines derived from combined HapMap African populations (N = 326). Our results provide the first evidence linking genetic variation in folate homeostasis to warfarin response.


Subjects
Anticoagulants/administration & dosage , Black or African American/genetics , Folic Acid/metabolism , Homeostasis , Warfarin/administration & dosage , Algorithms , Alleles , Cohort Studies , Exome , Geography , Haplotypes , Humans , Pharmacogenetics , Phenotype , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Sequence Analysis, DNA
18.
J Am Med Inform Assoc ; 20(e2): e297-305, 2013 Dec.
Article in English | MEDLINE | ID: mdl-23956017

ABSTRACT

OBJECTIVE: Mental illness is the leading cause of disability in the USA, but boundaries between different mental illnesses are notoriously difficult to define. Electronic medical records (EMRs) have recently emerged as a powerful new source of information for defining the phenotypic signatures of specific diseases. We investigated how EMR-based text mining and statistical analysis could elucidate the phenotypic boundaries of three important neuropsychiatric illnesses: autism, bipolar disorder, and schizophrenia. METHODS: We analyzed the medical records of over 7000 patients at two facilities, using an automated text-processing pipeline to annotate the clinical notes with Unified Medical Language System codes and then searching for enriched codes, and associations among codes, that were representative of the three disorders. We used dimensionality-reduction techniques on individual patient records to understand individual-level phenotypic variation within each disorder, as well as the degree of overlap among disorders. RESULTS: We demonstrate that automated EMR mining can be used to extract relevant drugs and phenotypes associated with neuropsychiatric disorders and characteristic patterns of associations among them. Patient-level analyses suggest a clear separation between autism and the other disorders, while revealing significant overlap between schizophrenia and bipolar disorder. They also enable localization of individual patients within the phenotypic 'landscape' of each disorder. CONCLUSIONS: Because EMRs reflect the realities of patient care rather than idealized conceptualizations of disease states, we argue that automated EMR mining can help define the boundaries between different mental illnesses, facilitate cohort building for clinical and genomic studies, and reveal how clear expert-defined disease boundaries are in practice.


Subjects
Autistic Disorder/diagnosis , Bipolar Disorder/diagnosis , Data Mining , Electronic Health Records , Phenotype , Schizophrenia/diagnosis , Adolescent , Adult , Aged , Aged, 80 and over , Autistic Disorder/genetics , Bipolar Disorder/genetics , Child , Child, Preschool , Diagnosis, Differential , Female , Humans , Male , Middle Aged , Psychotropic Drugs/therapeutic use , Schizophrenia/genetics , Unified Medical Language System , Young Adult
19.
J Theor Biol ; 334: 187-99, 2013 Oct 07.
Article in English | MEDLINE | ID: mdl-23747524

ABSTRACT

Health care-associated infections are a major problem in our society, accounting for tens of thousands of patient deaths and millions of dollars in wasted health care expenditures each year. Many of these infections are caused by bacteria that are transmitted from patient to patient either through direct contact or via the hands or clothing of health care workers (HCWs). Because of the complexity of bacterial transmission routes in health care settings, computational approaches are essential, though the underlying models are often analytically intractable. Here we describe the construction and detailed analysis of a model for bacterial transmission in health care settings. Our model includes both colonization and disease stages for patients and health care workers, as well as an isolation ward and both patient-patient and patient-HCW-patient transmission pathways. We explicitly derive the basic reproductive ratio for this complex model, a nine-term expression that contains all nine ways in which a new colonization can occur. Using key parameters found in the medical literature, we use our model to gain insight into the relative importance of various bacterial transmission pathways within health care facilities, and to identify which forms of interventions are likely to prove most effective in hospitals and long-term care settings. We show that analytical and numerical approaches can complement each other as we seek to untangle the complex web of interactions that occur within a health care facility.


Subjects
Cross Infection/transmission , Methicillin-Resistant Staphylococcus aureus/isolation & purification , Models, Biological , Staphylococcal Infections/transmission , Algorithms , Computer Simulation , Cross Infection/microbiology , Cross Infection/prevention & control , Host-Pathogen Interactions , Humans , Infection Control/methods , Infectious Disease Transmission, Patient-to-Professional/prevention & control , Methicillin-Resistant Staphylococcus aureus/physiology , Staphylococcal Infections/microbiology , Staphylococcal Infections/prevention & control
20.
Trends Pharmacol Sci ; 34(3): 178-84, 2013 Mar.
Article in English | MEDLINE | ID: mdl-23414686

ABSTRACT

Drug-drug interactions (DDIs) are an emerging threat to public health. Recent estimates indicate that DDIs cause nearly 74,000 emergency room visits and 195,000 hospitalizations each year in the USA. Current approaches to DDI discovery, which include Phase IV clinical trials and post-marketing surveillance, are insufficient for detecting many DDIs and do not alert the public to potentially dangerous DDIs before a drug enters the market. Recent work has applied state-of-the-art computational and statistical methods to the problem of DDIs. Here we review recent developments that encompass a range of informatics approaches in this domain, from the construction of databases for efficient searching of known DDIs to the prediction of novel DDIs based on data from electronic medical records, adverse event reports, scientific abstracts, and other sources. We also explore why DDIs are so difficult to detect and what the future holds for informatics-based approaches to DDI discovery.


Subjects
Drug Interactions , Informatics , Humans