Results 1 - 20 of 203
1.
Stud Health Technol Inform ; 318: 18-23, 2024 Sep 24.
Article in English | MEDLINE | ID: mdl-39320175

ABSTRACT

While the Fast Healthcare Interoperability Resources (FHIR) clinical terminology server enables quick and easy search and retrieval of coded medical data, it still has some drawbacks. When searching, typographical errors, variations in word forms, or deviations in word order can lead to incorrect search results. For retrieval, queries to the server must strictly follow the FHIR application programming interface format, which requires users to know the syntax and remember the attribute codes they wish to retrieve. To improve these functionalities, a natural language interface was built that harnesses two preeminent large language models, along with other cutting-edge technologies such as speech-to-text conversion, vector semantic search, and conversational artificial intelligence. A preliminary evaluation shows promising results for a natural language interface to the FHIR clinical terminology system.
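The retrieval pain point the abstract describes (queries must follow the FHIR API syntax exactly) can be made concrete with a minimal sketch. The server base URL below is a placeholder; the `CodeSystem/$lookup` operation and its `system`/`code` parameters are part of the standard FHIR terminology API.

```python
from urllib.parse import urlencode

def build_lookup_url(base: str, system: str, code: str) -> str:
    """Build a FHIR R4 CodeSystem/$lookup request URL. Users must
    know this syntax and the exact code they want, which is the
    usability gap a natural language interface aims to close."""
    return f"{base}/CodeSystem/$lookup?{urlencode({'system': system, 'code': code})}"

# Look up a SNOMED CT code on a hypothetical terminology server.
url = build_lookup_url(
    "https://tx.example.org/fhir",  # placeholder endpoint
    "http://snomed.info/sct",
    "73211009",                     # SNOMED CT code for diabetes mellitus
)
```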


Subject(s)
Natural Language Processing , User-Computer Interface , Terminology as Topic , Health Information Interoperability , Controlled Vocabulary , Information Storage and Retrieval/methods , Humans , Electronic Health Records/classification , Semantics , Artificial Intelligence
2.
Stud Health Technol Inform ; 316: 834-838, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176922

ABSTRACT

Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.


Subject(s)
Clinical Trials as Topic , Information Dissemination , Humans , Natural Language Processing , Electronic Health Records/classification
3.
Stud Health Technol Inform ; 316: 267-271, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176724

ABSTRACT

Automatic alerting systems (AASs) can identify adverse health events, but emergency communication still relies on human operators and natural language. For complete automation, the diversity of adverse events must be coded at a granularity that supports optimal dispatches; hence, AASs should integrate with the International Classification of Diseases (ICD). The ICD-11 coding system includes chapters for external causes of injury. However, ICD-11 supports coding injury incidents in electronic health records (EHRs) after they have occurred, and does not address real-time injury reporting within its framework. We explore the potential challenges of integrating ICD-11 into AASs by analyzing external causes of morbidity or mortality and the dimensions of external causes as potential areas of integration. We identify five themes: (i) incident of injury, (ii) mode of transport, (iii) indoor location, (iv) outdoor location, and (v) type of building, and four challenges: (i) conceptual differences between the two systems, (ii) injury identification, (iii) presence of entities below the shoreline in ICD-11, and (iv) lack of specificity in certain ICD-11 codes relevant to AASs. To ease the integration of ICD-11 into AASs, we recommend an AAS data dictionary and propose ICD-11 updates related to external causes of injury.


Subject(s)
Electronic Health Records , International Classification of Diseases , Electronic Health Records/classification , Humans , Systems Integration
4.
Stud Health Technol Inform ; 316: 695-699, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176890

ABSTRACT

Annotated language resources derived from routine clinical documentation are an intriguing asset for secondary-use scenarios. In this investigation, we report on how such a resource can be leveraged to identify additional term candidates for a chosen set of ICD-10 codes. We conducted a log-likelihood analysis of the co-occurrence of approximately 1.9 million de-identified ICD-10 codes with corresponding brief textual entries from German problem lists. This analysis identified candidates with statistical significance set at p < 0.01, which were used as seed terms to harvest further candidates by interfacing with a large language model in a second step. The proposed approach identifies additional term candidates at suitable performance levels: hypernyms MAP@5 = 0.801, synonyms MAP@5 = 0.723, and hyponyms MAP@5 = 0.507. The reuse of existing annotated clinical datasets, in combination with large language models, is an interesting strategy for bridging the lexical gap between standardized clinical terminologies and real-world jargon.
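The co-occurrence statistic behind such an analysis is standard. A minimal sketch of Dunning's log-likelihood ratio (G²) for one term-code pair, computed from a 2x2 contingency table (the counts in the test below are invented):

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 co-occurrence table:
    k11 = code and term together, k12 = code without term,
    k21 = term without code, k22 = neither."""
    def xlogx(*counts):
        return sum(k * math.log(k) for k in counts if k > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (
        xlogx(k11, k12, k21, k22)      # individual cells
        - xlogx(k11 + k12, k21 + k22)  # row totals
        - xlogx(k11 + k21, k12 + k22)  # column totals
        + xlogx(n)                     # grand total
    )
```

A large G² flags the term as a candidate; keeping pairs significant at p < 0.01 corresponds to G² above the chi-squared cutoff with one degree of freedom, about 6.63.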


Subject(s)
International Classification of Diseases , Natural Language Processing , Controlled Vocabulary , Humans , Terminology as Topic , Electronic Health Records/classification , Germany
5.
Stud Health Technol Inform ; 316: 846-850, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176925

ABSTRACT

Text classification plays an essential role in the medical domain by organizing and categorizing vast amounts of textual data through machine learning (ML) and deep learning (DL). The adoption of Artificial Intelligence (AI) technologies in healthcare has raised concerns about the interpretability of AI models, which are often perceived as "black boxes." Explainable AI (XAI) techniques aim to mitigate this issue by elucidating the decision-making processes of AI models. In this paper, we present a scoping review exploring the application of different XAI techniques in medical text classification, identifying two main types: model-specific and model-agnostic methods. Despite some positive feedback from developers, formal evaluations of these techniques with medical end users remain limited. The review highlights the need for further research in XAI to enhance trust and transparency in AI-driven decision-making in healthcare.


Subject(s)
Artificial Intelligence , Natural Language Processing , Humans , Machine Learning , Electronic Health Records/classification , Deep Learning
6.
Stud Health Technol Inform ; 316: 1008-1012, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176961

ABSTRACT

Coding according to the International Classification of Diseases (ICD)-10 and its clinical modifications (CM) is inherently complex and expensive. Natural Language Processing (NLP) assists by simplifying the analysis of unstructured data from electronic health records, thereby facilitating diagnosis coding. This study investigates the suitability of transformer models for ICD-10 classification, considering both encoder and encoder-decoder architectures. The analysis is performed on clinical discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, which contains an extensive collection of electronic health records. Pre-trained models such as BioBERT, ClinicalBERT, ClinicalLongformer, and ClinicalBigBird are adapted for the coding task, incorporating specific preprocessing techniques to enhance performance. The findings indicate that increasing context length improves accuracy, and that the difference in accuracy between encoder and encoder-decoder models is negligible.
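The context-length finding suggests why long discharge summaries are often split into overlapping windows before being fed to a fixed-context encoder. A minimal, model-agnostic sketch of such chunking; the window and stride sizes are illustrative, not taken from the study:

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a token sequence into windows of at most `max_len`
    tokens, with consecutive windows overlapping by `stride`
    tokens so no span of context is lost at a window boundary."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    for start in range(0, len(tokens), max_len - stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks
```

Per-chunk predictions are then pooled (e.g., max over chunk logits) to code the whole document.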


Subject(s)
Electronic Health Records , International Classification of Diseases , Natural Language Processing , Electronic Health Records/classification , Humans , Clinical Coding
7.
Stud Health Technol Inform ; 315: 368-372, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049285

ABSTRACT

This paper explores the balance between fairness and performance in machine learning classification, predicting the likelihood of a patient receiving antimicrobial treatment from structured data in community nursing wound care electronic health records. The data include two important social-determinants-of-health predictors (gender and language), which we used to evaluate the fairness of the classifiers. At the same time, the impact of various groupings of language codes on classifier performance and fairness is analyzed. The most common statistical-learning-based classifiers are evaluated. The findings indicate that while K-Nearest Neighbors offers the best fairness metrics across grouping settings, the performance of all classifiers is generally consistent across language code groupings. Grouping more variables also tends to improve the fairness metrics of all classifiers while maintaining their performance.


Subject(s)
Electronic Health Records , Health Equity , Machine Learning , Electronic Health Records/classification , Humans , Social Determinants of Health
8.
J Am Med Inform Assoc ; 31(7): 1493-1502, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38742455

ABSTRACT

BACKGROUND: Error analysis plays a crucial role in clinical concept extraction, a fundamental subtask within clinical natural language processing (NLP). The process typically involves a manual review of error types, including the contextual and linguistic factors contributing to their occurrence, and the identification of underlying causes to refine the NLP model and improve its performance. Conducting error analysis can be complex, requiring a combination of NLP expertise and domain-specific knowledge. Because electronic health record (EHR) settings are highly heterogeneous across institutions, standardizing and reproducing the error analysis process can be challenging. OBJECTIVES: This study aims to facilitate a collaborative effort to establish common definitions and taxonomies for capturing diverse error types, fostering community consensus on error analysis for clinical concept extraction tasks. MATERIALS AND METHODS: We iteratively developed and evaluated an error taxonomy based on existing literature, standards, real-world data, multisite case evaluations, and community feedback. The finalized taxonomy was released in both .dtd and .owl formats at the Open Health Natural Language Processing Consortium. The taxonomy is compatible with several open-source annotation tools, including MAE, Brat, and MedTator. RESULTS: The resulting error taxonomy comprises 43 distinct error classes, organized into 6 error dimensions and 4 properties, including model type (symbolic and statistical machine learning), evaluation subject (model and human), evaluation level (patient, document, sentence, and concept), and annotation examples. Internal and external evaluations revealed strong variations in error types across methodological approaches, tasks, and EHR settings. Key points emerged from community feedback, including the need to enhance the clarity, generalizability, and usability of the taxonomy, along with dissemination strategies.
CONCLUSION: The proposed taxonomy can facilitate the acceleration and standardization of the error analysis process in multi-site settings, thus improving the provenance, interpretability, and portability of NLP models. Future researchers could explore the potential direction of developing automated or semi-automated methods to assist in the classification and standardization of error analysis.


Subject(s)
Electronic Health Records , Natural Language Processing , Electronic Health Records/classification , Humans , Classification/methods , Medical Errors/classification
9.
J Med Internet Res ; 26: e47682, 2024 May 31.
Article in English | MEDLINE | ID: mdl-38820575

ABSTRACT

The health sector is highly digitized, enabling the collection of vast quantities of electronic data about health and well-being. These data are collected by a diverse array of information and communication technologies, including systems used by health care organizations, consumer and community sources such as information collected on the web, and passively collected data from technologies such as wearables and devices. Understanding the breadth of technologies that collect these data, and how the data can be actioned, is a challenge for the significant portion of the digital health workforce who interact with health data as part of their duties but are not informatics experts. This viewpoint presents a taxonomy categorizing common information and communication technologies that collect electronic health data. An initial classification of key information systems collecting electronic health data was undertaken via a rapid review of the literature. Subsequently, a purposeful search of the scholarly and gray literature was undertaken to extract key information about the systems within each category, generate definitions of the systems, and describe their strengths and limitations.


Subject(s)
Health Information Systems , Humans , Electronic Health Records/classification
10.
IEEE J Biomed Health Inform ; 28(7): 4238-4248, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38635388

ABSTRACT

Despite the vast potential for insight and value in Electronic Health Records (EHRs), it is challenging to fully leverage all the available information, particularly the free-text data in which clinicians describe the health status of patients. Named Entity Recognition and Linking tools allow not only the structuring of information contained within free-text data, but also integration with medical ontologies, which may prove highly beneficial for analyzing patient medical histories with the aim of forecasting future medical outcomes, such as the diagnosis of a new disorder. In this paper, we propose MedTKG, a Temporal Knowledge Graph (TKG) framework that incorporates both the dynamic information of patient clinical histories and the static information of medical ontologies. The TKG models a medical history as a series of snapshots at different points in time, effectively capturing the dynamic nature of the patient's health status, while a static graph models the hierarchies of concepts extracted from domain ontologies. The proposed method predicts future disorders by identifying the missing object in the quadruple ⟨s, r, ?, t⟩, where s and r denote the patient and the disorder relation type, respectively, and t is the timestamp of the query. The method is evaluated on clinical notes extracted from MIMIC-III and demonstrates the effectiveness of the TKG framework in predicting future disorders and of medical ontologies in improving its performance.
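The query form ⟨s, r, ?, t⟩ can be made concrete with a toy example. The identifiers below are invented, and a real TKG model ranks every candidate object with learned embeddings rather than doing an exact lookup, which is what lets it predict disorders not yet recorded in the graph:

```python
from collections import defaultdict

# Invented quadruples (subject, relation, object, timestamp).
quads = [
    ("patient_1", "diagnosed_with", "hypertension", 1),
    ("patient_1", "diagnosed_with", "diabetes", 2),
    ("patient_2", "diagnosed_with", "asthma", 1),
]

# Index known facts by (subject, relation, timestamp).
index = defaultdict(set)
for s, r, o, t in quads:
    index[(s, r, t)].add(o)

def known_objects(s, r, t):
    """Exact-lookup answer to the query (s, r, ?, t); a learned
    model instead scores *all* candidate objects at time t."""
    return index[(s, r, t)]
```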


Subject(s)
Biological Ontologies , Electronic Health Records , Humans , Electronic Health Records/classification , Algorithms
11.
Pharmacol Res Perspect ; 8(6): e00687, 2020 12.
Article in English | MEDLINE | ID: mdl-33280248

ABSTRACT

Characterizing long-term prescription data is challenging due to the time-varying nature of drug use. Conventional approaches summarize time-varying data into categorical variables based on simple measures, such as cumulative dose, while ignoring patterns of use. The loss of information can lead to misclassification and biased estimates of the exposure-outcome association. We introduce a classification method to characterize longitudinal prescription data with an unsupervised machine learning algorithm. We used administrative databases covering virtually all 1.3 million residents of Manitoba and explicitly designed features to describe the average dose, proportion of days covered (PDC), dose change, and dose variability, then clustered the resulting feature space using K-means clustering. We applied this method to metformin use in diabetes patients. We identified 27,786 metformin users and showed that the feature distributions of their metformin use are stable for varying lengths of follow-up and that these distributions have clear interpretations. We found six distinct metformin user groups: patients with intermittent use, decreasing dose, increasing dose, high dose, and two medium dose groups (one with stable dose and one with highly variable use). Patients in the varying and decreasing dose groups had a higher chance of progression of diabetes than other patients. The method presented in this paper allows for characterization of drug use into distinct and clinically relevant groups in a way that cannot be obtained from merely classifying use by quantiles of overall use.
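The four feature families can be sketched for a single patient's daily dose record as follows; the exact feature definitions in the study may differ, so this is only an illustration:

```python
from statistics import mean, pstdev

def drug_use_features(daily_doses):
    """Summarize a follow-up window of daily doses (0 marks an
    uncovered day) into the four feature families named in the
    abstract: average dose, proportion of days covered (PDC),
    dose change, and dose variability."""
    covered = [d for d in daily_doses if d > 0]
    return {
        "avg_dose": mean(covered) if covered else 0.0,
        "pdc": len(covered) / len(daily_doses),
        "dose_change": covered[-1] - covered[0] if covered else 0.0,
        "dose_variability": pstdev(covered) if len(covered) > 1 else 0.0,
    }
```

One such feature vector per patient is then the input to K-means clustering.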


Subject(s)
Databases, Factual/classification , Diabetes Mellitus/drug therapy , Diabetes Mellitus/epidemiology , Electronic Health Records/classification , Hypoglycemic Agents/therapeutic use , Metformin/therapeutic use , Adult , Aged , Algorithms , Dose-Response Relationship, Drug , Female , Follow-Up Studies , Humans , Male , Middle Aged , Ontario/epidemiology , Universal Health Care
13.
J Am Med Inform Assoc ; 27(8): 1235-1243, 2020 08 01.
Article in English | MEDLINE | ID: mdl-32548637

ABSTRACT

OBJECTIVE: A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes. MATERIALS AND METHODS: Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via a clustering ensemble to yield final phenotype probabilities. RESULTS: sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and to the relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties. DISCUSSION: sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA's feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes. CONCLUSIONS: sureLDA is well suited to large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.


Subject(s)
Algorithms , Electronic Health Records , Natural Language Processing , Electronic Health Records/classification , Humans , Precision Medicine , ROC Curve , Translational Research, Biomedical
14.
PLoS One ; 15(5): e0232840, 2020.
Article in English | MEDLINE | ID: mdl-32396579

ABSTRACT

Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks: site, subsite, laterality, histology, behavior, and grade. We expect that with minimal modifications, our add-on can be applied to a wide range of other clinical text-based tasks.


Subject(s)
Electronic Health Records/classification , Neoplasms/pathology , Histological Techniques , Humans , Natural Language Processing , SEER Program
15.
J Am Med Inform Assoc ; 27(6): 877-883, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32374408

ABSTRACT

OBJECTIVE: Accurate electronic phenotyping is essential to support collaborative observational research. Supervised machine learning methods can be used to train phenotype classifiers in a high-throughput manner using imperfectly labeled data. We developed 10 phenotype classifiers using this approach and evaluated performance across multiple sites within the Observational Health Data Sciences and Informatics (OHDSI) network. MATERIALS AND METHODS: We constructed classifiers using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE) R-package, an open-source framework for learning phenotype classifiers using datasets in the Observational Medical Outcomes Partnership Common Data Model. We labeled training data based on the presence of multiple mentions of disease-specific codes. Performance was evaluated on cohorts derived using rule-based definitions and real-world disease prevalence. Classifiers were developed and evaluated across 3 medical centers, including 1 international site. RESULTS: Compared to the multiple-mentions labeling heuristic, classifiers showed a mean recall boost of 0.43 with a mean precision loss of 0.17. Performance decreased slightly when classifiers were shared across medical centers, with mean recall and precision decreasing by 0.08 and 0.01, respectively, at a site within the USA, and by 0.18 and 0.10, respectively, at an international site. DISCUSSION AND CONCLUSION: We demonstrate a high-throughput pipeline for constructing and sharing phenotype classifiers across sites within the OHDSI network using APHRODITE. Classifiers exhibit good portability between sites within the USA but limited portability internationally, indicating that classifier generalizability may have geographic limitations; consequently, sharing the classifier-building recipe, rather than the pretrained classifiers, may be more useful for facilitating collaborative observational research.
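The noisy labeling heuristic, where a patient is treated as a probable case when disease-specific codes are mentioned multiple times, can be sketched as follows; the example code and threshold are illustrative, not taken from the study:

```python
def noisy_label(code_counts, disease_codes, min_mentions=2):
    """Noisy positive label: True when disease-specific codes
    appear at least `min_mentions` times in a patient's record.
    These imperfect labels then train the phenotype classifier."""
    return sum(code_counts.get(c, 0) for c in disease_codes) >= min_mentions
```

Usage: `noisy_label({"E11.9": 3}, {"E11.9"})` labels the patient a probable case, where `"E11.9"` is an illustrative diabetes code.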


Subject(s)
Electronic Health Records/classification , Medical Informatics , Supervised Machine Learning , Classification/methods , Data Science , Observational Studies as Topic
16.
Nat Commun ; 11(1): 2536, 2020 05 21.
Article in English | MEDLINE | ID: mdl-32439869

ABSTRACT

Electronic health records (EHR) are rich heterogeneous collections of patient health information, whose broad adoption provides clinicians and researchers unprecedented opportunities for health informatics, disease-risk prediction, actionable clinical recommendations, and precision medicine. However, EHRs present several modeling challenges, including highly sparse data matrices, noisy irregular clinical notes, arbitrary biases in billing code assignment, diagnosis-driven lab tests, and heterogeneous data types. To address these challenges, we present MixEHR, a multi-view Bayesian topic model. We demonstrate MixEHR on MIMIC-III, Mayo Clinic Bipolar Disorder, and Quebec Congenital Heart Disease EHR datasets. Qualitatively, MixEHR disease topics reveal meaningful combinations of clinical features across heterogeneous data types. Quantitatively, we observe superior prediction accuracy of diagnostic codes and lab test imputations compared to state-of-the-art methods. We leverage the inferred patient topic mixtures to classify target diseases and predict mortality of patients in critical conditions. In all comparisons, MixEHR confers competitive performance and reveals meaningful disease-related topics.


Subject(s)
Electronic Health Records/classification , Medical Informatics/methods , Bayes Theorem , Databases, Factual , Electronic Health Records/statistics & numerical data , Humans , Machine Learning , Models, Statistical , Phenotype
17.
Anesthesiology ; 132(4): 738-749, 2020 04.
Article in English | MEDLINE | ID: mdl-32028374

ABSTRACT

BACKGROUND: Accurate anesthesiology procedure code data are essential to quality improvement, research, and reimbursement tasks within anesthesiology practices. Advanced data science techniques, including machine learning and natural language processing, offer opportunities to develop classification tools for Current Procedural Terminology codes across anesthesia procedures. METHODS: Models were created using a Train/Test dataset including 1,164,343 procedures from 16 academic and private hospitals. Five supervised machine learning models were created to classify anesthesiology Current Procedural Terminology codes, with accuracy defined as first-choice classification matching the institutionally assigned code in the perioperative database. The two best performing models were further refined and tested on a Holdout dataset from a single institution distinct from Train/Test. A tunable confidence parameter was created to identify cases for which models were highly accurate, with the goal of at least 95% accuracy, above the reported 2018 Centers for Medicare and Medicaid Services (Baltimore, Maryland) fee-for-service accuracy. Actual submitted claim data from billing specialists were used as a reference standard. RESULTS: Support vector machine and neural network label-embedding attentive models were the best performing models, respectively demonstrating overall accuracies of 87.9% and 84.2% (single best code), and 96.8% and 94.0% (within top three). Classification accuracy was 96.4% in 47.0% of cases using the support vector machine and 94.4% in 62.2% of cases using the label-embedding attentive model within the Train/Test dataset. In the Holdout dataset, respective classification accuracies were 93.1% in 58.0% of cases and 95.0% in 62.0% of cases. The most important feature in model training was procedure text.
CONCLUSIONS: Through application of machine learning and natural language processing techniques, highly accurate real-time models were created for anesthesiology Current Procedural Terminology code classification. The increased processing speed and a priori targeted accuracy of this classification approach may provide performance optimization and cost reduction for quality improvement, research, and reimbursement tasks reliant on anesthesiology procedure codes.
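The tunable confidence parameter trades coverage for accuracy: only predictions at or above the threshold are auto-coded, and the remainder fall back to human coders. A minimal sketch with invented predictions:

```python
def thresholded_accuracy(predictions, threshold):
    """`predictions` is a list of (confidence, is_correct) pairs.
    Returns (accuracy on the accepted subset, fraction accepted);
    accuracy is None when nothing clears the threshold."""
    accepted = [ok for conf, ok in predictions if conf >= threshold]
    if not accepted:
        return None, 0.0
    return sum(accepted) / len(accepted), len(accepted) / len(predictions)
```

Sweeping the threshold over held-out predictions is how one finds the operating point that meets a target accuracy (such as the study's 95% goal) at the highest possible coverage.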


Subject(s)
Current Procedural Terminology , Databases, Factual/classification , Electronic Health Records/classification , Machine Learning/classification , Neural Networks, Computer , Adolescent , Adult , Child , Child, Preschool , Female , Humans , Male , Middle Aged , Young Adult
18.
AMIA Annu Symp Proc ; 2020: 273-282, 2020.
Article in English | MEDLINE | ID: mdl-33936399

ABSTRACT

Research has demonstrated cohort misclassification when studies of suicidal thoughts and behaviors (STBs) rely on ICD-9/10-CM diagnosis codes. Electronic health record (EHR) data are being explored to better identify patients, a process called EHR phenotyping. Most STB phenotyping studies have used structured EHR data, but some are beginning to incorporate unstructured clinical text. In this study, we used a publicly accessible natural language processing (NLP) program for biomedical text (MetaMap) and iterative elastic net regression to extract and select predictive text features from the discharge summaries of 810 inpatient admissions of interest. Initial sets of 5,866 and 2,709 text features were reduced to 18 and 11, respectively. The two models fit with these features obtained an area under the receiver operating characteristic curve of 0.866-0.895 and an area under the precision-recall curve of 0.800-0.838, demonstrating the approach's potential to identify textual features to incorporate in phenotyping models.
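The reported AUROC has a simple rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A small self-contained sketch (quadratic in input size, fine for illustration):

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney formulation, counting ties as
    half a win. `labels` are 1 for positive, 0 for negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```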


Subject(s)
Algorithms , Data Mining/methods , Electronic Health Records/classification , Natural Language Processing , Suicide, Attempted/classification , Cohort Studies , Female , Humans , International Classification of Diseases , Machine Learning , Male , Phenotype , Prevalence , ROC Curve
19.
J Am Med Inform Assoc ; 27(1): 119-126, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31722396

ABSTRACT

OBJECTIVE: Phenotyping patients using electronic health record (EHR) data conventionally requires labeled cases and controls. Assigning labels requires manual medical chart review and therefore is labor intensive. For some phenotypes, identifying gold-standard controls is prohibitive. We developed an accurate EHR phenotyping approach that does not require labeled controls. MATERIALS AND METHODS: Our framework relies on a random subset of cases, which can be specified using an anchor variable that has excellent positive predictive value and sensitivity independent of predictors. We proposed a maximum likelihood approach that efficiently leverages data from the specified cases and unlabeled patients to develop logistic regression phenotyping models, and compared model performance with existing algorithms. RESULTS: Our method outperformed the existing algorithms on predictive accuracy in Monte Carlo simulation studies, in an application identifying hypertension patients with hypokalemia requiring oral supplementation using a simulated anchor, and in an application identifying primary aldosteronism patients using real-world cases and anchor variables. Our method additionally generated consistent estimates of 2 important parameters: phenotype prevalence and the proportion of true cases that are labeled. DISCUSSION: Upon identification of an anchor variable that is scalable and transferable to different practices, our approach should facilitate development of scalable, transferable, and practice-specific phenotyping models. CONCLUSIONS: Our proposed approach enables accurate semiautomated EHR phenotyping with minimal manual labeling and therefore should greatly facilitate EHR clinical decision support and research.


Subject(s)
Algorithms , Electronic Health Records/classification , Likelihood Functions , Humans , Monte Carlo Method
20.
Comput Methods Programs Biomed ; 188: 105264, 2020 May.
Article in English | MEDLINE | ID: mdl-31851906

ABSTRACT

BACKGROUND AND OBJECTIVE: This work deals with clinical text mining, a field of Natural Language Processing applied to biomedical informatics. The aim is to classify Electronic Health Records with respect to the International Classification of Diseases, the foundation for international health statistics and the standard for reporting diseases and health conditions. Within the framework of data mining, the goal is multi-label classification, as each health record has multiple International Classification of Diseases codes assigned. We investigate five Deep Learning architectures with a dataset obtained from the Basque Country Health System, and six different perspectives derived from shifts in the input and the output. METHODS: We evaluate a Feed Forward Neural Network as the baseline and several recurrent models based on the Bidirectional GRU architecture, focusing on the text representation layer and testing three variants, from standard word embeddings to meta word embedding techniques and contextual embeddings. RESULTS: The recurrent models outperform the non-recurrent model. The meta word embedding techniques beat the standard word embeddings, but the contextual embeddings prove the most robust for the downstream task overall. Additionally, label granularity alone has an impact on classification performance. CONCLUSIONS: The contributions of this work are a) a comparison of five Deep Learning classification approaches on a Spanish dataset for the multi-label health text classification problem; b) a study of the impact of document length and of label-set size and granularity in the multi-label context; and c) a study of measures to mitigate multi-label text classification problems related to label-set size and sparseness.


Subject(s)
Deep Learning , Electronic Health Records/classification , Medical Informatics , Pattern Recognition, Automated , Algorithms , Computer Graphics , Data Mining , Humans , International Classification of Diseases , Natural Language Processing , Neural Networks, Computer , Software , Spain