Results 1 - 20 of 203
1.
Stud Health Technol Inform; 318: 18-23, 2024 Sep 24.
Article in English | MEDLINE | ID: mdl-39320175

ABSTRACT

While a Fast Healthcare Interoperability Resources (FHIR) clinical terminology server enables quick and easy search and retrieval of coded medical data, it still has some drawbacks. When searching, typographical errors, variations in word forms, or deviations in word sequence can lead to incorrect search outcomes. For retrieval, queries to the server must strictly follow the FHIR application programming interface format, which requires users to know the syntax and remember the attribute codes they wish to retrieve. To improve its functionality, a natural language interface was built that harnesses the capabilities of two preeminent large language models, along with other cutting-edge technologies such as speech-to-text conversion, vector semantic search, and conversational artificial intelligence. Preliminary evaluation shows promising results in building a natural language interface for the FHIR clinical terminology system.
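To make the strict API format concrete, here is a minimal sketch of two FHIR terminology calls such an interface would have to generate on the user's behalf; the base URL is a placeholder and the SNOMED CT code is only illustrative.

```python
# Minimal sketch of the strict FHIR terminology API calls the abstract refers to;
# the base URL is a placeholder and the SNOMED CT code shown is only illustrative.
import requests

FHIR_BASE = "https://terminology.example.org/fhir"  # hypothetical terminology server

# $lookup: retrieve the display and properties of a single code
lookup = requests.get(
    f"{FHIR_BASE}/CodeSystem/$lookup",
    params={"system": "http://snomed.info/sct", "code": "73211009"},
    headers={"Accept": "application/fhir+json"},
)
print(lookup.json().get("parameter", []))

# $expand with a text filter: the exact-syntax search a natural language
# interface would generate from a free-text or spoken request
expand = requests.get(
    f"{FHIR_BASE}/ValueSet/$expand",
    params={"url": "http://snomed.info/sct?fhir_vs", "filter": "diabetes"},
    headers={"Accept": "application/fhir+json"},
)
for item in expand.json().get("expansion", {}).get("contains", []):
    print(item["code"], item["display"])
```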


Subjects
Natural Language Processing, User-Computer Interface, Terminology as Topic, Health Information Interoperability, Controlled Vocabulary, Information Storage and Retrieval/methods, Humans, Electronic Health Records/classification, Semantics, Artificial Intelligence
2.
Stud Health Technol Inform; 316: 267-271, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176724

ABSTRACT

Automatic alerting systems (AASs) can identify adverse health events, but emergency communication still relies on human operators and natural language. For complete automation, the diversity of adverse events needs to be coded at a granularity that supports optimal dispatches. Hence, AASs should integrate with the International Classification of Diseases (ICD). The ICD-11 coding system includes chapters for external causes of injury. However, ICD-11 supports coding injury incidents in electronic health records (EHRs) after they have occurred and does not integrate real-time injury reporting within its framework. We explore the potential challenges of integrating ICD-11 into AASs by analyzing external causes of morbidity or mortality and the dimensions of external causes as potential areas of integration. We identify five themes: (i) incident of injury, (ii) mode of transport, (iii) indoor location, (iv) outdoor location, and (v) type of building, and four challenges: (i) conceptual differences between the two systems, (ii) injury identification, (iii) presence of entities below the shoreline in ICD-11, and (iv) lack of specificity in certain ICD-11 codes relevant to AASs. To ease the integration of ICD-11 into AASs, we recommend an AAS data dictionary and propose ICD-11 updates related to external causes of injury.


Subjects
Electronic Health Records, International Classification of Diseases, Electronic Health Records/classification, Humans, Systems Integration
3.
Stud Health Technol Inform; 316: 695-699, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176890

ABSTRACT

Annotated language resources derived from clinical routine documentation form an intriguing asset for secondary use scenarios. In this investigation, we report on how such a resource can be leveraged to identify additional term candidates for a chosen set of ICD-10 codes. We conducted a log-likelihood analysis of the co-occurrence of approximately 1.9 million de-identified ICD-10 codes with corresponding brief textual entries from German problem lists. This analysis identified candidates at a statistical significance level of p < 0.01, which were then used as seed terms to harvest additional candidates by interfacing with a large language model in a second step. The proposed approach identifies additional term candidates at suitable performance levels: hypernyms MAP@5 = 0.801, synonyms MAP@5 = 0.723, and hyponyms MAP@5 = 0.507. The reuse of existing annotated clinical datasets, in combination with large language models, presents an interesting strategy for bridging the lexical gap between standardized clinical terminologies and real-world jargon.
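For illustration, a small sketch of the log-likelihood (G2) association statistic commonly used for this kind of term/code co-occurrence analysis; the counts and the example term are invented, and this is not the authors' exact pipeline.

```python
# Sketch of the log-likelihood (G2) association test often used for term/code
# co-occurrence; counts below are invented for illustration.
import math

def g2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.
    k11: term and code co-occur, k12: term without code,
    k21: code without term, k22: neither."""
    total = k11 + k12 + k21 + k22
    def term(observed, row, col):
        expected = row * col / total
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    return 2.0 * (term(k11, row1, col1) + term(k12, row1, col2)
                  + term(k21, row2, col1) + term(k22, row2, col2))

# e.g. a German lay term appearing in problem-list entries coded E11.9
print(round(g2(k11=120, k12=80, k21=300, k22=1_899_500), 1))
```

With one degree of freedom, a G2 value above roughly 6.63 corresponds to p < 0.01.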


Subjects
International Classification of Diseases, Natural Language Processing, Controlled Vocabulary, Humans, Terminology as Topic, Electronic Health Records/classification, Germany
4.
Stud Health Technol Inform; 316: 846-850, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176925

ABSTRACT

Text classification plays an essential role in the medical domain by organizing and categorizing vast amounts of textual data through machine learning (ML) and deep learning (DL). The adoption of artificial intelligence (AI) technologies in healthcare has raised concerns about the interpretability of AI models, which are often perceived as "black boxes." Explainable AI (XAI) techniques aim to mitigate this issue by elucidating the decision-making processes of AI models. In this paper, we present a scoping review exploring the application of different XAI techniques to medical text classification, identifying two main types: model-specific and model-agnostic methods. Despite some positive feedback from developers, formal evaluations of these techniques with medical end users remain limited. The review highlights the need for further XAI research to enhance trust and transparency in AI-driven decision-making in healthcare.
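As a hedged illustration of a model-agnostic method of the kind the review covers, the sketch below applies LIME to a toy clinical text classifier; the texts and labels are invented placeholders.

```python
# A minimal model-agnostic XAI sketch using LIME on a toy clinical text classifier;
# the training texts and labels are invented placeholders.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["chest pain and shortness of breath", "routine follow-up, no complaints",
         "acute myocardial infarction suspected", "annual wellness visit"]
labels = [1, 0, 1, 0]  # 1 = cardiac-related note (toy labels)

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["other", "cardiac"])
explanation = explainer.explain_instance(
    "patient reports chest pain on exertion",
    pipeline.predict_proba,   # any black-box probability function works here
    num_features=5,
)
print(explanation.as_list())  # (token, weight) pairs behind the prediction
```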


Subjects
Artificial Intelligence, Natural Language Processing, Humans, Machine Learning, Electronic Health Records/classification, Deep Learning
5.
Stud Health Technol Inform; 316: 1008-1012, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176961

ABSTRACT

Coding according to the International Classification of Diseases (ICD)-10 and its clinical modifications (CM) is inherently complex and expensive. Natural Language Processing (NLP) assists by simplifying the analysis of unstructured data from electronic health records, thereby facilitating diagnosis coding. This study investigates the suitability of transformer models for ICD-10 classification, considering both encoder and encoder-decoder architectures. The analysis is performed on clinical discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, which contains an extensive collection of electronic health records. Pre-trained models such as BioBERT, ClinicalBERT, ClinicalLongformer, and ClinicalBigBird are adapted for the coding task, incorporating specific preprocessing techniques to enhance performance. The findings indicate that increasing context length improves accuracy, and that the difference in accuracy between encoder and encoder-decoder models is negligible.
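A minimal sketch of how such an encoder can be set up for multi-label ICD-10 coding with the Hugging Face transformers API; the checkpoint name is one of several clinical encoders, the tiny code list is illustrative, and the classification head is untrained.

```python
# Sketch of adapting an encoder model for multi-label ICD-10 coding, in the spirit
# of the study; the checkpoint and the tiny code list are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ICD10_CODES = ["I10", "E11.9", "J44.9"]          # toy label space
MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"   # one of several clinical encoders

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(ICD10_CODES),
    problem_type="multi_label_classification",   # sigmoid + BCE loss during fine-tuning
)

summary = "Discharge summary: hypertension and type 2 diabetes, stable COPD."
inputs = tokenizer(summary, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
predicted = [c for c, p in zip(ICD10_CODES, probs) if p > 0.5]
print(predicted)  # untrained head, so output is meaningless until fine-tuned
```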


Subjects
Electronic Health Records, International Classification of Diseases, Natural Language Processing, Electronic Health Records/classification, Humans, Clinical Coding
6.
Stud Health Technol Inform; 316: 834-838, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176922

ABSTRACT

Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.


Subjects
Clinical Trials as Topic, Information Dissemination, Humans, Natural Language Processing, Electronic Health Records/classification
7.
Stud Health Technol Inform; 315: 368-372, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049285

ABSTRACT

This paper explores the balance between fairness and performance in machine learning classification, predicting the likelihood of a patient receiving antimicrobial treatment using structured data from community nursing wound care electronic health records. The data include two important social-determinants-of-health predictors, gender and language, which we used to evaluate the fairness of the classifiers. At the same time, the impact of various groupings of language codes on classifier performance and fairness is analyzed. The most common statistical learning-based classifiers are evaluated. The findings indicate that while K-Nearest Neighbors offers the best fairness metrics among the different grouping settings, the performance of all classifiers is generally consistent across language code groupings. Also, grouping more variables tends to improve the fairness metrics across all classifiers while maintaining their performance.
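A small sketch of the kind of fairness check described, computing the demographic parity difference of a K-Nearest Neighbors classifier across a binary sensitive attribute; the data are synthetic and the grouping is a stand-in for the language codes studied.

```python
# Sketch of a fairness check: compare positive-prediction rates of a K-Nearest
# Neighbors classifier across a sensitive attribute. All data are synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                       # structured EHR-like features
sensitive = rng.integers(0, 2, size=n)            # e.g. a binary language grouping
y = (X[:, 0] + 0.3 * sensitive + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, y, sensitive, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
pred = clf.predict(X_te)

rates = [pred[s_te == g].mean() for g in (0, 1)]
print("positive rate per group:", rates)
print("demographic parity difference:", abs(rates[0] - rates[1]))
```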


Subjects
Electronic Health Records, Health Equity, Machine Learning, Electronic Health Records/classification, Humans, Social Determinants of Health
8.
J Med Internet Res; 26: e47682, 2024 May 31.
Article in English | MEDLINE | ID: mdl-38820575

ABSTRACT

The health sector is highly digitized, enabling the collection of vast quantities of electronic data about health and well-being. These data are collected by a diverse array of information and communication technologies, including systems used by health care organizations, consumer and community sources such as information collected on the web, and passively collected data from technologies such as wearables and devices. Understanding the breadth of technologies that collect these data, and how the data can be acted upon, is a challenge for the significant portion of the digital health workforce who interact with health data as part of their duties but are not informatics experts. This viewpoint aims to present a taxonomy categorizing common information and communication technologies that collect electronic health data. An initial classification of key information systems collecting electronic health data was undertaken via a rapid review of the literature. Subsequently, a purposeful search of the scholarly and gray literature was undertaken to extract key information about the systems within each category, generate definitions of the systems, and describe their strengths and limitations.


Subjects
Health Information Systems, Humans, Electronic Health Records/classification
9.
J Am Med Inform Assoc; 31(7): 1493-1502, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38742455

ABSTRACT

BACKGROUND: Error analysis plays a crucial role in clinical concept extraction, a fundamental subtask within clinical natural language processing (NLP). The process typically involves a manual review of error types and of the contextual and linguistic factors contributing to their occurrence, along with the identification of underlying causes, to refine the NLP model and improve its performance. Conducting error analysis can be complex, requiring a combination of NLP expertise and domain-specific knowledge. Due to the high heterogeneity of electronic health record (EHR) settings across institutions, standardizing and reproducing the error analysis process can be challenging. OBJECTIVES: This study aims to facilitate a collaborative effort to establish common definitions and taxonomies for capturing diverse error types, fostering community consensus on error analysis for clinical concept extraction tasks. MATERIALS AND METHODS: We iteratively developed and evaluated an error taxonomy based on existing literature, standards, real-world data, multisite case evaluations, and community feedback. The finalized taxonomy was released in both .dtd and .owl formats at the Open Health Natural Language Processing Consortium. The taxonomy is compatible with several open-source annotation tools, including MAE, Brat, and MedTator. RESULTS: The resulting error taxonomy comprises 43 distinct error classes, organized into 6 error dimensions and 4 properties: model type (symbolic and statistical machine learning), evaluation subject (model and human), evaluation level (patient, document, sentence, and concept), and annotation examples. Internal and external evaluations revealed strong variation in error types across methodological approaches, tasks, and EHR settings. Key points emerged from community feedback, including the need to enhance the clarity, generalizability, and usability of the taxonomy, along with dissemination strategies. CONCLUSION: The proposed taxonomy can accelerate and standardize the error analysis process in multi-site settings, thereby improving the provenance, interpretability, and portability of NLP models. Future research could explore automated or semi-automated methods to assist in the classification and standardization of error analysis.
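As a usage sketch, the released .owl taxonomy could be inspected with rdflib as shown below; the file name stands in for a hypothetical local copy of the OHNLP release.

```python
# Sketch of loading the released .owl taxonomy with rdflib and listing its classes;
# "error_taxonomy.owl" is a hypothetical local copy of the OHNLP release.
from rdflib import Graph
from rdflib.namespace import RDF, OWL

g = Graph()
g.parse("error_taxonomy.owl", format="xml")   # OWL releases are typically RDF/XML
for cls in g.subjects(RDF.type, OWL.Class):
    print(cls)                                # IRIs of the error classes
```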


Subjects
Electronic Health Records, Natural Language Processing, Electronic Health Records/classification, Humans, Classification/methods, Medical Errors/classification
10.
IEEE J Biomed Health Inform; 28(7): 4238-4248, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38635388

ABSTRACT

Despite the vast potential for insights and value present in Electronic Health Records (EHRs), it is challenging to fully leverage all the available information, particularly that contained in the free-text data written by clinicians describing the health status of patients. The use of Named Entity Recognition and Linking tools allows not only for structuring the information contained within free-text data, but also for integration with medical ontologies, which may prove highly beneficial for analyzing patient medical histories with the aim of forecasting future medical outcomes, such as the diagnosis of a new disorder. In this paper, we propose MedTKG, a Temporal Knowledge Graph (TKG) framework that incorporates both the dynamic information of patient clinical histories and the static information of medical ontologies. The TKG is used to model a medical history as a series of snapshots at different points in time, effectively capturing the dynamic nature of the patient's health status, while a static graph is used to model the hierarchies of concepts extracted from domain ontologies. The proposed method aims to predict future disorders by identifying missing objects in the quadruple ⟨s, r, ?, t⟩, where s and r denote the patient and the disorder relation type, respectively, and t is the timestamp of the query. The method is evaluated on clinical notes extracted from MIMIC-III and demonstrates the effectiveness of the TKG framework in predicting future disorders and of medical ontologies in improving its performance.
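A toy illustration of the quadruple query form ⟨s, r, ?, t⟩ follows; it is only a frequency baseline over invented facts, not the MedTKG embedding model itself.

```python
# Toy illustration of the quadruple query form <s, r, ?, t>; this is a simple
# frequency baseline over invented facts, not the MedTKG model.
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Quad:
    subject: str   # patient id
    relation: str  # e.g. "diagnosed_with"
    obj: str       # disorder concept
    t: int         # snapshot index / admission time

history = [
    Quad("patient_1", "diagnosed_with", "hypertension", 0),
    Quad("patient_1", "diagnosed_with", "type_2_diabetes", 1),
    Quad("patient_2", "diagnosed_with", "hypertension", 0),
    Quad("patient_2", "diagnosed_with", "chronic_kidney_disease", 2),
]

def predict_object(subject: str, relation: str, t: int, k: int = 3):
    """Answer <s, r, ?, t> by ranking disorders seen with this relation before t."""
    counts = Counter(q.obj for q in history if q.relation == relation and q.t < t)
    return [obj for obj, _ in counts.most_common(k)]

print(predict_object("patient_1", "diagnosed_with", t=3))
```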


Subjects
Biological Ontologies, Electronic Health Records, Humans, Electronic Health Records/classification, Algorithms
11.
Pharmacol Res Perspect; 8(6): e00687, 2020 12.
Article in English | MEDLINE | ID: mdl-33280248

ABSTRACT

Characterizing long-term prescription data is challenging due to the time-varying nature of drug use. Conventional approaches summarize time-varying data into categorical variables based on simple measures, such as cumulative dose, while ignoring patterns of use. The loss of information can lead to misclassification and biased estimates of the exposure-outcome association. We introduce a classification method to characterize longitudinal prescription data with an unsupervised machine learning algorithm. We used administrative databases covering virtually all 1.3 million residents of Manitoba and explicitly designed features to describe the average dose, proportion of days covered (PDC), dose change, and dose variability, and clustered the resulting feature space using K-means clustering. We applied this method to metformin use in diabetes patients. We identified 27,786 metformin users and showed that the feature distributions of their metformin use are stable for varying the lengths of follow-up and that these distributions have clear interpretations. We found six distinct metformin user groups: patients with intermittent use, decreasing dose, increasing dose, high dose, and two medium dose groups (one with stable dose and one with highly variable use). Patients in the varying and decreasing dose groups had a higher chance of progression of diabetes than other patients. The method presented in this paper allows for characterization of drug use into distinct and clinically relevant groups in a way that cannot be obtained from merely classifying use by quantiles of overall use.
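A minimal sketch of the clustering step described: standardize the four engineered features and run K-means with six clusters; the feature matrix here is synthetic rather than derived from administrative data.

```python
# Sketch of the clustering step: standardize four engineered prescription features
# and run K-means with six clusters. The feature matrix is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# columns: average daily dose, proportion of days covered, dose change, dose variability
features = np.column_stack([
    rng.gamma(shape=2.0, scale=500.0, size=5000),   # average dose (mg)
    rng.beta(5, 2, size=5000),                      # PDC in [0, 1]
    rng.normal(0, 200, size=5000),                  # dose change over follow-up
    rng.gamma(shape=1.5, scale=100.0, size=5000),   # dose variability
])

X = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_                              # one user group per patient
print(np.bincount(labels))                           # cluster sizes
```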


Subjects
Databases, Factual/classification, Diabetes Mellitus/drug therapy, Diabetes Mellitus/epidemiology, Electronic Health Records/classification, Hypoglycemic Agents/therapeutic use, Metformin/therapeutic use, Adult, Aged, Algorithms, Dose-Response Relationship, Drug, Female, Follow-Up Studies, Humans, Male, Middle Aged, Ontario/epidemiology, Universal Health Care
12.
J Am Med Inform Assoc; 27(8): 1235-1243, 2020 08 01.
Article in English | MEDLINE | ID: mdl-32548637

ABSTRACT

OBJECTIVE: A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes. MATERIALS AND METHODS: Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities. RESULTS: sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and the relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties. DISCUSSION: sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA's feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes. CONCLUSIONS: sureLDA is well suited to large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.
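For orientation, a plain LDA sketch on a toy code-count matrix is shown below; sureLDA adds surrogate-guided initialization via PheNorm and a clustering-ensemble step, which are not reproduced here.

```python
# Plain LDA sketch on a toy code-count matrix; sureLDA adds surrogate-guided
# priors and an ensemble step on top of this kind of topic model (not shown).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(1)
# rows = patients, columns = counts of codified EHR features (toy data)
counts = rng.poisson(lam=1.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(counts)      # per-patient topic (phenotype) mixtures
print(theta[0].round(3))               # mixture for the first patient
```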


Subjects
Algorithms, Electronic Health Records, Natural Language Processing, Electronic Health Records/classification, Humans, Precision Medicine, ROC Curve, Translational Research, Biomedical
14.
J Am Med Inform Assoc; 27(6): 877-883, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32374408

ABSTRACT

OBJECTIVE: Accurate electronic phenotyping is essential to support collaborative observational research. Supervised machine learning methods can be used to train phenotype classifiers in a high-throughput manner using imperfectly labeled data. We developed 10 phenotype classifiers using this approach and evaluated performance across multiple sites within the Observational Health Data Sciences and Informatics (OHDSI) network. MATERIALS AND METHODS: We constructed classifiers using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE) R-package, an open-source framework for learning phenotype classifiers using datasets in the Observational Medical Outcomes Partnership Common Data Model. We labeled training data based on the presence of multiple mentions of disease-specific codes. Performance was evaluated on cohorts derived using rule-based definitions and real-world disease prevalence. Classifiers were developed and evaluated across 3 medical centers, including 1 international site. RESULTS: Compared to the multiple-mentions labeling heuristic, classifiers showed a mean recall boost of 0.43 with a mean precision loss of 0.17. Performance decreased slightly when classifiers were shared across medical centers, with mean recall and precision decreasing by 0.08 and 0.01, respectively, at a site within the USA, and by 0.18 and 0.10, respectively, at an international site. DISCUSSION AND CONCLUSION: We demonstrate a high-throughput pipeline for constructing and sharing phenotype classifiers across sites within the OHDSI network using APHRODITE. Classifiers exhibited good portability between sites within the USA but limited portability internationally, indicating that classifier generalizability may have geographic limitations; consequently, sharing the classifier-building recipe, rather than the pretrained classifiers, may be more useful for facilitating collaborative observational research.
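The noisy labeling heuristic (multiple mentions of disease-specific codes) can be sketched as follows; the codes and the mention threshold are illustrative, and this is not the APHRODITE implementation itself.

```python
# Sketch of the noisy-labeling heuristic described (multiple mentions of
# disease-specific codes); the code set and threshold are illustrative.
def noisy_label(patient_codes, phenotype_codes, min_mentions=2):
    """Label a patient as a probable case if disease-specific codes appear
    at least `min_mentions` times in their record."""
    mentions = sum(1 for c in patient_codes if c in phenotype_codes)
    return int(mentions >= min_mentions)

t2dm_codes = {"E11.9", "E11.65"}                           # toy ICD-10-CM set
print(noisy_label(["E11.9", "I10", "E11.9"], t2dm_codes))  # -> 1
print(noisy_label(["I10"], t2dm_codes))                    # -> 0
```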


Subjects
Electronic Health Records/classification, Medical Informatics, Supervised Machine Learning, Classification/methods, Data Science, Humans, Observational Studies as Topic
15.
PLoS One; 15(5): e0232840, 2020.
Article in English | MEDLINE | ID: mdl-32396579

ABSTRACT

Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reporting, it is necessary not only to extract information from individual reports, but also to capture aggregate information about the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports and show that incorporating case-level context significantly boosts classification accuracy across six classification tasks: site, subsite, laterality, histology, behavior, and grade. We expect that, with minimal modifications, our add-on can be applied to a wide range of other clinical text-based tasks.


Subjects
Electronic Health Records/classification, Neoplasms/pathology, Histological Techniques, Humans, Natural Language Processing, SEER Program
16.
Nat Commun; 11(1): 2536, 2020 05 21.
Article in English | MEDLINE | ID: mdl-32439869

ABSTRACT

Electronic health records (EHRs) are rich heterogeneous collections of patient health information, whose broad adoption provides clinicians and researchers with unprecedented opportunities for health informatics, disease-risk prediction, actionable clinical recommendations, and precision medicine. However, EHRs present several modeling challenges, including highly sparse data matrices, noisy irregular clinical notes, arbitrary biases in billing code assignment, diagnosis-driven lab tests, and heterogeneous data types. To address these challenges, we present MixEHR, a multi-view Bayesian topic model. We demonstrate MixEHR on MIMIC-III, Mayo Clinic Bipolar Disorder, and Quebec Congenital Heart Disease EHR datasets. Qualitatively, MixEHR disease topics reveal meaningful combinations of clinical features across heterogeneous data types. Quantitatively, we observe superior prediction accuracy of diagnostic codes and lab test imputations compared to state-of-the-art methods. We leverage the inferred patient topic mixtures to classify target diseases and predict mortality of patients in critical conditions. In all comparisons, MixEHR delivers competitive performance and reveals meaningful disease-related topics.


Subjects
Electronic Health Records/classification, Medical Informatics/methods, Bayes Theorem, Databases, Factual, Electronic Health Records/statistics & numerical data, Humans, Machine Learning, Models, Statistical, Phenotype
17.
Anesthesiology; 132(4): 738-749, 2020 04.
Article in English | MEDLINE | ID: mdl-32028374

ABSTRACT

BACKGROUND: Accurate anesthesiology procedure code data are essential to quality improvement, research, and reimbursement tasks within anesthesiology practices. Advanced data science techniques, including machine learning and natural language processing, offer opportunities to develop classification tools for Current Procedural Terminology codes across anesthesia procedures. METHODS: Models were created using a Train/Test dataset including 1,164,343 procedures from 16 academic and private hospitals. Five supervised machine learning models were created to classify anesthesiology Current Procedural Terminology codes, with accuracy defined as the first-choice classification matching the institution-assigned code in the perioperative database. The two best-performing models were further refined and tested on a Holdout dataset from a single institution distinct from Train/Test. A tunable confidence parameter was created to identify cases for which the models were highly accurate, with the goal of at least 95% accuracy, above the reported 2018 Centers for Medicare and Medicaid Services (Baltimore, Maryland) fee-for-service accuracy. Actual submitted claim data from billing specialists were used as a reference standard. RESULTS: The support vector machine and neural network label-embedding attentive models were the best-performing models, demonstrating overall accuracies of 87.9% and 84.2% (single best code), and 96.8% and 94.0% (within top three), respectively. Classification accuracy was 96.4% in 47.0% of cases using the support vector machine and 94.4% in 62.2% of cases using the label-embedding attentive model within the Train/Test dataset. In the Holdout dataset, the respective classification accuracies were 93.1% in 58.0% of cases and 95.0% in 62.0% of cases. The most important feature in model training was procedure text. CONCLUSIONS: Through application of machine learning and natural language processing techniques, highly accurate real-time models were created for anesthesiology Current Procedural Terminology code classification. The increased processing speed and a priori targeted accuracy of this classification approach may provide performance optimization and cost reduction for quality improvement, research, and reimbursement tasks reliant on anesthesiology procedure codes.
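A minimal sketch of the confidence-thresholded classification idea: assign a CPT code only when the model margin clears a tunable cutoff, otherwise route to a human coder. The procedure texts, codes, and threshold value are invented, and this linear SVM on TF-IDF features is only a stand-in for the study's models.

```python
# Sketch of confidence-thresholded CPT classification; texts, codes, and the
# threshold are invented for illustration, not the study's trained models.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["general anesthesia for laparoscopic appendectomy",
         "MAC for cataract extraction, left eye",
         "general anesthesia for total knee arthroplasty",
         "monitored anesthesia care for colonoscopy"]
codes = ["00840", "00142", "01402", "00810"]   # toy anesthesia CPT labels

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, codes)

def classify(text, threshold=0.2):
    """Return a code only when the top decision score exceeds the cutoff."""
    scores = model.decision_function([text])[0]
    best = int(np.argmax(scores))
    if scores[best] < threshold:
        return None                            # defer to a human coder
    return model.classes_[best]

print(classify("general anesthesia for knee replacement"))
```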


Subjects
Current Procedural Terminology, Databases, Factual/classification, Electronic Health Records/classification, Machine Learning/classification, Neural Networks, Computer, Adolescent, Adult, Child, Child, Preschool, Female, Humans, Male, Middle Aged, Young Adult
18.
J Am Med Inform Assoc; 27(2): 244-253, 2020 02 01.
Article in English | MEDLINE | ID: mdl-31617899

ABSTRACT

OBJECTIVES: The ability to identify novel risk factors for health outcomes is a key strength of electronic health record (EHR)-based research. However, the validity of such studies is limited by error in EHR-derived phenotypes. The objective of this study was to develop a novel procedure for reducing bias in estimated associations between risk factors and phenotypes in EHR data. MATERIALS AND METHODS: The proposed method combines the strengths of a gold-standard phenotype obtained through manual chart review for a small validation set of patients and an automatically-derived phenotype that is available for all patients but is potentially error-prone (hereafter referred to as the algorithm-derived phenotype). An augmented estimator of associations is obtained by optimally combining these 2 phenotypes. We conducted simulation studies to evaluate the performance of the augmented estimator and conducted an analysis of risk factors for second breast cancer events using data on a cohort from Kaiser Permanente Washington. RESULTS: The proposed method was shown to reduce bias relative to an estimator using only the algorithm-derived phenotype and reduce variance compared to an estimator using only the validation data. DISCUSSION: Our simulation studies and real data application demonstrate that, compared to the estimator using validation data only, the augmented estimator has lower variance (ie, higher statistical efficiency). Compared to the estimator using error-prone EHR-derived phenotypes, the augmented estimator has smaller bias. CONCLUSIONS: The proposed estimator can effectively combine an error-prone phenotype with gold-standard data from a limited chart review in order to improve analyses of risk factors using EHR data.


Subjects
Algorithms, Electronic Health Records/classification, Bias, Data Warehousing, Humans
19.
J Am Med Inform Assoc; 27(1): 119-126, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31722396

ABSTRACT

OBJECTIVE: Phenotyping patients using electronic health record (EHR) data conventionally requires labeled cases and controls. Assigning labels requires manual medical chart review and therefore is labor intensive. For some phenotypes, identifying gold-standard controls is prohibitive. We developed an accurate EHR phenotyping approach that does not require labeled controls. MATERIALS AND METHODS: Our framework relies on a random subset of cases, which can be specified using an anchor variable that has excellent positive predictive value and sensitivity independent of predictors. We proposed a maximum likelihood approach that efficiently leverages data from the specified cases and unlabeled patients to develop logistic regression phenotyping models, and compare model performance with existing algorithms. RESULTS: Our method outperformed the existing algorithms on predictive accuracy in Monte Carlo simulation studies, application to identify hypertension patients with hypokalemia requiring oral supplementation using a simulated anchor, and application to identify primary aldosteronism patients using real-world cases and anchor variables. Our method additionally generated consistent estimates of 2 important parameters, phenotype prevalence and the proportion of true cases that are labeled. DISCUSSION: Upon identification of an anchor variable that is scalable and transferable to different practices, our approach should facilitate development of scalable, transferable, and practice-specific phenotyping models. CONCLUSIONS: Our proposed approach enables accurate semiautomated EHR phenotyping with minimal manual labeling and therefore should greatly facilitate EHR clinical decision support and research.
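A generic positive-unlabeled sketch in the spirit of anchor-based phenotyping (an Elkan-Noto-style correction) is shown below; this is not the authors' maximum likelihood estimator, and the data and anchor are synthetic.

```python
# Generic positive-unlabeled sketch in the spirit of anchor-based phenotyping
# (Elkan-Noto style correction); this is NOT the authors' maximum likelihood
# estimator, and all data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
true_case = (X[:, 0] + X[:, 1] > 1).astype(int)          # latent phenotype
anchor = true_case * rng.binomial(1, 0.3, size=n)        # anchor fires only in true cases

# Step 1: model P(anchor = 1 | X) using anchor-labeled vs unlabeled patients
g = LogisticRegression().fit(X, anchor)

# Step 2: estimate c = P(anchor = 1 | true case) on anchor-positive patients,
# then rescale to approximate P(true case | X)
c = g.predict_proba(X[anchor == 1])[:, 1].mean()
phenotype_prob = np.clip(g.predict_proba(X)[:, 1] / c, 0, 1)

print("estimated prevalence:", phenotype_prob.mean().round(3),
      "true prevalence:", true_case.mean().round(3))
```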


Subjects
Algorithms, Electronic Health Records/classification, Likelihood Functions, Humans, Monte Carlo Method
20.
Comput Methods Programs Biomed; 188: 105264, 2020 May.
Article in English | MEDLINE | ID: mdl-31851906

ABSTRACT

BACKGROUND AND OBJECTIVE: This work deals with clinical text mining, a field of natural language processing applied to biomedical informatics. The aim is to classify electronic health records with respect to the International Classification of Diseases, which is the foundation for the identification of international health statistics and the standard for reporting diseases and health conditions. Within the framework of data mining, the goal is multi-label classification, as each health record is assigned multiple International Classification of Diseases codes. We investigate five deep learning architectures on a dataset obtained from the Basque Country Health System, and six different perspectives derived from shifts in the input and the output. METHODS: We evaluate a feed-forward neural network as the baseline and several recurrent models based on the bidirectional GRU architecture, focusing on the text representation layer and testing three variants, from standard word embeddings to meta word embedding techniques and contextual embeddings. RESULTS: The results showed that the recurrent models outperform the non-recurrent model. The meta word embedding techniques beat the standard word embeddings, but the contextual embeddings prove the most robust for the downstream task overall. Additionally, label granularity alone has an impact on classification performance. CONCLUSIONS: The contributions of this work are (a) a comparison of five deep learning classification approaches on a Spanish dataset for the multi-label health text classification problem; (b) a study of the impact of document length and of label-set size and granularity in the multi-label context; and (c) a study of measures to mitigate multi-label text classification problems related to label-set size and sparseness.
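One recurrent configuration of the kind described (a bidirectional GRU over word embeddings with a sigmoid multi-label head) can be sketched as follows; vocabulary size, sequence length, embedding dimension, and label count are placeholders.

```python
# Sketch of a Bidirectional GRU multi-label classifier of the kind evaluated;
# vocabulary size, sequence length, and label count are placeholders.
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, NUM_CODES = 20000, 512, 200   # illustrative dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 300, mask_zero=True),   # word embedding layer
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128)),
    tf.keras.layers.Dense(NUM_CODES, activation="sigmoid"),       # one unit per ICD code
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
model.summary()
```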


Subjects
Deep Learning, Electronic Health Records/classification, Medical Informatics, Pattern Recognition, Automated, Algorithms, Computer Graphics, Data Mining, Humans, International Classification of Diseases, Natural Language Processing, Neural Networks, Computer, Software, Spain