Búsqueda | Portal de Búsqueda de la BVS España

1.

An Extensible Evaluation Framework Applied to Clinical Text Deidentification Natural Language Processing Tools: Multisystem and Multicorpus Study.

Heider, Paul M; Meystre, Stéphane M.

J Med Internet Res ; 26: e55676, 2024 May 28.

Artículo en Inglés | MEDLINE | ID: mdl-38805692

RESUMEN

BACKGROUND: Clinical natural language processing (NLP) researchers need access to directly comparable evaluation results for applications such as text deidentification across a range of corpus types and the means to easily test new systems or corpora within the same framework. Current systems, reported metrics, and the personally identifiable information (PII) categories evaluated are not easily comparable. OBJECTIVE: This study presents an open-source and extensible end-to-end framework for comparing clinical NLP system performance across corpora even when the annotation categories do not align. METHODS: As a use case for this framework, we use 6 off-the-shelf text deidentification systems (ie, CliniDeID, deid from PhysioNet, MITRE Identity Scrubber Toolkit [MIST], NeuroNER, National Library of Medicine [NLM] Scrubber, and Philter) across 3 standard clinical text corpora for the task (2 of which are publicly available) and 1 private corpus (all in English), with annotation categories that are not directly analogous. The framework is built on shell scripts that can be extended to include new systems, corpora, and performance metrics. We present this open tool, multiple means for aligning PII categories during evaluation, and our initial timing and performance metric findings. Code for running this framework with all settings needed to run all pairs are available via Codeberg and GitHub. RESULTS: From this case study, we found large differences in processing speed between systems. The fastest system (ie, MIST) processed an average of 24.57 (SD 26.23) notes per second, while the slowest (ie, CliniDeID) processed an average of 1.00 notes per second. No system uniformly outperformed the others at identifying PII across corpora and categories. Instead, a rich tapestry of performance trade-offs emerged for PII categories. CliniDeID and Philter prioritize recall over precision (with an average recall 6.9 and 11.2 points higher, respectively, for partially matching spans of text matching any PII category), while the other 4 systems consistently have higher precision (with MIST's precision scoring 20.2 points higher, NLM Scrubber scoring 4.4 points higher, NeuroNER scoring 7.2 points higher, and deid scoring 17.1 points higher). The macroaverage recall across corpora for identifying names, one of the more sensitive PII categories, included deid (48.8%) and MIST (66.9%) at the low end and NeuroNER (84.1%), NLM Scrubber (88.1%), and CliniDeID (95.9%) at the high end. A variety of metrics across categories and corpora are reported with a wider variety (eg, F2-score) available via the tool. CONCLUSIONS: NLP systems in general and deidentification systems and corpora in our use case tend to be evaluated in stand-alone research articles that only include a limited set of comparators. We hold that a single evaluation pipeline across multiple systems and corpora allows for more nuanced comparisons. Our open pipeline should reduce barriers to evaluation and system advancement.

Asunto(s)

Procesamiento de Lenguaje Natural

2.

Piloting an automated clinical trial eligibility surveillance and provider alert system based on artificial intelligence and standard data models.

Meystre, Stéphane M; Heider, Paul M; Cates, Andrew; Bastian, Grace; Pittman, Tara; Gentilin, Stephanie; Kelechi, Teresa J.

BMC Med Res Methodol ; 23(1): 88, 2023 04 11.

Artículo en Inglés | MEDLINE | ID: mdl-37041475

RESUMEN

BACKGROUND: To advance new therapies into clinical care, clinical trials must recruit enough participants. Yet, many trials fail to do so, leading to delays, early trial termination, and wasted resources. Under-enrolling trials make it impossible to draw conclusions about the efficacy of new therapies. An oft-cited reason for insufficient enrollment is lack of study team and provider awareness about patient eligibility. Automating clinical trial eligibility surveillance and study team and provider notification could offer a solution. METHODS: To address this need for an automated solution, we conducted an observational pilot study of our TAES (TriAl Eligibility Surveillance) system. We tested the hypothesis that an automated system based on natural language processing and machine learning algorithms could detect patients eligible for specific clinical trials by linking the information extracted from trial descriptions to the corresponding clinical information in the electronic health record (EHR). To evaluate the TAES information extraction and matching prototype (i.e., TAES prototype), we selected five open cardiovascular and cancer trials at the Medical University of South Carolina and created a new reference standard of 21,974 clinical text notes from a random selection of 400 patients (including at least 100 enrolled in the selected trials), with a small subset of 20 notes annotated in detail. We also developed a simple web interface for a new database that stores all trial eligibility criteria, corresponding clinical information, and trial-patient match characteristics using the Observational Medical Outcomes Partnership (OMOP) common data model. Finally, we investigated options for integrating an automated clinical trial eligibility system into the EHR and for notifying health care providers promptly of potential patient eligibility without interrupting their clinical workflow. RESULTS: Although the rapidly implemented TAES prototype achieved only moderate accuracy (recall up to 0.778; precision up to 1.000), it enabled us to assess options for integrating an automated system successfully into the clinical workflow at a healthcare system. CONCLUSIONS: Once optimized, the TAES system could exponentially enhance identification of patients potentially eligible for clinical trials, while simultaneously decreasing the burden on research teams of manual EHR review. Through timely notifications, it could also raise physician awareness of patient eligibility for clinical trials.

Asunto(s)

Inteligencia Artificial , Procesamiento de Lenguaje Natural , Humanos , Proyectos Piloto , Selección de Paciente , Aprendizaje Automático

3.

Representing and utilizing clinical textual data for real world studies: An OHDSI approach.

Keloth, Vipina K; Banda, Juan M; Gurley, Michael; Heider, Paul M; Kennedy, Georgina; Liu, Hongfang; Liu, Feifan; Miller, Timothy; Natarajan, Karthik; V Patterson, Olga; Peng, Yifan; Raja, Kalpana; Reeves, Ruth M; Rouhizadeh, Masoud; Shi, Jianlin; Wang, Xiaoyan; Wang, Yanshan; Wei, Wei-Qi; Williams, Andrew E; Zhang, Rui; Belenkaya, Rimma; Reich, Christian; Blacketer, Clair; Ryan, Patrick; Hripcsak, George; Elhadad, Noémie; Xu, Hua.

J Biomed Inform ; 142: 104343, 2023 06.

Artículo en Inglés | MEDLINE | ID: mdl-36935011

RESUMEN

Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Informatics (OHDSI) consortium was established to develop methods and tools to promote the use of textual data and NLP in real-world observational studies. In this paper, we describe a framework for representing and utilizing textual data in real-world evidence generation, including representations of information from clinical text in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the workflow and tools that were developed to extract, transform and load (ETL) data from clinical notes into tables in OMOP CDM, as well as current applications and specific use cases of the proposed OHDSI NLP solution at large consortia and individual institutions with English textual data. Challenges faced and lessons learned during the process are also discussed to provide valuable insights for researchers who are planning to implement NLP solutions in real-world studies.

Asunto(s)

Ciencia de los Datos , Informática Médica , Humanos , Registros Electrónicos de Salud , Procesamiento de Lenguaje Natural , Narración

4.

A Natural Language Processing Tool Offering Data Extraction for COVID-19 Related Information (DECOVRI).

Heider, Paul M; Pipaliya, Ronak M; Meystre, Stéphane M.

Stud Health Technol Inform ; 290: 1062-1063, 2022 Jun 06.

Artículo en Inglés | MEDLINE | ID: mdl-35673206

RESUMEN

A new natural language processing (NLP) application for COVID-19 related information extraction from clinical text notes is being developed as part of our pandemic response efforts. This NLP application called DECOVRI (Data Extraction for COVID-19 Related Information) will be released as a free and open source tool to convert unstructured notes into structured data within an OMOP CDM-based ecosystem. The DECOVRI prototype is being continuously improved and will be released early (beta) and in a full version.

Asunto(s)

COVID-19 , Procesamiento de Lenguaje Natural , Ecosistema , Registros Electrónicos de Salud , Humanos , Almacenamiento y Recuperación de la Información , Pandemias

5.

Comparing Multiple Models for Section Header Classification with Feature Evaluation.

Pipaliya, Ronak; Heider, Paul M; Meystre, Stéphane M.

Stud Health Technol Inform ; 290: 1064-1065, 2022 Jun 06.

Artículo en Inglés | MEDLINE | ID: mdl-35673207

RESUMEN

We present on the performance evaluation of machine learning (ML) and Natural Language Processing (NLP) based Section Header classification. The section headers classification task was performed as a two-pass system. The first pass detects a section header while the second pass classifies it. Recall, precision, and F1-measure metrics were reported to explore the best approach for ML based section header classification for use in downstream NLP tasks.

Asunto(s)

Aprendizaje Automático , Procesamiento de Lenguaje Natural

6.

A Hybrid Model for Family History Information Identification and Relation Extraction: Development and Evaluation of an End-to-End Information Extraction System.

Kim, Youngjun; Heider, Paul M; Lally, Isabel Rh; Meystre, Stéphane M.

JMIR Med Inform ; 9(4): e22797, 2021 Apr 22.

Artículo en Inglés | MEDLINE | ID: mdl-33885370

RESUMEN

BACKGROUND: Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 National Natural Language Processing Clinical Challenge (n2c2)/Open Health Natural Language Processing (OHNLP) shared task. OBJECTIVE: This task involves identifying mentions of family members and observations in electronic health record text notes and recognizing the 2 types of relations (family member-living status relations and family member-observation relations). Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of 2 subtasks by exploiting additional labeled data and clinical text-based embedding models. METHODS: We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained 2 relation extraction models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of postchallenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these 2 datasets. We also pretrained language models using clinical notes from the Medical Information Mart for Intensive Care (MIMIC-III) clinical database. RESULTS: The voting ensemble achieved better performance than individual classifiers. In the entity identification task, our top-performing system reached a precision of 78.90% and a recall of 83.84%. Our natural language processing system for entity identification took 3rd place out of 17 teams in the challenge. We ranked 4th out of 9 teams in the relation extraction task. Our system substantially benefited from the combination of the 2 datasets. Compared to our official submission with F1 scores of 81.30% and 64.94% for entity identification and relation extraction, respectively, the revised system yielded significantly better performance (P<.05) with F1 scores of 86.02% and 72.48%, respectively. CONCLUSIONS: We demonstrated that a hybrid model could be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach to entity identification as a sequence labeling problem produced satisfactory results. Our postchallenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.

7.

Natural language processing enabling COVID-19 predictive analytics to support data-driven patient advising and pooled testing.

Meystre, Stéphane M; Heider, Paul M; Kim, Youngjun; Davis, Matthew; Obeid, Jihad; Madory, James; Alekseyenko, Alexander V.

J Am Med Inform Assoc ; 29(1): 12-21, 2021 12 28.

Artículo en Inglés | MEDLINE | ID: mdl-34415311

RESUMEN

OBJECTIVE: The COVID-19 (coronavirus disease 2019) pandemic response at the Medical University of South Carolina included virtual care visits for patients with suspected severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. The telehealth system used for these visits only exports a text note to integrate with the electronic health record, but structured and coded information about COVID-19 (eg, exposure, risk factors, symptoms) was needed to support clinical care and early research as well as predictive analytics for data-driven patient advising and pooled testing. MATERIALS AND METHODS: To capture COVID-19 information from multiple sources, a new data mart and a new natural language processing (NLP) application prototype were developed. The NLP application combined reused components with dictionaries and rules crafted by domain experts. It was deployed as a Web service for hourly processing of new data from patients assessed or treated for COVID-19. The extracted information was then used to develop algorithms predicting SARS-CoV-2 diagnostic test results based on symptoms and exposure information. RESULTS: The dedicated data mart and NLP application were developed and deployed in a mere 10-day sprint in March 2020. The NLP application was evaluated with good accuracy (85.8% recall and 81.5% precision). The SARS-CoV-2 testing predictive analytics algorithms were configured to provide patients with data-driven COVID-19 testing advices with a sensitivity of 81% to 92% and to enable pooled testing with a negative predictive value of 90% to 91%, reducing the required tests to about 63%. CONCLUSIONS: SARS-CoV-2 testing predictive analytics and NLP successfully enabled data-driven patient advising and pooled testing.

Asunto(s)

COVID-19 , Prueba de COVID-19 , Humanos , Procesamiento de Lenguaje Natural , Pandemias , SARS-CoV-2

8.

Comparative Study of Various Approaches for Ensemble-based De-identification of Electronic Health Record Narratives.

Kim, Youngjun; Heider, Paul M; Meystre, Stéphane M.

AMIA Annu Symp Proc ; 2020: 648-657, 2020.

Artículo en Inglés | MEDLINE | ID: mdl-33936439

RESUMEN

De-identification of electric health record narratives is a fundamental task applying natural language processing to better protect patient information privacy. We explore different types of ensemble learning methods to improve clinical text de-identification. We present two ensemble-based approaches for combining multiple predictive models. The first method selects an optimal subset of de-identification models by greedy exclusion. This ensemble pruning allows one to save computational time or physical resources while achieving similar or better performance than the ensemble of all members. The second method uses a sequence of words to train a sequential model. For this sequence labelling-based stacked ensemble, we employ search-based structured prediction and bidirectional long short-term memory algorithms. We create ensembles consisting of de-identification models trained on two clinical text corpora. Experimental results show that our ensemble systems can effectively integrate predictions from individual models and offer better generalization across two different corpora.

Asunto(s)

Registros Electrónicos de Salud , Algoritmos , Confidencialidad , Anonimización de la Información , Humanos , Narración , Procesamiento de Lenguaje Natural , Privacidad

9.

A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools.

Heider, Paul M; Obeid, Jihad S; Meystre, Stéphane M.

AMIA Jt Summits Transl Sci Proc ; 2020: 241-250, 2020.

Artículo en Inglés | MEDLINE | ID: mdl-32477643

RESUMEN

A growing quantity of health data is being stored in Electronic Health Records (EHR). The free-text section of these clinical notes contains important patient and treatment information for research but also contains Personally Identifiable Information (PII), which cannot be freely shared within the research community without compromising patient confidentiality and privacy rights. Significant work has been invested in investigating automated approaches to text de-identification, the process of removing or redacting PII. Few studies have examined the performance of existing de-identification pipelines in a controlled comparative analysis. In this study, we use publicly available corpora to analyze speed and accuracy differences between three de-identification systems that can be run off-the-shelf: Amazon Comprehend Medical PHId, Clinacuity's CliniDeID, and the National Library of Medicine's Scrubber. No single system dominated all the compared metrics. NLM Scrubber was the fastest while CliniDeID generally had the highest accuracy.

10.

An artificial intelligence approach to COVID-19 infection risk assessment in virtual visits: A case report.

Obeid, Jihad S; Davis, Matthew; Turner, Matthew; Meystre, Stephane M; Heider, Paul M; O'Bryan, Edward C; Lenert, Leslie A.

J Am Med Inform Assoc ; 27(8): 1321-1325, 2020 08 01.

Artículo en Inglés | MEDLINE | ID: mdl-32449766

RESUMEN

OBJECTIVE: In an effort to improve the efficiency of computer algorithms applied to screening for coronavirus disease 2019 (COVID-19) testing, we used natural language processing and artificial intelligence-based methods with unstructured patient data collected through telehealth visits. MATERIALS AND METHODS: After segmenting and parsing documents, we conducted analysis of overrepresented words in patient symptoms. We then developed a word embedding-based convolutional neural network for predicting COVID-19 test results based on patients' self-reported symptoms. RESULTS: Text analytics revealed that concepts such as smell and taste were more prevalent than expected in patients testing positive. As a result, screening algorithms were adapted to include these symptoms. The deep learning model yielded an area under the receiver-operating characteristic curve of 0.729 for predicting positive results and was subsequently applied to prioritize testing appointment scheduling. CONCLUSIONS: Informatics tools such as natural language processing and artificial intelligence methods can have significant clinical impacts when applied to data streams early in the development of clinical systems for outbreak response.

Asunto(s)

Inteligencia Artificial , Infecciones por Coronavirus/diagnóstico , Procesamiento de Lenguaje Natural , Neumonía Viral/diagnóstico , Telemedicina , Algoritmos , Betacoronavirus , COVID-19 , Prueba de COVID-19 , Técnicas de Laboratorio Clínico , Aprendizaje Profundo , Registros Electrónicos de Salud , Humanos , Redes Neurales de la Computación , Estudios de Casos Organizacionales , Pandemias , Curva ROC , Medición de Riesgo , SARS-CoV-2 , South Carolina

11.

Patient-Pivoted Automated Trial Eligibility Pipeline: The First of Three Phases in a Modular Architecture.

Heider, Paul M; Meystre, Stéphane M.

Stud Health Technol Inform ; 264: 1476-1477, 2019 Aug 21.

Artículo en Inglés | MEDLINE | ID: mdl-31438189

RESUMEN

Automated extraction of patient trial eligibility for clinical research studies can increase enrollment at a decreased time and money cost. We have developed a modular trial eligibility pipeline including patient-batched processing and an internal webservice backed by a uimaFIT pipeline as part of a multi-phase approach to include note-batched processing, the ability to query trials matching patients or patients matching trials, and an external alignment engine to connect patients to trials.

Asunto(s)

Determinación de la Elegibilidad , Costos y Análisis de Costo , Humanos , Selección de Paciente

12.

Automatic trial eligibility surveillance based on unstructured clinical data.

Meystre, Stéphane M; Heider, Paul M; Kim, Youngjun; Aruch, Daniel B; Britten, Carolyn D.

Int J Med Inform ; 129: 13-19, 2019 09.

Artículo en Inglés | MEDLINE | ID: mdl-31445247

RESUMEN

INTRODUCTION: Insufficient patient enrollment in clinical trials remains a serious and costly problem and is often considered the most critical issue to solve for the clinical trials community. In this project, we assessed the feasibility of automatically detecting a patient's eligibility for a sample of breast cancer clinical trials by mapping coded clinical trial eligibility criteria to the corresponding clinical information automatically extracted from text in the EHR. METHODS: Three open breast cancer clinical trials were selected by oncologists. Their eligibility criteria were manually abstracted from trial descriptions using the OHDSI ATLAS web application. Patients enrolled or screened for these trials were selected as 'positive' or 'possible' cases. Other patients diagnosed with breast cancer were selected as 'negative' cases. A selection of the clinical data and all clinical notes of these 229 selected patients was extracted from the MUSC clinical data warehouse and stored in a database implementing the OMOP common data model. Eligibility criteria were extracted from clinical notes using either manually crafted pattern matching (regular expressions) or a new natural language processing (NLP) application. These extracted criteria were then compared with reference criteria from trial descriptions. This comparison was realized with three different versions of a new application: rule-based, cosine similarity-based, and machine learning-based. RESULTS: For eligibility criteria extraction from clinical notes, the machine learning-based NLP application allowed for the highest accuracy with a micro-averaged recall of 90.9% and precision of 89.7%. For trial eligibility determination, the highest accuracy was reached by the machine learning-based approach with a per-trial AUC between 75.5% and 89.8%. CONCLUSION: NLP can be used to extract eligibility criteria from EHR clinical notes and automatically discover patients possibly eligible for a clinical trial with good accuracy, which could be leveraged to reduce the workload of humans screening patients for trials.

Asunto(s)

Determinación de la Elegibilidad , Automatización , Neoplasias de la Mama , Data Warehousing , Bases de Datos Factuales , Femenino , Humanos , Aprendizaje Automático , Masculino , Persona de Mediana Edad , Procesamiento de Lenguaje Natural , Selección de Paciente , Carga de Trabajo

13.

Impact of De-Identification on Clinical Text Classification Using Traditional and Deep Learning Classifiers.

Obeid, Jihad S; Heider, Paul M; Weeda, Erin R; Matuskowitz, Andrew J; Carr, Christine M; Gagnon, Kevin; Crawford, Tami; Meystre, Stephane M.

Stud Health Technol Inform ; 264: 283-287, 2019 Aug 21.

Artículo en Inglés | MEDLINE | ID: mdl-31437930

RESUMEN

Clinical text de-identification enables collaborative research while protecting patient privacy and confidentiality; however, concerns persist about the reduction in the utility of the de-identified text for information extraction and machine learning tasks. In the context of a deep learning experiment to detect altered mental status in emergency department provider notes, we tested several classifiers on clinical notes in their original form and on their automatically de-identified counterpart. We tested both traditional bag-of-words based machine learning models as well as word-embedding based deep learning models. We evaluated the models on 1,113 history of present illness notes. A total of 1,795 protected health information tokens were replaced in the de-identification process across all notes. The deep learning models had the best performance with accuracies of 95% on both original and de-identified notes. However, there was no significant difference in the performance of any of the models on the original vs. the de-identified notes.

Asunto(s)

Anonimización de la Información , Aprendizaje Profundo , Confidencialidad , Registros Electrónicos de Salud , Humanos , Aprendizaje Automático

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA