Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center.
Badalotti, Davide; Agrawal, Akanksha; Pensato, Umberto; Angelotti, Giovanni; Marcheselli, Simona.
Affiliation
  • Badalotti D; Department of Computing Sciences, Bocconi University, Milano, Italy; Artificial Intelligence Center, Humanitas Clinical and Research Center - IRCCS, Via A. Manzoni 56, Rozzano 20089, Milan, Italy. Electronic address: d.badalotti@campus.unimib.it.
  • Agrawal A; Department of Biomedical Sciences, Humanitas University, via Rita Levi Montalcini 4, 20072 Pieve Emanuele, Milan, Italy.
  • Pensato U; Department of Biomedical Sciences, Humanitas University, via Rita Levi Montalcini 4, 20072 Pieve Emanuele, Milan, Italy; IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano, Milan, Italy.
  • Angelotti G; Artificial Intelligence Center, Humanitas Clinical and Research Center - IRCCS, Via A. Manzoni 56, Rozzano 20089, Milan, Italy.
  • Marcheselli S; IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano, Milan, Italy.
Int J Med Inform ; 192: 105626, 2024 Sep 19.
Article in En | MEDLINE | ID: mdl-39321491
ABSTRACT

INTRODUCTION:

Data collection often relies on time-consuming manual inputs, with a vast amount of information embedded in unstructured texts such as patients' medical records and clinical notes. Our study aims to develop a pipeline that combines active learning (AL) and NLP techniques to enhance data extraction in an acute ischemic stroke cohort.

MATERIALS AND METHODS:

Consecutive acute ischemic stroke patients who received reperfusion therapies at IRCCS Humanitas Research Hospital were included. The Italian NLP Bidirectional Encoder Representations from Transformers (BERT) model was trained with AL to automatically extract clinical variables from electronic health text. Simulated active learning performance was evaluated on a set of labels representing patients' comorbidities, comparing Bayesian Uncertainty Sampling by Disagreement (BALD) and random text selection. Prognostic models predicting patients' functional outcomes using Gradient Boosting were trained on manually labelled and semi-automatically extracted data, and their performance was compared.
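
As an illustration of the selection step described above, the following is a minimal sketch of a BALD-style acquisition score, not the authors' implementation. It assumes that class probabilities for each unlabelled text are available from several stochastic forward passes of the classifier (for example, with dropout left active at inference); the function names `bald_scores` and `select_batch` and the array layout are hypothetical. The score is the entropy of the averaged prediction minus the average entropy of the individual predictions, so it is high when the passes disagree and near zero when they agree, even if they agree on an uncertain answer.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) along the class axis."""
    return -np.sum(p * np.log(p + eps), axis=axis)


def bald_scores(probs):
    """BALD score for each unlabelled text.

    probs: array of shape (T, N, C) holding class probabilities from
    T stochastic forward passes over N texts with C classes
    (this layout is an assumption made for the sketch).
    """
    mean_probs = probs.mean(axis=0)                 # (N, C)
    predictive_entropy = entropy(mean_probs)        # (N,)
    expected_entropy = entropy(probs).mean(axis=0)  # (N,)
    return predictive_entropy - expected_entropy


def select_batch(probs, k):
    """Indices of the k texts with the highest BALD score."""
    return np.argsort(-bald_scores(probs))[:k]
```

The random-selection baseline mentioned above would simply draw `k` indices uniformly from the unlabelled pool instead of ranking by score.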

RESULTS:

The active learning process initially showed null performance until around 20% of texts were labelled, possibly due to root layers freezing in the BERT model. Overall, however, active learning improved model learning efficiency across most comorbidities. Prognostic modelling showed no significant difference in performance between models trained on manually labelled versus semi-automatically extracted data, indicating effective prediction capabilities in both settings.

CONCLUSIONS:

We developed an efficient language model to automate the extraction of clinical data from Italian unstructured health texts in a cohort of ischemic stroke patients. In a preliminary analysis, we demonstrated its potential applicability for enhancing prediction model accuracy.

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article