Identification of patients' smoking status using an explainable AI approach: a Danish electronic health records case study.

Ebrahimi, Ali; Henriksen, Margrethe Bang Høstgaard; Brasen, Claus Lohman; Hilberg, Ole; Hansen, Torben Frøstrup; Jensen, Lars Henrik; Peimankar, Abdolrahman; Wiil, Uffe Kock

Affiliation

Ebrahimi A; SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, 5230, Denmark. aleb@mmmi.sdu.dk.
Henriksen MBH; Department of Oncology, Lillebaelt Hospital, University Hospital of Southern Denmark, Vejle, 7100, Denmark.
Brasen CL; Department of Biochemistry and Immunology, Lillebaelt Hospital, University Hospital of Southern Denmark, Vejle, 7100, Denmark.
Hilberg O; Institute of Regional Health Research, University of Southern Denmark, Odense, Denmark.
Hansen TF; Institute of Regional Health Research, University of Southern Denmark, Odense, Denmark.
Jensen LH; Department of Internal Medicine, Lillebaelt Hospital, University Hospital of Southern Denmark, Vejle, 7100, Denmark.
Peimankar A; Department of Oncology, Lillebaelt Hospital, University Hospital of Southern Denmark, Vejle, 7100, Denmark.
Wiil UK; Institute of Regional Health Research, University of Southern Denmark, Odense, Denmark.

BMC Med Res Methodol; 24(1): 114, 2024 May 17.

Article in English | MEDLINE | ID: mdl-38760718

ABSTRACT

BACKGROUND:

Smoking is a critical risk factor responsible for over eight million deaths worldwide each year. Information on smoking habits is essential for advancing research and implementing preventive measures such as screening of high-risk individuals. In most countries, including Denmark, smoking habits are not systematically recorded and are at best documented in unstructured free-text segments of electronic health records (EHRs). Researchers and clinicians must therefore navigate large amounts of unstructured data manually, which is one of the main reasons smoking habits are rarely integrated into larger studies. Our aim is to develop machine learning models to classify patients' smoking status from their EHRs.

METHODS:

This study proposes an efficient natural language processing (NLP) pipeline capable of classifying patients' smoking status and providing explanations for its decisions. The proposed NLP pipeline comprises four components: (1) preprocessing to handle abbreviations, punctuation, and other textual irregularities; (2) feature extraction using four techniques, i.e. Embedding, BERT, Word2Vec, and Count Vectorizer; (3) a Stacking-based Ensemble (SE) model and a Convolutional Long Short-Term Memory Neural Network (CNN-LSTM) for identifying smoking status; and (4) local interpretable model-agnostic explanations (LIME) to explain the decisions of the detection models. The EHRs of 23,132 patients with suspected lung cancer were collected from the Region of Southern Denmark for the period from 1 January 2009 to 31 December 2018. A medical professional annotated the data as 'Smoker' or 'Non-Smoker', with further classification as 'Active-Smoker', 'Former-Smoker', or 'Never-Smoker'. The annotated dataset was then used to develop binary and multiclass classification models, and detection performance was compared extensively across model architectures.
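The first two pipeline components (preprocessing and count-based feature extraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abbreviation map, the English example notes, and the function names `preprocess`, `build_vocabulary`, and `count_vector` are all hypothetical, and the study's actual preprocessing rules are not given in the abstract.

```python
import re
from collections import Counter

# Hypothetical abbreviation map; the study's real list is not in the abstract.
ABBREVIATIONS = {"pt": "patient", "cig": "cigarettes", "yrs": "years"}


def preprocess(note):
    """Lowercase a free-text note, drop punctuation, expand abbreviations."""
    # \w+ keeps Unicode letters (e.g. Danish ae/oe/aa characters) and digits.
    tokens = re.findall(r"\w+", note.lower())
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}


def count_vector(tokens, vocab):
    """Bag-of-words count vector for one note; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


# Made-up example notes, for illustration only.
notes = ["Pt. smokes 10 cig/day.", "Never smoked; no cig use."]
tokenized = [preprocess(n) for n in notes]
vocab = build_vocabulary(tokenized)
vectors = [count_vector(t, vocab) for t in tokenized]
```

Vectors of this kind are what a count-based feature extractor would pass to a downstream classifier; the embedding-based techniques named above replace the count step with learned dense representations.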

RESULTS:

The results of the experimental validation confirm the consistency among the models. For binary classification, however, the BERT method with the CNN-LSTM architecture outperformed the other models, achieving precision, recall, and F1-scores between 97% and 99% for both Never-Smokers and Active-Smokers. In multiclass classification, the Embedding technique with the CNN-LSTM architecture yielded the best results in class-specific evaluations, with equal performance measures of 97% for Never-Smoker and measures in the range of 86 to 89% for Active-Smoker and 91-92% for Never-Smoker.
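The class-specific measures reported above (precision, recall, and F1-score) follow the standard one-vs-rest definitions. The sketch below shows how they could be computed in Python; the function name `per_class_metrics` and the toy labels are hypothetical and are not data from the study.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treated one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy annotations and predictions, invented for illustration only.
y_true = ["Active", "Active", "Former", "Never", "Never", "Former"]
y_pred = ["Active", "Former", "Former", "Never", "Never", "Former"]

scores = {
    label: per_class_metrics(y_true, y_pred, label)
    for label in ("Active", "Former", "Never")
}
```

In this toy example one Active-Smoker note is misclassified as Former-Smoker, which lowers recall for the Active class and precision for the Former class while leaving the Never class unaffected.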

CONCLUSION:

Our proposed NLP pipeline achieved a high level of classification performance. In addition, we presented the explanation of the decision made by the best performing detection model. Future work will expand the model's capabilities to analyze longer notes and a broader range of categories to maximize its utility in further research and screening applications.

Subjects

Electronic Health Records; Natural Language Processing; Smoking; Humans; Denmark/epidemiology; Electronic Health Records/statistics & numerical data; Smoking/epidemiology; Machine Learning; Female; Male; Middle Aged; Neural Networks, Computer

Keywords

CNN; Deep learning; Electronic health record; Explainable Artificial Intelligence (XAI); LSTM; Natural language processing; Smoking status; Stacking-based ensemble; Text classification

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Natural Language Processing / Smoking / Electronic Health Records Limits: Female / Humans / Male / Middle aged Country/Region as subject: Europe Language: En Journal: BMC Med Res Methodol Journal subject: MEDICINE Year of publication: 2024 Document type: Article Country of affiliation: Denmark Country of publication: United Kingdom
