Comparing information extraction techniques for low-prevalence concepts: The case of insulin rejection by patients.

Malmasi, Shervin; Ge, Wendong; Hosomura, Naoshi; Turchin, Alexander

Affiliation

Malmasi S; Division of Endocrinology, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.
Ge W; Division of Endocrinology, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.
Hosomura N; Division of Endocrinology, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.
Turchin A; Division of Endocrinology, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA. Electronic address: aturchin@bwh.harvard.edu.

J Biomed Inform; 99: 103306, 2019 Nov.

Article in English | MEDLINE | ID: mdl-31618679

ABSTRACT

OBJECTIVE:

To comparatively evaluate a range of Natural Language Processing (NLP) approaches for Information Extraction (IE) of low-prevalence concepts in clinical notes, using patients' decline of insulin therapy recommendations as the example.

MATERIALS AND METHODS:

We evaluated the accuracy of detecting documentation of patients' decline of insulin therapy using sentence-level naïve Bayes, logistic regression, and support vector machine (SVM) classification (with and without SMOTE oversampling); token-level sequence labelling using conditional random fields (CRFs); uni- and bi-directional recurrent neural network (RNN) models with GRU and LSTM cells; and rule-based detection using the Canary platform. All models were trained using the same manually annotated 50,046-document training set and evaluated on the same 1,501-document held-out set. Hyperparameter optimization was performed using 10-fold cross-validation.
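To make the oversampling step concrete, the following is a minimal sketch of the SMOTE idea (synthetic minority samples created by interpolating between a minority point and one of its nearest minority neighbours). The abstract does not state the feature representation, the number of neighbours, or the library used, so the function name `smote`, the parameter `k`, and the toy two-dimensional points are all assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours.

    `minority` is a list of equal-length numeric tuples (hypothetical feature vectors).
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random point on the segment between the base point and the neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of a unit square (illustrative values only)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=6, k=2)
```

With `k=2`, each corner's nearest neighbours are the two adjacent corners, so every synthetic point falls on an edge of the square. In practice, oversampling of this kind would be applied to the training folds only, so that synthetic samples never leak into the data used for evaluation.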

RESULTS:

At the sentence level, prevalence of documentation of decline of insulin therapy by patients was 0.02% in both the training and held-out sets. Naïve Bayes and logistic regression models did not achieve an F1 score ≥ 0.5 on the training set and were not further evaluated. Among the other models, evaluation against the held-out test set showed that SVM identified decline of insulin therapy by patients with an F1 score of 0.61, CRF with an F1 score of 0.51, RNN with an F1 score of 0.67, and the Canary rule-based model with an F1 score of 0.97.
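The choice of F1 score over plain accuracy matters at this prevalence. The sketch below computes precision, recall, and F1 from confusion-matrix counts; the counts themselves are hypothetical (1,000,000 sentences with 200 positives, matching a 0.02% prevalence) and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Fraction of all sentences classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical classifier that labels every sentence as negative
all_negative_accuracy = accuracy(tp=0, fp=0, fn=200, tn=999_800)
_, _, all_negative_f1 = precision_recall_f1(tp=0, fp=0, fn=200)

# Hypothetical classifier that finds 150 of the 200 positives with 50 false alarms
p, r, f1 = precision_recall_f1(tp=150, fp=50, fn=50)
```

Under these assumed counts, the all-negative classifier is correct on 99.98% of sentences yet has an F1 score of zero, which is why accuracy alone says little about detection of a low-prevalence concept.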

CONCLUSIONS:

Identification of low-prevalence concepts can present challenges in medical language processing. Rule-based systems that include the designer's background knowledge of language may be able to achieve higher accuracy under these circumstances.
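As a rough illustration of how a designer's knowledge of language can be encoded as rules, the following sketch flags sentences in which a refusal expression is followed closely by the word "insulin". It is not the Canary platform's rule language and does not reproduce the study's rules; the patterns, the 40-character window, the function name `mentions_insulin_decline`, and the example sentences are all hypothetical.

```python
import re

# Hypothetical refusal expressions; a real rule set would be far more extensive
REFUSAL = r"(?:declin\w*|refus\w*|does not want|not interested in|unwilling to (?:start|take|use))"

# Refusal expression followed by "insulin" within the same sentence (up to 40 characters apart)
PATTERN = re.compile(rf"\b{REFUSAL}\b[^.]{{0,40}}?\binsulin\b", re.IGNORECASE)

# Simple guard against negated refusals such as "has not refused insulin"
NEGATED = re.compile(r"\b(?:has|have|did)\s+not\s+(?:declin\w*|refus\w*)", re.IGNORECASE)


def mentions_insulin_decline(sentence):
    """Return True if the sentence appears to document a patient declining insulin."""
    if NEGATED.search(sentence):
        return False
    return PATTERN.search(sentence) is not None


# Made-up example sentences, not drawn from any clinical record
examples = [
    "Patient declined insulin therapy at this visit.",
    "She refuses to start insulin despite elevated A1c.",
    "Pt does not want to go on insulin.",
    "Continue insulin glargine 20 units nightly.",
    "Patient has not refused insulin.",
]
flags = [mentions_insulin_decline(s) for s in examples]
```

Rules of this kind need no positive training examples, which may be one reason they hold up when only a tiny fraction of sentences contain the concept; the trade-off is the manual effort of writing and maintaining the patterns.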

Subjects

Data Mining/methods; Electronic Health Records; Insulin/therapeutic use; Natural Language Processing; Treatment Refusal/statistics & numerical data; Diabetes Mellitus/drug therapy; Humans; Hypoglycemic Agents/therapeutic use; Neural Networks, Computer; Support Vector Machine; User-Computer Interface

Keywords

Conditional random fields; Insulin; Natural language processing; Recurrent neural networks; Support vector machine

Full text: 1 | Database: MEDLINE | Main subject: Natural Language Processing / Treatment Refusal / Electronic Health Records / Data Mining / Insulin | Type of study: Guideline / Prevalence_studies / Prognostic_studies / Risk_factors_studies | Limits: Humans | Language: En | Year of publication: 2019 | Document type: Article
