The role of fine-grained annotations in supervised recognition of risk factors for heart disease from EHRs.

Roberts, Kirk; Shooshan, Sonya E; Rodriguez, Laritza; Abhyankar, Swapna; Kilicoglu, Halil; Demner-Fushman, Dina

Affiliation

Roberts K; Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA. Electronic address: kirk.roberts@nih.gov.
Shooshan SE; Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Rodriguez L; Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Abhyankar S; Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Kilicoglu H; Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Demner-Fushman D; Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

J Biomed Inform ; 58 Suppl: S111-S119, 2015 Dec.

Article in English | MEDLINE | ID: mdl-26122527

ABSTRACT

This paper describes a supervised machine learning approach for identifying heart disease risk factors in clinical text and assesses the impact of annotation granularity and quality on the system's ability to recognize these risk factors. We use a series of support vector machine models together with manually built lexicons to classify triggers specific to each risk factor. The classification features are deliberately simple: they use only lexical information and ignore higher-level linguistic information such as syntax and semantics. Instead, we train the models on high-quality data created by annotating additional information on top of a standard corpus. Despite its relative simplicity, the system achieves the highest scores (micro- and macro-F1, and micro- and macro-recall) of the 20 participants in the 2014 i2b2/UTHealth Shared Task. It obtains a micro- (macro-) precision of 0.8951 (0.8965), recall of 0.9625 (0.9611), and F1-measure of 0.9276 (0.9277). Additionally, we perform a series of experiments to assess the value of the annotated data we created. These experiments show that manually labeled negative annotations can improve information extraction performance, demonstrating the importance of high-quality, fine-grained natural language annotations.
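As a reading aid for the scores above, the following minimal Python sketch shows how precision, recall, and F1 relate, and how micro- and macro-averaging differ. It is not the shared task's evaluation script. The per-group counts in `toy_counts` are hypothetical, and the abstract does not say which grouping (for example, per document or per risk factor) the macro-average runs over. The last two lines only check that the reported F1 values are consistent with the harmonic mean of the reported precision and recall.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one group
Counts = Tuple[int, int, int]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def micro_average(groups: Dict[str, Counts]) -> Tuple[float, float]:
    """Pool the counts of all groups, then compute precision and recall once."""
    tp = sum(c[0] for c in groups.values())
    fp = sum(c[1] for c in groups.values())
    fn = sum(c[2] for c in groups.values())
    return precision_recall(tp, fp, fn)


def macro_average(groups: Dict[str, Counts]) -> Tuple[float, float]:
    """Compute precision and recall per group, then take the unweighted mean."""
    scores = [precision_recall(*c) for c in groups.values()]
    mean_p = sum(p for p, _ in scores) / len(scores)
    mean_r = sum(r for _, r in scores) / len(scores)
    return mean_p, mean_r


# Hypothetical counts for illustration only; they are not from the paper.
toy_counts: Dict[str, Counts] = {
    "group_a": (8, 2, 0),
    "group_b": (1, 1, 1),
}

micro_p, micro_r = micro_average(toy_counts)
macro_p, macro_r = macro_average(toy_counts)

# Consistency check on the figures reported in the abstract.
reported_micro_f1 = round(f1(0.8951, 0.9625), 4)
reported_macro_f1 = round(f1(0.8965, 0.9611), 4)
```

In the toy data the large group dominates the micro-average, while the macro-average weights both groups equally, so the two can diverge noticeably when group sizes or difficulty differ. In the abstract's results the two averages are nearly identical.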

Subjects

Coronary Artery Disease/epidemiology; Data Mining/methods; Diabetes Complications/epidemiology; Electronic Health Records/organization & administration; Natural Language Processing; Supervised Machine Learning; Aged; Cohort Studies; Comorbidity; Computer Security; Confidentiality; Coronary Artery Disease/diagnosis; Diabetes Complications/diagnosis; Female; Humans; Incidence; Longitudinal Studies; Male; Maryland/epidemiology; Middle Aged; Narration; Pattern Recognition, Automated/methods; Risk Assessment/methods; Vocabulary, Controlled

Keywords

Machine learning; Natural language annotation; Natural language processing

Full text: 1
Collections: 01-internacional
Database: MEDLINE
Main subject: Coronary Artery Disease / Natural Language Processing / Diabetes Complications / Electronic Health Records / Data Mining / Supervised Machine Learning
Type of study: Diagnostic_studies / Etiology_studies / Incidence_studies / Observational_studies / Prognostic_studies / Qualitative_research / Risk_factors_studies
Limits: Aged / Female / Humans / Male / Middle aged
Country/Region as subject: North America
Language: En
Journal: J Biomed Inform
Journal subject: MEDICAL INFORMATICS
Year of publication: 2015
Document type: Article
