Optimized dual threshold entity resolution for electronic health record databases--training set size and active learning.

Joffe, Erel; Byrne, Michael J; Reeder, Phillip; Herskovic, Jorge R; Johnson, Craig W; McCoy, Allison B; Bernstam, Elmer V

Affiliation

Joffe E; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX.
Byrne MJ; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX.
Reeder P; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX.
Herskovic JR; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX; Department of Bioinformatics and Computational Biology, MD Anderson Cancer Center, Houston, TX.
Johnson CW; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX.
McCoy AB; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX; UT Houston - Memorial Hermann Center for Healthcare Quality & Safety, Houston, TX.
Bernstam EV; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX; Division of General Internal Medicine, Department of Internal Medicine, Medical School, The University of Texas Health Science Center at Houston, Houston, TX.

AMIA Annu Symp Proc; 2013: 721-30, 2013.

Article in English | MEDLINE | ID: mdl-24551372

ABSTRACT

Clinical databases may contain several records for a single patient. Multiple general entity-resolution algorithms have been developed to identify such duplicate records. To achieve optimal accuracy, algorithm parameters must be tuned to a particular dataset. The purpose of this study was to determine the required training set size for probabilistic, deterministic and Fuzzy Inference Engine (FIE) algorithms with parameters optimized using the particle swarm approach. Each algorithm classified potential duplicates into one of three classes: definite match, non-match, or indeterminate (i.e., requires manual review). Training set sizes ranged from 2,000 to 10,000 randomly selected record-pairs. We also evaluated marginal uncertainty sampling for active learning. Optimization reduced the manual review size (Deterministic: from 11.6% to 2.5%; FIE: from 49.6% to 1.9%; Probabilistic: from 10.5% to 3.5%). FIE classified 98.1% of the records correctly (precision=1.0). Best performance required training on all 10,000 randomly selected record-pairs. Active learning achieved comparable results with 3,000 records. Automated optimization is effective, and targeted sampling can reduce the required training set size.
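
The three-way classification and the uncertainty-driven sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each record-pair already has a numeric similarity score in [0, 1], and it reads "marginal uncertainty sampling" as picking the pairs whose scores lie closest to a decision threshold, which is one plausible interpretation. The function names, the thresholds and the example scores are all hypothetical.

```python
from typing import List, Sequence


def classify_pair(score: float, lower: float, upper: float) -> str:
    """Dual-threshold decision (illustrative): scores at or above `upper` are
    matches, scores at or below `lower` are non-matches, and everything in
    between is left for manual review."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "indeterminate"


def manual_review_fraction(scores: Sequence[float], lower: float, upper: float) -> float:
    """Fraction of record-pairs that fall between the two thresholds."""
    if not scores:
        return 0.0
    pending = sum(1 for s in scores if classify_pair(s, lower, upper) == "indeterminate")
    return pending / len(scores)


def select_most_uncertain(scores: Sequence[float], lower: float, upper: float, k: int) -> List[int]:
    """Assumed form of uncertainty sampling: return the indices of the k pairs
    whose scores are closest to either threshold (smallest margin first)."""
    margins = [(min(abs(s - lower), abs(s - upper)), i) for i, s in enumerate(scores)]
    margins.sort()
    return [i for _, i in margins[:k]]


if __name__ == "__main__":
    example_scores = [0.95, 0.10, 0.55, 0.80, 0.31]  # made-up similarity scores
    print([classify_pair(s, 0.3, 0.8) for s in example_scores])
    print(manual_review_fraction(example_scores, 0.3, 0.8))
    print(select_most_uncertain(example_scores, 0.3, 0.8, k=2))
```

In this reading, widening the gap between the two thresholds raises precision on the automatically classified pairs at the cost of a larger manual review set, which is the trade-off that parameter optimization is meant to balance.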

Subject(s)

Algorithms; Artificial Intelligence; Electronic Health Records; Fuzzy Logic

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Algorithms / Artificial Intelligence / Electronic Health Records | Type of study: Guideline / Prognostic_studies | Language: En | Journal: AMIA Annu Symp Proc | Journal subject: MEDICAL INFORMATICS | Year: 2013 | Document type: Article
