Improving record linkage performance in the presence of missing linkage data.

Ong, Toan C; Mannino, Michael V; Schilling, Lisa M; Kahn, Michael G

Affiliation

Ong TC; University of Colorado, Denver, Business School, Denver, CO, USA; Department of Medicine, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA; Colorado Clinical and Translational Sciences Institute, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.
Mannino MV; University of Colorado, Denver, Business School, Denver, CO, USA.
Schilling LM; Department of Medicine, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.
Kahn MG; Department of Pediatrics, School of Medicine, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA; Colorado Clinical and Translational Sciences Institute, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA.

J Biomed Inform; 52: 43-54, 2014 Dec.

Article in English | MEDLINE | ID: mdl-24524889

ABSTRACT

INTRODUCTION: Existing record linkage methods do not handle missing linking field values efficiently or effectively. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when linkage fields have missing values.

METHODS: By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system, we developed three novel methods to address the missing data problem in record linkage: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight of each missing attribute across the remaining available linkage fields in proportion to their relative weights. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value itself. Linkage Expansion adds fields not previously used for linkage to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates.

RESULTS: In data sets with low corruption rates, the methods had sensitivity ranging from 0.895 to 0.992 and positive predictive values (PPV) ranging from 0.865 to 1.0. Increased corruption rates led to decreased sensitivity for all methods.

CONCLUSIONS: These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.
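The first two methods named in the abstract can be illustrated with a minimal Python sketch. It is not the FRIL implementation and does not reproduce the authors' algorithms: the function names, the field names, the example weights, and the default imputed distance of 0.5 are all assumptions chosen for illustration, and the exact-match distance stands in for whatever distance functions the study used.

```python
from typing import Dict, Optional, Set


def redistribute_weights(weights: Dict[str, float], missing: Set[str]) -> Dict[str, float]:
    """Drop fields with missing data and rescale the remaining weights.

    Each remaining field keeps its share relative to the other remaining
    fields, and the total weight is preserved (proportional redistribution).
    """
    available = {f: w for f, w in weights.items() if f not in missing}
    total_available = sum(available.values())
    if total_available == 0:
        raise ValueError("no linkage field with data remains")
    scale = sum(weights.values()) / total_available
    return {f: w * scale for f, w in available.items()}


def field_distance(a: Optional[str], b: Optional[str], imputed: float = 0.5) -> float:
    """Return a distance in [0, 1] between two field values.

    When either value is missing, the distance itself is imputed
    (hypothetical default 0.5) instead of guessing the missing value.
    """
    if a is None or b is None:
        return imputed
    return 0.0 if a == b else 1.0  # placeholder exact-match distance


# Hypothetical field weights; "dob" is missing in one of the two records.
weights = {"last_name": 4.0, "first_name": 3.0, "dob": 3.0}
adjusted = redistribute_weights(weights, missing={"dob"})
dob_distance = field_distance("1980-01-01", None)
```

In this sketch the two strategies are alternatives for the same record pair: either the missing field is dropped and its weight is shared among the others, or the field stays in the score with an imputed distance.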

Subjects

Biomedical Research/methods; Biomedical Research/standards; Medical Informatics; Medical Record Linkage/methods; Medical Record Linkage/standards; Algorithms; Humans; Research Design

Keywords

Comparative effectiveness research; Data quality; Missing data; Quasi-identifiers; Record linkage

Full text: 1 Database: MEDLINE Main subject: Medical Informatics / Medical Record Linkage / Biomedical Research Study type: Prognostic_studies Limit: Humans Language: En Publication year: 2014 Document type: Article
