An automated data cleaning method for Electronic Health Records by incorporating clinical knowledge.
Shi, Xi; Prins, Charlotte; Van Pottelbergh, Gijs; Mamouris, Pavlos; Vaes, Bert; De Moor, Bart.
Affiliations
  • Shi X; Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Kasteelpark Arenberg 10 - Box 2446, 3001, Leuven, Belgium. xi.shi@esat.kuleuven.be.
  • Prins C; Leuven Statistics Research Center, KU Leuven, 3000, Leuven, Belgium.
  • Van Pottelbergh G; Academic Center for General Practice, KU Leuven, 3000, Leuven, Belgium.
  • Mamouris P; Academic Center for General Practice, KU Leuven, 3000, Leuven, Belgium.
  • Vaes B; Academic Center for General Practice, KU Leuven, 3000, Leuven, Belgium.
  • De Moor B; Department of Electrical Engineering (ESAT), Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Kasteelpark Arenberg 10 - Box 2446, 3001, Leuven, Belgium.
BMC Med Inform Decis Mak; 21(1): 267, 2021 Sep 17.
Article in English | MEDLINE | ID: mdl-34535146
ABSTRACT

BACKGROUND:

The use of Electronic Health Records (EHR) data in clinical research is increasing rapidly, but the abundance of data sources makes data cleaning a challenge. Automating the cleaning can save time. However, automated data cleaning tools built for other domains often process all variables uniformly, so they serve clinical data poorly, where variable-specific information must be considered. This paper proposes an automated data cleaning method for EHR data that takes clinical knowledge into account.

METHODS:

We used EHR data collected in primary care in Flanders, Belgium, from 1994 to 2015. We constructed a Clinical Knowledge Database to store all the variable-specific information needed for data cleaning. We applied fuzzy search to automatically detect and replace misspelled units, and performed unit conversion using variable-specific conversion formulas. Numeric values were then corrected and outliers detected on the basis of this clinical knowledge. In total, 52 clinical variables were cleaned, and the percentage of non-missing values (completeness) and the percentage of values within the normal range (correctness) were compared before and after cleaning.
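The unit-handling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which fuzzy-search algorithm was used, so the standard-library `difflib.get_close_matches` stands in for it here. The `KNOWLEDGE` dictionary, its field names, the conversion factor (an approximate textbook value for glucose), the plausible range, and the similarity cutoff are all hypothetical and chosen only for illustration. The correction of numeric values is not covered.

```python
from difflib import get_close_matches

# Hypothetical knowledge-base entry for one variable; all numbers are illustrative.
KNOWLEDGE = {
    "glucose": {
        "standard_unit": "mg/dL",
        # Factor that multiplies a value in the given unit to reach the standard unit
        # (18.0 is an approximate conversion for glucose, used here as an assumption).
        "to_standard": {"mg/dL": 1.0, "mmol/L": 18.0},
        # Assumed plausibility limits, expressed in the standard unit.
        "plausible_range": (10.0, 1000.0),
    }
}


def normalize_unit(raw_unit, known_units, cutoff=0.6):
    """Return the known unit closest to a possibly misspelled one, or None."""
    lookup = {u.lower(): u for u in known_units}
    matches = get_close_matches(
        raw_unit.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


def clean_measurement(variable, value, raw_unit):
    """Return the value in the standard unit, or None if it cannot be cleaned."""
    entry = KNOWLEDGE[variable]
    unit = normalize_unit(raw_unit, entry["to_standard"])
    if unit is None:
        return None  # unit not recognisable, treat the record as missing
    converted = value * entry["to_standard"][unit]
    low, high = entry["plausible_range"]
    # Values outside the assumed plausible range are flagged as outliers (None).
    return converted if low <= converted <= high else None
```

In this sketch a record that cannot be matched to a known unit, or that falls outside the range, becomes missing, which is one way a cleaning step can lower completeness while raising correctness.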

RESULTS:

All variables were 100% complete before data cleaning. After cleaning, completeness dropped by less than 1% for 42 variables and by 1-10% for 9 variables. Only 1 variable showed a large decline in completeness (13.36%). After cleaning, all variables had more than 50% of values within the normal range, and 43 of them had more than 70%.
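As a rough illustration of the two quality measures reported here, the following Python sketch computes completeness and correctness for a single variable. It rests on assumptions the abstract does not confirm: missing values are represented as `None`, and correctness is taken over non-missing values only. The function names are hypothetical.

```python
def completeness(values):
    """Percentage of entries that are not missing (None)."""
    if not values:
        return 0.0
    present = sum(v is not None for v in values)
    return 100.0 * present / len(values)


def correctness(values, normal_range):
    """Percentage of non-missing entries that fall within the normal range."""
    low, high = normal_range
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    in_range = sum(low <= v <= high for v in present)
    return 100.0 * in_range / len(present)
```

Computing both functions on the same variable before and after cleaning gives the kind of comparison summarised in the results.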

CONCLUSIONS:

We propose a general method for clinical variables that achieves a high degree of automation and can handle large-scale data. The method greatly improved the efficiency of data cleaning and removed technical barriers for non-technical users.

Full text: 1 Database: MEDLINE Main subject: Primary Health Care / Electronic Health Records Study type: Prognostic_studies Limit: Humans Country as subject: Europe Language: En Publication year: 2021 Document type: Article