A multi-source heterogeneous medical data enhancement framework based on lakehouse.
Health Inf Sci Syst; 12(1): 37, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38974364
ABSTRACT
Obtaining high-quality datasets from raw data is a key step before data exploration and analysis. In the medical domain, a large amount of data needs quality improvement before it can be used to analyze the health condition of patients. Much research has addressed data extraction, data cleaning and data imputation separately. However, few frameworks integrate all three techniques, so the resulting datasets suffer in accuracy, consistency and integrity. In this paper, we propose a multi-source heterogeneous data enhancement framework based on the lakehouse MHDP, which comprises three steps: data extraction, data cleaning and data imputation. In the data extraction step, a data fusion technique is offered to handle multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the state-of-the-art (SOTA) algorithm SAITS are applied to different situations. We evaluate our framework on three tasks: clustering, classification and strategy prediction. The experimental results demonstrate the effectiveness of our data enhancement framework.
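
The multiple imputation (MI) step named in the abstract can be illustrated with a minimal sketch. This is not the framework's implementation: the function name `multiple_imputation_mean`, the normal-distribution imputation model and the sample heart-rate values are all assumptions made for illustration. It fills each missing value several times with random draws, analyzes every completed copy, and pools the results with Rubin's rules.

```python
import random
import statistics


def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate the mean of `values` (None marks a missing entry).

    Hypothetical helper: each of the m rounds fills the gaps with draws
    from a normal distribution fitted to the observed values, and the m
    per-round estimates are pooled with Rubin's rules.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    n = len(values)

    estimates, within_vars = [], []
    for _ in range(m):
        # Build one completed copy of the data set.
        completed = [v if v is not None else rng.gauss(mu, sigma) for v in values]
        estimates.append(statistics.mean(completed))
        # Sampling variance of the mean for this completed copy.
        within_vars.append(statistics.variance(completed) / n)

    pooled = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(within_vars)      # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total_var = within + (1 + 1 / m) * between # Rubin's total variance
    return pooled, total_var


# Assumed example data: heart-rate readings with two missing entries.
heart_rate = [72, None, 80, 76, None, 90, 68]
pooled_mean, total_variance = multiple_imputation_mean(heart_rate)
print(pooled_mean, total_variance)
```

The between-imputation term is what separates MI from filling each gap once: it carries the uncertainty about the missing values into the final variance. A fuller treatment would also draw the model parameters anew in every round, which this sketch omits for brevity.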
Full text: 1
Collection: 01-internacional
Database: MEDLINE
Language: English
Journal: Health Inf Sci Syst
Year: 2024
Document type: Article
Country of publication: United Kingdom