A multi-source heterogeneous medical data enhancement framework based on lakehouse.
Sheng, Ming; Wang, Shuliang; Zhang, Yong; Hao, Rui; Liang, Ye; Luo, Yi; Yang, Wenhan; Wang, Jincheng; Li, Yinan; Zheng, Wenkui; Li, Wenyao.
Affiliation
  • Sheng M; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081 China.
  • Wang S; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081 China.
  • Zhang Y; BNRist, DCST, RIIT, Tsinghua University, Beijing, 100084 China.
  • Hao R; School of Computer Science, International School, Beijing University of Posts and Telecommunications, Haidian District, Beijing, 100876 China.
  • Liang Y; School of Information Science and Technology, Beijing Foreign Studies University, Beijing, China.
  • Luo Y; School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081 China.
  • Yang W; School of Computer Science, International School, Beijing University of Posts and Telecommunications, Haidian District, Beijing, 100876 China.
  • Wang J; School of Computer and Information Technology, Beijing Jiaotong University, Haidian District, Beijing, 100044 China.
  • Li Y; Dam Safety Monitoring Center, Yellow River Engineering Consulting Co., Ltd, Zhengzhou, 450000 China.
  • Zheng W; School of Software, Henan University, Kaifeng, 475004 Henan China.
  • Li W; School of Software, Henan University, Kaifeng, 475004 Henan China.
Health Inf Sci Syst; 12(1): 37, 2024 Dec.
Article in En | MEDLINE | ID: mdl-38974364
ABSTRACT
Obtaining high-quality datasets from raw data is a key step before data exploration and analysis. In the medical domain, a large amount of data needs quality improvement before it can be used to analyze patients' health conditions. There has been much research on data extraction, data cleaning, and data imputation individually. However, few frameworks integrate all three techniques, so the resulting datasets suffer in accuracy, consistency, and integrity. In this paper, we propose a multi-source heterogeneous data enhancement framework based on a lakehouse MHDP, comprising three steps: data extraction, data cleaning, and data imputation. In the data extraction step, a data fusion technique handles multi-modal and multi-source heterogeneous data. In the data cleaning step, we propose HoloCleanX, which provides a convenient interactive procedure. In the data imputation step, multiple imputation (MI) and the state-of-the-art algorithm SAITS are applied in different situations. We evaluate our framework on three tasks: clustering, classification, and strategy prediction. The experimental results demonstrate the effectiveness of our data enhancement framework.

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Health Inf Sci Syst Publication year: 2024 Document type: Article Country of publication: United Kingdom
