DREAMER: a computational framework to evaluate readiness of datasets for machine learning.

Ahangaran, Meysam; Zhu, Hanzhi; Li, Ruihui; Yin, Lingkai; Jang, Joseph; Chaudhry, Arnav P; Farrer, Lindsay A; Au, Rhoda; Kolachalama, Vijaya B

Affiliation

Ahangaran M; Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Zhu H; Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Li R; Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Yin L; Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Jang J; Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Chaudhry AP; Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Farrer LA; Department of Neurology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Au R; Department of Ophthalmology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
Kolachalama VB; Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA.

BMC Med Inform Decis Mak ; 24(1): 152, 2024 Jun 04.

Article in English | MEDLINE | ID: mdl-38831432

ABSTRACT

BACKGROUND:

Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.

RESULTS:

The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies.
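
To make "elimination of extraneous features and rows" concrete, here is a minimal sketch of one way a tabular dataset could be pruned and then scored. It is not DREAMER's algorithm, which the abstract does not detail. The function names `completeness` and `prune_table`, the pruning rules (drop constant columns, rows with missing values, and duplicate rows), and the toy data are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def completeness(rows: List[Row]) -> float:
    """Fraction of non-missing cells (a simple, hypothetical quality metric)."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 0.0
    return sum(value is not None for value in cells) / len(cells)


def prune_table(rows: List[Row]) -> List[Row]:
    """Drop constant columns, rows with missing values, and duplicate rows.

    These rules are illustrative assumptions, not the published method.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    # A column whose observed values never vary carries no signal for a model.
    informative = [
        col for col in columns
        if len({row[col] for row in rows if row[col] is not None}) > 1
    ]
    pruned: List[Row] = []
    seen = set()
    for row in rows:
        reduced = {col: row[col] for col in informative}
        if any(value is None for value in reduced.values()):
            continue  # skip rows that are still incomplete
        key = tuple(reduced[col] for col in informative)
        if key in seen:
            continue  # skip exact duplicates of an earlier row
        seen.add(key)
        pruned.append(reduced)
    return pruned


# Toy data, invented for this example only.
raw = [
    {"age": 71, "site": "A", "score": 0.4},
    {"age": 71, "site": "A", "score": 0.4},   # duplicate row
    {"age": 65, "site": "A", "score": None},  # missing value
    {"age": 80, "site": "A", "score": 0.9},
]
clean = prune_table(raw)
print(completeness(raw), completeness(clean))
print(clean)
```

Comparing a quality metric before and after pruning mirrors, in a very reduced form, the kind of before/after assessment the results describe. A real pipeline would use richer metrics and learned criteria rather than these fixed rules.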

CONCLUSION:

Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.

Subjects

Machine Learning; Humans; Datasets as Topic; Unsupervised Machine Learning; Algorithms; Supervised Machine Learning; Software

Keywords

Data quality measure; Data readiness; Feature engineering; Machine learning

Full text: 1 | Database: MEDLINE | Main subject: Machine Learning | Limits: Humans | Language: En | Publication year: 2024 | Document type: Article
