Efficient learning from big data for cancer risk modeling: A case study with melanoma.

Richter, Aaron N; Khoshgoftaar, Taghi M

Affiliation

Richter AN; Department of Computer & Electrical Engineering and Computer Science College of Engineering and Computer Science, Florida Atlantic University, 777 Glades Road EE 403, Boca Raton, FL, 33431-0991, USA. Electronic address: arichter@fau.edu.
Khoshgoftaar TM; Department of Computer & Electrical Engineering and Computer Science College of Engineering and Computer Science, Florida Atlantic University, 777 Glades Road EE 403, Boca Raton, FL, 33431-0991, USA. Electronic address: khoshgof@fau.edu.

Comput Biol Med; 110: 29-39, July 2019.

Article in English | MEDLINE | ID: mdl-31112896

ABSTRACT

BACKGROUND:

Building cancer risk models from real-world data requires overcoming challenges in data preprocessing, efficient representation, and computational performance. We present a case study of a cloud-based approach to learning from de-identified electronic health record data and demonstrate its effectiveness for melanoma risk prediction.

METHODS:

We used a hybrid distributed and non-distributed approach to computing in the cloud: distributed processing with Apache Spark for data preprocessing and labeling, and non-distributed processing with scikit-learn for machine learning model training. We also explored the effects of sampling the training dataset to improve computational performance. Risk factors were evaluated using regression weights as well as tree SHAP values.
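
The abstract does not say which sampling scheme was applied to the training data. As a minimal sketch of one common option, random undersampling of the majority (non-melanoma) class, the following uses only the Python standard library on synthetic rows. The function name `undersample_majority`, the `ratio` parameter, and the row layout are hypothetical and are not taken from the paper.

```python
import random


def undersample_majority(rows, label_key="label", ratio=1.0, seed=0):
    """Keep every positive row and a random subset of the negative rows.

    `ratio` is the number of negatives kept per positive (an illustrative
    knob, not a parameter reported in the study).
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    # Never ask for more negatives than exist.
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    sampled = positives + rng.sample(negatives, n_keep)
    rng.shuffle(sampled)  # avoid an ordered block of positives
    return sampled


# Synthetic toy data: 1,000 rows, of which 10 are labeled positive.
rows = [{"patient_id": i, "label": 1 if i % 100 == 0 else 0} for i in range(1000)]
train = undersample_majority(rows, ratio=1.0, seed=42)
```

A smaller training set produced this way could then be passed to a single-machine learner; evaluation should still use data at the original class balance so that reported metrics are not distorted by the sampling.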

RESULTS:

Among 4,061,172 patients who did not have melanoma through the 2016 calendar year, 10,129 were diagnosed with melanoma within one year. A gradient-boosted classifier achieved the best predictive performance with cross-validation (AUC = 0.799, Sensitivity = 0.753, Specificity = 0.688). Compared to a model built on the original data, a dataset two orders of magnitude smaller could achieve statistically similar or better performance with less than 1% of the training time and cost.
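
To put the reported counts in perspective, the short calculation below derives the one-year incidence in the cohort (about 0.25%) and the positive predictive value that the reported sensitivity and specificity would imply at that incidence. The predictive value is not reported in the abstract; it is an illustrative derivation via Bayes' rule, assuming the cross-validated operating point applies to the whole cohort.

```python
# Figures taken from the abstract.
n_patients = 4_061_172
n_cases = 10_129
sensitivity = 0.753
specificity = 0.688

# One-year incidence of melanoma in the cohort.
prevalence = n_cases / n_patients

# Positive predictive value implied by Bayes' rule (assumption: the
# reported operating point holds at the cohort-wide incidence).
true_pos_rate = sensitivity * prevalence
false_pos_rate = (1 - specificity) * (1 - prevalence)
ppv = true_pos_rate / (true_pos_rate + false_pos_rate)

print(f"incidence = {prevalence:.4%}, implied PPV = {ppv:.4%}")
```

With an outcome this rare, even a model with good discrimination flags many patients who will not develop melanoma, which is typical for screening-style risk models.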

CONCLUSIONS:

We produced a model that can effectively predict melanoma risk for a diverse dermatology population in the U.S. by using hybrid computing infrastructure and data sampling. For this de-identified clinical dataset, sampling approaches significantly shortened the time for model building while retaining predictive accuracy, allowing for more rapid machine learning model experimentation on familiar computing machinery. A large number of risk factors (>300) were required to produce the best model.

Subject(s)

Big Data; Electronic Health Records; Machine Learning; Melanoma; Models, Biological; Humans; Melanoma/epidemiology; Melanoma/metabolism; Melanoma/pathology; Predictive Value of Tests; Risk Assessment; Risk Factors

Keywords

Big data; Cloud computing; Early detection of cancer; Electronic health records; Machine learning

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Electronic Health Records / Machine Learning / Big Data / Melanoma / Models, Biological | Type of study: Etiology_studies / Prognostic_studies / Risk_factors_studies / Screening_studies | Limits: Humans | Language: En | Journal: Comput Biol Med | Year: 2019 | Document type: Article
