Efficient learning from big data for cancer risk modeling: A case study with melanoma.
Comput Biol Med; 110: 29-39, 2019 07.
Article | En | MEDLINE | ID: mdl-31112896
ABSTRACT
BACKGROUND:
Building cancer risk models from real-world data requires overcoming challenges in data preprocessing, efficient representation, and computational performance. We present a case study of a cloud-based approach to learning from de-identified electronic health record data and demonstrate its effectiveness for melanoma risk prediction.
METHODS:
We used a hybrid distributed and non-distributed approach to computing in the cloud: distributed processing with Apache Spark for data preprocessing and labeling, and non-distributed processing with scikit-learn for machine learning model training. We also explored the effects of sampling the training dataset to improve computational performance. Risk factors were evaluated using regression weights as well as tree SHAP values.
RESULTS:
Among 4,061,172 patients who did not have melanoma through the 2016 calendar year, 10,129 were diagnosed with melanoma within one year. A gradient-boosted classifier achieved the best predictive performance with cross-validation (AUC = 0.799, Sensitivity = 0.753, Specificity = 0.688). Compared to a model built on the original data, a dataset two orders of magnitude smaller could achieve statistically similar or better performance with less than 1% of the training time and cost.
CONCLUSIONS:
We produced a model that can effectively predict melanoma risk for a diverse dermatology population in the U.S. by using hybrid computing infrastructure and data sampling. For this de-identified clinical dataset, sampling approaches significantly shortened the time for model building while retaining predictive accuracy, allowing for more rapid machine learning model experimentation on familiar computing machinery. A large number of risk factors (>300) were required to produce the best model.
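The sampling comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes a synthetic imbalanced dataset from scikit-learn in place of the health record data, a stratified random 10% subsample as the sampling scheme, default `GradientBoostingClassifier` settings, a single held-out split instead of cross-validation, and a 0.5 decision threshold. The abstract specifies none of these choices, and the helper `evaluate` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test, threshold=0.5):
    """Return AUC, sensitivity and specificity on a held-out set."""
    scores = model.predict_proba(X_test)[:, 1]
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_test, scores),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Synthetic, imbalanced stand-in for the labeled cohort (not the study data).
X, y = make_classification(
    n_samples=6000, n_features=20, n_informative=8,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stratified 10% subsample of the training rows (the fraction is illustrative).
X_small, _, y_small, _ = train_test_split(
    X_train, y_train, train_size=0.10, stratify=y_train, random_state=0
)

# Train one model on the full training set and one on the subsample.
full_model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
small_model = GradientBoostingClassifier(random_state=0).fit(X_small, y_small)

# Score both models on the same held-out rows so the metrics are comparable.
results = {
    "full": evaluate(full_model, X_test, y_test),
    "sampled": evaluate(small_model, X_test, y_test),
}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Both models are scored on the same held-out rows, so any difference in the metrics reflects only the size of the training set. With a rare outcome, sensitivity at a 0.5 threshold tends to be low, so in practice the threshold would be tuned to the desired trade-off between sensitivity and specificity.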
Database: MEDLINE
Main subject: Electronic Health Records / Machine Learning / Big Data / Melanoma / Models, Biological
Study type: Etiology_studies / Prognostic_studies / Risk_factors_studies / Screening_studies
Limits: Humans
Language: En
Journal: Comput Biol Med
Year: 2019
Document type: Article