Results 1 - 3 of 3
1.
J Biomed Inform; 125: 103957, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34823030

ABSTRACT

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
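
As a minimal illustration of the class-specialized ensemble idea this abstract describes, the hypothetical sketch below trains one binary specialist per rare class on a rebalanced subsample and lets a confident specialist override a general multiclass model. Logistic regression over TF-IDF features stands in for the authors' CNN, and the function names, the `rare_classes` argument, and the 0.5 override threshold are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch: class-specialized ensemble for rare-class text
# classification. Logistic regression stands in for the paper's CNN.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def fit_class_specialized_ensemble(texts, labels, rare_classes, seed=0):
    vec = TfidfVectorizer()
    X = vec.fit_transform(texts)
    y = np.asarray(labels)
    # The general model sees the full (imbalanced) training set.
    general = LogisticRegression(max_iter=1000).fit(X, y)
    rng = np.random.default_rng(seed)
    specialists = {}
    for c in rare_classes:
        pos = np.flatnonzero(y == c)
        neg = np.flatnonzero(y != c)
        # Rebalance: downsample negatives to match the rare positive count.
        neg = rng.choice(neg, size=min(len(neg), len(pos)), replace=False)
        idx = np.concatenate([pos, neg])
        specialists[c] = LogisticRegression(max_iter=1000).fit(X[idx], y[idx] == c)
    return vec, general, specialists

def predict(texts, vec, general, specialists, override=0.5):
    X = vec.transform(texts)
    preds = general.predict(X).astype(object)
    for c, clf in specialists.items():
        # A confident specialist overrides the general model's prediction,
        # boosting recall (and hence macro F1) on its rare class.
        p_pos = clf.predict_proba(X)[:, list(clf.classes_).index(True)]
        preds[p_pos >= override] = c
    return preds
```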


Subject(s)
Natural Language Processing; Neoplasms; Electronic Health Records; Humans; Machine Learning; Neural Networks, Computer
2.
BMC Bioinformatics; 22(1): 113, 2021 Mar 09.
Article in English | MEDLINE | ID: mdl-33750288

ABSTRACT

BACKGROUND: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model. RESULTS: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes. CONCLUSIONS: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
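
The margin ("marginal") and ratio uncertainty-sampling strategies that performed best on the small dataset are standard pool-based active learning heuristics, sketched below; the function names and batch-selection interface are illustrative, not the paper's code.

```python
# Margin and ratio uncertainty sampling over an unlabeled pool.
import numpy as np

def margin_scores(probs):
    # probs: (n_samples, n_classes) model probabilities on the pool.
    top = np.sort(probs, axis=1)
    return top[:, -1] - top[:, -2]        # small margin = high uncertainty

def ratio_scores(probs):
    top = np.sort(probs, axis=1)
    return top[:, -2] / np.clip(top[:, -1], 1e-12, None)  # near 1 = uncertain

def select_batch(probs, k, strategy="margin"):
    # Return indices of the k pool samples to send to human annotators.
    if strategy == "margin":
        order = np.argsort(margin_scores(probs))    # ascending margin
    else:
        order = np.argsort(-ratio_scores(probs))    # descending ratio
    return order[:k]
```

Each active learning iteration would label the selected batch, add it to the training set, retrain the classifier, and rescore the remaining pool.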


Subject(s)
Machine Learning; Neoplasms; Algorithms; Humans; Neoplasms/genetics; Neoplasms/pathology; Neural Networks, Computer
3.
JAMIA Open; 5(3): ooac075, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36110150

ABSTRACT

Objective: We aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports. Materials and Methods: We consider a text classification problem involving 5 individual tasks. The baseline model is a multitask convolutional neural network (MtCNN), and the implemented ensemble (teacher) consists of 1000 MtCNNs. We performed knowledge transfer by training a single model (student) on soft labels derived by aggregating the ensemble predictions. We evaluate performance based on accuracy and abstention rates using softmax thresholding. Results: The student model outperforms the baseline MtCNN in terms of abstention rates and accuracy, allowing the model to be used with a larger volume of documents when deployed. The highest boost was observed for subsite and histology, for which the student model classified an additional 1.81% and 3.33% of reports, respectively. Discussion: Ensemble predictions provide a useful strategy for quantifying the uncertainty inherent in labeled data, enabling the construction of soft labels with estimated probabilities over multiple classes for a given document. Training models with the derived soft labels reduces model confidence on difficult-to-classify documents, leading to fewer highly confident wrong predictions. Conclusions: Ensemble model distillation is a simple tool for reducing model overconfidence in problems with extreme class imbalance and noisy datasets. These methods can facilitate the deployment of deep learning models in high-risk domains with low computational resources where minimizing inference time is required.
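
A minimal PyTorch sketch of the distillation-plus-abstention recipe this abstract describes, assuming the teacher soft labels are the mean of the ensemble's softmax outputs; the MtCNN architecture is omitted, and the temperature and the 0.9 abstention threshold are assumptions for illustration.

```python
# Ensemble distillation with soft labels and softmax-threshold abstention.
import torch
import torch.nn.functional as F

def soft_labels(ensemble_logits):
    # ensemble_logits: (n_models, batch, n_classes) raw teacher outputs.
    return torch.softmax(ensemble_logits, dim=-1).mean(dim=0)

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    # Cross-entropy between the aggregated teacher distribution and the
    # student; the temperature is applied to the student logits only here.
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * log_p).sum(dim=-1).mean()

def predict_with_abstention(student_logits, threshold=0.9):
    probs = torch.softmax(student_logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    pred = pred.clone()
    pred[conf < threshold] = -1   # -1 marks an abstained document
    return pred
```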
