Search | BVS IEC

Scalable deep text comprehension for Cancer surveillance on high-performance computing.

Qiu, John X; Yoon, Hong-Jun; Srivastava, Kshitij; Watson, Thomas P; Blair Christian, J; Ramanathan, Arvind; Wu, Xiao C; Fearn, Paul A; Tourassi, Georgia D.

BMC Bioinformatics ; 19(Suppl 18): 488, 2018 Dec 21.

Article in English | MEDLINE | ID: mdl-30577743

ABSTRACT

BACKGROUND: Deep Learning (DL) has advanced the state-of-the-art capabilities in bioinformatics applications, which has resulted in a trend toward increasingly sophisticated and computationally demanding models trained on larger and larger data sets. This vastly increased computational demand challenges the feasibility of conducting cutting-edge research. One solution is to distribute the vast computational workload across multiple computing cluster nodes with data parallelism algorithms. In this study, we used a High-Performance Computing environment and implemented the Downpour Stochastic Gradient Descent algorithm for data parallelism to train a Convolutional Neural Network (CNN) for the natural language processing task of information extraction from a massive dataset of cancer pathology reports. We evaluated the scalability improvements using data parallelism training and the Titan supercomputer at Oak Ridge Leadership Computing Facility. To evaluate scalability, we used different numbers of worker nodes and performed a set of experiments comparing the effects of different training batch sizes and optimizer functions. RESULTS: We found that Adadelta would consistently converge at a lower validation loss, though requiring over twice as many training epochs as the fastest converging optimizer, RMSProp. The Adam optimizer consistently achieved a close second-place minimum validation loss significantly faster; using batch sizes of 16 and 32 allowed the network to converge in only 4.5 training epochs. CONCLUSIONS: We demonstrated that the networked training process is scalable across multiple compute nodes communicating via the Message Passing Interface (MPI) while achieving higher classification accuracy compared to a traditional machine learning algorithm.

Subjects

Computing Methodologies , Deep Learning/trends , Neoplasms/diagnosis , Comprehension , Humans , Neoplasms/pathology , Neural Networks, Computer
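
The Downpour Stochastic Gradient Descent scheme named in the abstract above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the study trained a CNN across Titan nodes over MPI, whereas this toy fits a one-parameter linear model and simulates the workers in a loop. The function name `downpour_sgd` and the parameters `n_fetch`, `n_push`, and `lr` are assumptions made for illustration.

```python
def downpour_sgd(data, n_workers=4, n_fetch=2, n_push=2, lr=0.02, steps=300):
    """Sequentially simulate Downpour SGD for the toy model y = w * x.

    Each worker owns one data shard, refreshes its replica from the
    parameter server every n_fetch steps, and pushes its accumulated
    gradients back every n_push steps.
    """
    server_w = 0.0                                   # parameter server state
    shards = [data[k::n_workers] for k in range(n_workers)]  # data parallelism
    local_w = [server_w] * n_workers                 # possibly stale replicas
    accum = [0.0] * n_workers                        # gradients awaiting a push
    for step in range(steps):
        for k in range(n_workers):
            if step % n_fetch == 0:
                local_w[k] = server_w                # fetch fresh parameters
            x, y = shards[k][step % len(shards[k])]  # next sample in the shard
            grad = 2.0 * (local_w[k] * x - y) * x    # d/dw of squared error
            local_w[k] -= lr * grad                  # local replica update
            accum[k] += grad
            if step % n_push == 0:
                server_w -= lr * accum[k]            # server applies the push
                accum[k] = 0.0
    return server_w


# Hypothetical noise-free data drawn from y = 3x.
toy_data = [(i / 10, 3.0 * i / 10) for i in range(-10, 11) if i != 0]
fitted_w = downpour_sgd(toy_data)
```

In a real deployment the workers run concurrently and exchange parameters and gradients by message passing, so the server sees updates computed from stale replicas; the loop above only imitates that staleness through the fetch and push intervals.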

Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks.

Alawad, Mohammed; Gao, Shang; Qiu, John X; Yoon, Hong Jun; Blair Christian, J; Penberthy, Lynne; Mumphrey, Brent; Wu, Xiao-Cheng; Coyle, Linda; Tourassi, Georgia.

J Am Med Inform Assoc ; 27(1): 89-98, 2020 01 01.

Article in English | MEDLINE | ID: mdl-31710668

ABSTRACT

OBJECTIVE: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency. MATERIALS AND METHODS: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC). RESULTS: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports respectively across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN. CONCLUSIONS: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task-specific model.

Subjects

Information Storage and Retrieval/methods , Machine Learning , Natural Language Processing , Neoplasms/pathology , Neural Networks, Computer , Registries , Humans , Neoplasms/classification , Support Vector Machine
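
The abstract above notes that a hard parameter sharing model uses about the same number of trainable parameters as one single-task CNN. The following back-of-the-envelope sketch shows why, under stated assumptions: the class counts per task come from the abstract, while the trunk size (`TRUNK_PARAMS`), the feature width (`HIDDEN`), and the use of one dense output layer per task are hypothetical and not taken from the paper.

```python
# Class counts per task, as listed in the abstract.
TASK_CLASSES = {"site": 65, "laterality": 4, "behavior": 3,
                "histology": 63, "grade": 5}

TRUNK_PARAMS = 1_000_000  # hypothetical size of the shared embedding + conv layers
HIDDEN = 300              # hypothetical width of the feature vector fed to each head


def head_params(hidden, n_classes):
    """Weights plus biases of one dense softmax output layer."""
    return hidden * n_classes + n_classes


def hard_sharing_params(trunk, hidden, tasks):
    """One shared trunk, one small output head per task."""
    return trunk + sum(head_params(hidden, c) for c in tasks.values())


def separate_models_params(trunk, hidden, tasks):
    """One full single-task model (trunk + head) per task."""
    return sum(trunk + head_params(hidden, c) for c in tasks.values())


shared_total = hard_sharing_params(TRUNK_PARAMS, HIDDEN, TASK_CLASSES)
separate_total = separate_models_params(TRUNK_PARAMS, HIDDEN, TASK_CLASSES)
```

With these assumed sizes the five output heads add only a few percent to the trunk, whereas training five separate single-task models repeats the trunk five times.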

Classifying cancer pathology reports with hierarchical self-attention networks.

Gao, Shang; Qiu, John X; Alawad, Mohammed; Hinkle, Jacob D; Schaefferkoetter, Noah; Yoon, Hong-Jun; Christian, Blair; Fearn, Paul A; Penberthy, Lynne; Wu, Xiao-Cheng; Coyle, Linda; Tourassi, Georgia; Ramanathan, Arvind.

Artif Intell Med ; 101: 101726, 2019 11.

Article in English | MEDLINE | ID: mdl-31813492

ABSTRACT

We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports and show how its unique architecture leads to a new state-of-the-art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks - site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data - Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state-of-the-art). We show that HiSANs are superior to other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to the previous state-of-the-art, hierarchical attention networks, HiSANs not only are an order of magnitude faster to train, but also achieve about 1% better relative accuracy and 5% better relative macro F-score.

Subjects

Neoplasms/pathology , Deep Learning , Humans , Natural Language Processing , Neoplasms/classification , Neural Networks, Computer

Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports.

Qiu, John X; Yoon, Hong-Jun; Fearn, Paul A; Tourassi, Georgia D.

IEEE J Biomed Health Inform ; 22(1): 244-251, 2018 01.

Article in English | MEDLINE | ID: mdl-28475069

ABSTRACT

Pathology reports are a primary source of information for cancer registries, which process high volumes of free-text reports annually. Information extraction and coding is a manual, labor-intensive process. In this study, we investigated deep learning and a convolutional neural network (CNN) for extracting ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports. We performed two experiments, using a CNN and a more conventional term frequency vector approach, to assess the effects of class prevalence and inter-class transfer learning. The experiments were based on a set of 942 pathology reports with human expert annotations as the gold standard. CNN performance was compared against a more conventional term frequency vector space approach. We observed that the deep learning models consistently outperformed the conventional approaches in the class prevalence experiment, resulting in micro- and macro-F score increases of up to 0.132 and 0.226, respectively, when class labels were well populated. Specifically, the best performing CNN achieved a micro-F score of 0.722 over 12 ICD-O-3 topography codes. Transfer learning provided a consistent but modest performance boost for the deep learning methods, but trends were contingent on the CNN method and cancer site. These encouraging results demonstrate the potential of deep learning for automated abstraction of pathology reports.

Subjects

Artificial Intelligence , Diagnosis, Computer-Assisted/methods , Electronic Health Records , Neoplasms , Humans , Neoplasms/classification , Neoplasms/diagnosis , Neoplasms/pathology , Support Vector Machine
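
Several abstracts in this list report micro- and macro-F scores. As a reading aid, the sketch below computes both for a single-label classification task using the common F1 definitions; the papers may use a variant, so treat this as an assumption. The function name `f_scores` and the toy labels are hypothetical.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(n_tp, n_fp, n_fn):
        denom = 2 * n_tp + n_fp + n_fn
        return 2 * n_tp / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example with an imbalanced label set (labels are placeholders).
truth = ["A", "A", "A", "A", "B", "C"]
preds = ["A", "A", "A", "A", "A", "C"]
micro_f, macro_f = f_scores(truth, preds)
```

Because the macro score weights every class equally, a rare class that is always missed (class "B" here) pulls it well below the micro score, which is why the abstracts report both when class labels are unevenly populated.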

Hierarchical attention networks for information extraction from cancer pathology reports.

Gao, Shang; Young, Michael T; Qiu, John X; Yoon, Hong-Jun; Christian, James B; Fearn, Paul A; Tourassi, Georgia D; Ramanathan, Arvind.

J Am Med Inform Assoc ; 25(3): 321-330, 2018 Mar 01.

Article in English | MEDLINE | ID: mdl-29155996

ABSTRACT

OBJECTIVE: We explored how a deep learning (DL) approach based on hierarchical attention networks (HANs) can improve model performance for multiple information extraction tasks from unstructured cancer pathology reports compared to conventional methods that do not sufficiently capture syntactic and semantic contexts from free-text documents. MATERIALS AND METHODS: Data for our analyses were obtained from 942 deidentified pathology reports collected by the National Cancer Institute Surveillance, Epidemiology, and End Results program. The HAN was implemented for 2 information extraction tasks: (1) primary site, matched to 12 International Classification of Diseases for Oncology topography codes (7 breast, 5 lung primary sites), and (2) histological grade classification, matched to G1-G4. Model performance metrics were compared to conventional machine learning (ML) approaches including naive Bayes, logistic regression, support vector machine, random forest, and extreme gradient boosting, and other DL models, including a recurrent neural network (RNN), a recurrent neural network with attention (RNN w/A), and a convolutional neural network. RESULTS: Our results demonstrate that for both information tasks, HAN performed significantly better compared to the conventional ML and DL techniques. In particular, across the 2 tasks, the mean micro and macro F-scores for the HAN with pretraining were (0.852, 0.708), compared to naive Bayes (0.518, 0.213), logistic regression (0.682, 0.453), support vector machine (0.634, 0.434), random forest (0.698, 0.508), extreme gradient boosting (0.696, 0.522), RNN (0.505, 0.301), RNN w/A (0.637, 0.471), and convolutional neural network (0.714, 0.460). CONCLUSIONS: HAN-based DL models show promise in information abstraction tasks within unstructured clinical pathology reports.
