1.
J Biomed Inform ; 149: 104576, 2024 01.
Article in English | MEDLINE | ID: mdl-38101690

ABSTRACT

INTRODUCTION: Machine learning algorithms are expected to work side-by-side with humans in decision-making pipelines. Thus, the ability of classifiers to make reliable decisions is of paramount importance. Deep neural networks (DNNs) represent the state-of-the-art models to address real-world classification. Although the strength of activation in DNNs is often correlated with the network's confidence, in-depth analyses are needed to establish whether they are well calibrated. METHOD: In this paper, we demonstrate the use of DNN-based classification tools to benefit cancer registries by automating information extraction of disease at diagnosis and at surgery from electronic text pathology reports from the US National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) population-based cancer registries. In particular, we introduce multiple methods for selective classification to achieve a target level of accuracy on multiple classification tasks while minimizing the rejection amount, that is, the number of electronic pathology reports for which the model's predictions are unreliable. We evaluate the proposed methods by comparing our approach with the current in-house deep learning-based abstaining classifier. RESULTS: Overall, all the proposed selective classification methods effectively allow for achieving the targeted level of accuracy or higher in a trade-off analysis aimed to minimize the rejection rate. On in-distribution validation and holdout test data, with all the proposed methods, we achieve on all tasks the required target level of accuracy with a lower rejection rate than the deep abstaining classifier (DAC). Interpreting the results for the out-of-distribution test data is more complex; nevertheless, in this case as well, the rejection rate from the best among the proposed methods achieving 97% accuracy or higher is lower than the rejection rate based on the DAC. CONCLUSIONS: We show that although both approaches can flag those samples that should be manually reviewed and labeled by human annotators, the newly proposed methods retain a larger fraction and do so without retraining, thus offering a reduced computational cost compared with the in-house deep learning-based abstaining classifier.
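The trade-off described in this abstract (reach a target accuracy while rejecting as few reports as possible) can be illustrated with a generic confidence-threshold sketch. This is not one of the paper's proposed methods, which the abstract does not specify; the function name, the use of a single per-report confidence score, and the toy numbers in the usage note are all assumptions made for illustration.

```python
def pick_threshold(confidences, correct, target_accuracy):
    """Find the loosest confidence cut-off that still meets a target accuracy.

    confidences: per-report confidence scores from a classifier (assumed input)
    correct:     1 if the prediction for that report was right, else 0
    Returns (threshold, coverage, accuracy) for the largest retained set whose
    accuracy is >= target_accuracy, or None if no cut-off qualifies.
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    total = len(ranked)
    best = None
    n_correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        n_correct += ok
        # Only evaluate at boundaries between distinct confidence values,
        # so that tied reports are always kept or rejected together.
        if i < total and ranked[i][0] == conf:
            continue
        accuracy = n_correct / i
        if accuracy >= target_accuracy:
            # Retained-set accuracy is not monotone in the cut-off, so keep
            # scanning and remember the last (largest) qualifying set.
            best = (conf, i / total, accuracy)
    return best
```

For example, with confidences [0.99, 0.95, 0.9, 0.8, 0.6, 0.55], correctness flags [1, 1, 1, 0, 1, 0], and a target of 0.75, the sketch keeps every report with confidence of at least 0.6; the rejection rate is one minus the returned coverage. In practice the cut-off would be chosen on validation data and then applied unchanged to held-out data.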


Subjects
Deep Learning , Humans , Uncertainty , Neural Networks, Computer , Algorithms , Machine Learning
2.
JAMIA Open ; 5(3): ooac075, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36110150

ABSTRACT

Objective: We aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports. Materials and Methods: We consider the text classification problem that involves 5 individual tasks. The baseline model consists of a multitask convolutional neural network (MtCNN), and the implemented ensemble (teacher) consists of 1000 MtCNNs. We performed knowledge transfer by training a single model (student) with soft labels derived through the aggregation of ensemble predictions. We evaluate performance based on accuracy and abstention rates by using softmax thresholding. Results: The student model outperforms the baseline MtCNN in terms of abstention rates and accuracy, thereby allowing the model to be used with a larger volume of documents when deployed. The highest boost was observed for subsite and histology, for which the student model classified an additional 1.81% reports for subsite and 3.33% reports for histology. Discussion: Ensemble predictions provide a useful strategy for quantifying the uncertainty inherent in labeled data and thereby enable the construction of soft labels with estimated probabilities for multiple classes for a given document. Training models with the derived soft labels reduces model confidence in difficult-to-classify documents, thereby leading to a reduction in the number of highly confident wrong predictions. Conclusions: Ensemble model distillation is a simple tool to reduce model overconfidence in problems with extreme class imbalance and noisy datasets. These methods can facilitate the deployment of deep learning models in high-risk domains with low computational resources where minimizing inference time is required.
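The idea of training a student on soft labels can be sketched in a few lines. The abstract says only that soft labels were "derived through the aggregation of ensemble predictions"; taking the plain mean of the members' class probabilities, and scoring the student with a soft-target cross-entropy, are assumptions made here for illustration, and both function names are hypothetical.

```python
import math


def soft_labels(ensemble_probs):
    """Average per-class probabilities over ensemble members for one document.

    ensemble_probs: list of probability vectors, one per ensemble member.
    Simple averaging is an assumed aggregation rule, not the paper's stated one.
    """
    n_members = len(ensemble_probs)
    n_classes = len(ensemble_probs[0])
    return [sum(member[c] for member in ensemble_probs) / n_members
            for c in range(n_classes)]


def soft_cross_entropy(student_probs, target_probs, eps=1e-12):
    """Cross-entropy of the student's distribution against soft targets."""
    return -sum(t * math.log(max(s, eps))
                for s, t in zip(student_probs, target_probs))
```

When ensemble members disagree on a document, the averaged target spreads probability over several classes, so a student that puts all of its mass on one class is penalized. This matches the abstract's observation that soft labels reduce confidence on difficult-to-classify documents.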

3.
JAMIA Open ; 5(2): ooac049, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35721398

ABSTRACT

Objectives: The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, there has been no development of machine learning models for the ICCC classification. We developed deep learning-based information extraction models from cancer pathology reports based on the ICD-O-3 coding standard. In this article, we describe extending the models to perform ICCC classification. Materials and Methods: We developed 2 models, ICD-O-3 classification and ICCC recoding (Model 1) and direct ICCC classification (Model 2), and 4 scenarios subject to the training sample size. We evaluated these models with a corpus consisting of 29 206 reports with age at diagnosis between 0 and 19 from 6 state cancer registries. Results: Our findings suggest that the direct ICCC classification (Model 2) is substantially better than reusing the ICD-O-3 classification model (Model 1). Applying the uncertainty quantification mechanism to assess the confidence of the algorithm in assigning a code demonstrated that the model achieved a micro-F1 score of 0.987 while abstaining (not sufficiently confident to assign a code) on only 14.8% of ambiguous pathology reports. Conclusions: Our experimental results suggest that the machine learning-based automatic information extraction from childhood cancer pathology reports in the ICCC is a reliable means of supplementing human annotators at state cancer registries by reading and abstracting the majority of the childhood cancer pathology reports accurately and reliably.

4.
Cancer Biomark ; 33(2): 185-198, 2022.
Article in English | MEDLINE | ID: mdl-35213361

ABSTRACT

BACKGROUND: With the use of artificial intelligence and machine learning techniques for biomedical informatics, security and privacy concerns over the data and subject identities have also become an important issue and essential research topic. Without intentional safeguards, machine learning models may find patterns and features to improve task performance that are associated with private personal information. OBJECTIVE: The privacy vulnerability of deep learning models for information extraction from medical textual content needs to be quantified since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of the deep learning models for natural language processing and explore a proper way of securing patients' information to mitigate confidentiality breaches. METHODS: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, where the data for training the model are from multiple state population-based cancer registries. This study proposes the following schemes to collect vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. RESULTS: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.


Subjects
Confidentiality , Deep Learning , Information Storage and Retrieval/methods , Natural Language Processing , Neoplasms/epidemiology , Artificial Intelligence , Deep Learning/standards , Humans , Neoplasms/pathology , Registries
5.
IEEE J Biomed Health Inform ; 26(6): 2796-2803, 2022 06.
Article in English | MEDLINE | ID: mdl-35020599

ABSTRACT

Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominates. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.


Subjects
Reproducibility of Results , Data Collection , Humans
6.
J Natl Cancer Inst ; 114(6): 907-909, 2022 06 13.
Article in English | MEDLINE | ID: mdl-34181001

ABSTRACT

The coronavirus disease 2019 (COVID-19) pandemic led to delayed medical care in the United States. We examined changes in patterns of cancer diagnosis and surgical treatment between January 1 and December 31 in 2020 and 2019 with real-time electronic pathology report data from population-based Surveillance, Epidemiology, and End Results cancer registries from Georgia and Louisiana. During 2020, there were 29 905 fewer pathology reports than in 2019, representing a 10.2% decline. Declines were observed in all age groups, including children and adolescents younger than 18 years. The nadir was early April 2020, with 42.8% fewer reports than in April 2019. Numbers of reports through December 2020 never consistently exceeded those in 2019 after first declines. Patterns were similar by age group and cancer site. Findings suggest substantial delays in diagnosis and treatment services for cancers during the pandemic. Ongoing evaluation can inform public health efforts to minimize any lasting adverse effects of the pandemic on cancer diagnosis, stage, treatment, and survival.


Subjects
COVID-19 , Neoplasms , Adolescent , COVID-19/epidemiology , Child , Humans , Neoplasms/diagnosis , Neoplasms/epidemiology , Neoplasms/therapy , Pandemics , Population Surveillance , Registries , United States/epidemiology
7.
Med Care ; 60(1): 44-49, 2022 01 01.
Article in English | MEDLINE | ID: mdl-34812787

ABSTRACT

BACKGROUND: Cancer recurrence is an important measure of the impact of cancer treatment. However, no population-based data on recurrence are available. Pathology reports could potentially identify cancer recurrences. Their utility to capture recurrences is unknown. OBJECTIVE: This analysis assesses the sensitivity of pathology reports to identify patients with cancer recurrence and the stage at recurrence. SUBJECTS: The study includes patients with recurrent breast (n=214) or colorectal (n=203) cancers. RESEARCH DESIGN: This retrospective analysis included patients from a population-based cancer registry who were part of the Patient-Centered Outcomes Research (PCOR) Study, a project that followed cancer patients in-depth for 5 years after diagnosis to identify recurrences. MEASURES: Information abstracted from pathology reports for patients with recurrence was compared with their PCOR data (gold standard) to determine what percent had a pathology report at the time of recurrence, the sensitivity of text in the report to identify recurrence, and if the stage at recurrence could be determined from the pathology report. RESULTS: One half of cancer patients had a pathology report near the time of recurrence. For patients with a pathology report, the report's sensitivity to identify recurrence was 98.1% for breast cancer cases and 95.7% for colorectal cancer cases. The specific stage at recurrence from the pathology report had a moderate agreement with gold-standard data. CONCLUSIONS: Pathology reports alone cannot measure population-based recurrence of solid cancers but can identify specific cohorts of recurrent cancer patients. As electronic submission of pathology reports increases, these reports may identify specific recurrent patients in near real-time.


Subjects
Documentation/standards , Neoplasms/diagnosis , Neoplasms/pathology , Recurrence , Breast Neoplasms/diagnosis , Breast Neoplasms/epidemiology , Breast Neoplasms/pathology , Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/epidemiology , Colorectal Neoplasms/pathology , Documentation/methods , Documentation/statistics & numerical data , Female , Humans , Male , Middle Aged , Neoplasms/epidemiology , Retrospective Studies
8.
J Biomed Inform ; 125: 103957, 2022 01.
Article in English | MEDLINE | ID: mdl-34823030

ABSTRACT

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.


Subjects
Natural Language Processing , Neoplasms , Electronic Health Records , Humans , Machine Learning , Neural Networks, Computer
9.
J Registry Manag ; 49(4): 109-113, 2022.
Article in English | MEDLINE | ID: mdl-37260810

ABSTRACT

The National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) program is continuously exploring opportunities to augment its already extensive collection of data, enhance the quality of reported cancer information, and contribute to more comprehensive analyses of cancer burden. This manuscript describes a recent linkage of the LexisNexis longitudinal residential history data with 11 SEER registries and provides estimates of the inter-state mobility of SEER cancer patients. To identify mobility from one state to another, we used state postal abbreviations to generate state-level residential histories. From this, we determined how often cancer patients moved from state-to-state. The results in this paper provide information on the linkage with LexisNexis data and useful information on state-to-state residential mobility patterns of a large portion of US cancer patients for the most recent 1-, 2-, 3-, 4-, and 5-year periods. We show that mobility patterns vary by geographic area, race/ethnicity and age, and cancer patients tend to move less than the general population.


Subjects
Neoplasms , Humans , United States/epidemiology , Neoplasms/epidemiology , Registries , Population Dynamics , Ethnicity , SEER Program
10.
BMC Bioinformatics ; 22(1): 113, 2021 Mar 09.
Article in English | MEDLINE | ID: mdl-33750288

ABSTRACT

BACKGROUND: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model. RESULTS: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes. CONCLUSIONS: Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
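The two strategies the abstract singles out on the small dataset, marginal and ratio uncertainty sampling, can be sketched as follows. The abstract does not define them, so the common textbook formulations are assumed here: margin is the gap between the two highest class probabilities, and ratio is the second-highest probability divided by the highest. All function names are hypothetical.

```python
def margin_score(probs):
    """Gap between the two most likely classes; a small gap means uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def ratio_score(probs):
    """Second-highest over highest probability; near 1 means uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def select_for_labeling(pool_probs, budget, strategy="margin"):
    """Return indices of the `budget` most uncertain unlabeled documents.

    pool_probs: one predicted probability vector per unlabeled document.
    """
    if strategy == "margin":
        def key(i):
            return margin_score(pool_probs[i])
    elif strategy == "ratio":
        def key(i):
            return -ratio_score(pool_probs[i])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(range(len(pool_probs)), key=key)[:budget]
```

In each active learning iteration, the selected documents would be sent to human annotators, added to the labeled set, and the model retrained before scoring the remaining pool again.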


Subjects
Machine Learning , Neoplasms , Algorithms , Humans , Neoplasms/genetics , Neoplasms/pathology , Neural Networks, Computer
11.
IEEE J Biomed Health Inform ; 25(9): 3596-3607, 2021 09.
Article in English | MEDLINE | ID: mdl-33635801

ABSTRACT

Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text.


Subjects
Natural Language Processing , Neural Networks, Computer , Humans
12.
IEEE Trans Emerg Top Comput ; 9(3): 1219-1230, 2021.
Article in English | MEDLINE | ID: mdl-36117774

ABSTRACT

Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL to tackle this and other real-world problems is proportional to the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are main obstacles for data sharing across cancer registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer which may contain patient identifiers. Thus, even distributing the trained models across cancer registries causes a privacy violation issue. In this paper, we propose DL NLP model distribution via privacy-preserving transfer learning approaches without sharing sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics - tumor site, subsite, laterality, behavior, histology, and grade - from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare our proposed approach to conventional transfer learning without privacy-preserving, single-registry models, and a model trained on centrally hosted data. The results show that transfer learning approaches including data sharing and model distribution outperform significantly the single-registry model. In addition, the best performing privacy-preserving model distribution approach achieves statistically indistinguishable average micro- and macro-F1 scores across all extraction tasks (0.823,0.580) as compared to the centralized model (0.827,0.585).
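The gap between the micro- and macro-F1 scores reported here (about 0.82 versus 0.58) is typical of tasks with many rare classes. A minimal sketch for single-label classification shows why: micro-F1 pools all decisions, so frequent classes dominate, while macro-F1 averages per-class scores, so every rare class counts equally. The function name and the toy labels in the usage note are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(n_tp, n_fp, n_fn):
        denom = 2 * n_tp + n_fp + n_fn
        return 2 * n_tp / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro
```

With eight reports of a common class and two of a rare class, a model that always predicts the common class scores a micro-F1 of 0.8 but a macro-F1 of only about 0.44, because the rare class contributes an F1 of zero to the average.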

13.
J Biomed Inform ; 110: 103564, 2020 10.
Article in English | MEDLINE | ID: mdl-32919043

ABSTRACT

OBJECTIVE: In machine learning, it is evident that classification task performance increases if bootstrap aggregation (bagging) is applied. However, the bagging of deep neural networks takes tremendous amounts of computational resources and training time. The research question that we aimed to answer in this research is whether we could achieve higher task performance scores and accelerate the training by dividing a problem into sub-problems. MATERIALS AND METHODS: The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained the deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We performed the training of many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL). RESULTS: We demonstrated that aggregation of the models improves task performance compared with the single-model approach, which is consistent with other research studies; and we demonstrated that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, which had more than 500 class labels in the task; these results show that data partition may alleviate the complexity of the task. On the contrary, the methods did not achieve superior scores for the tasks of site and subsite classification. Intrinsically, since data partitioning was based on the primary cancer site, the accuracy depended on the determination of the partitions, which needs further investigation and improvement. CONCLUSION: Results in this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores and (2) we achieved faster training by leveraging the high-performance Summit supercomputer at ORNL.
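The three ingredients named in this abstract (partitioning the data into sub-problems, bootstrap resampling, and aggregating model outputs) can be sketched with small helpers. The abstract does not say how predictions were aggregated, so majority voting is an assumption here, as are the function names and the representation of each case as a tuple in the usage note.

```python
import random
from collections import Counter


def partition_by_key(cases, key):
    """Group training cases into sub-problems, e.g. by primary cancer site."""
    groups = {}
    for case in cases:
        groups.setdefault(key(case), []).append(case)
    return groups


def bootstrap_sample(cases, rng):
    """Draw len(cases) items with replacement (one bootstrap replicate)."""
    return [rng.choice(cases) for _ in range(len(cases))]


def majority_vote(labels):
    """Aggregate the labels that several models predicted for one report."""
    return Counter(labels).most_common(1)[0][0]
```

One model would be trained per bootstrap replicate within each partition, and at inference time the per-model predictions for a report would be combined, for instance with `majority_vote`. Because each replicate is independent, the training jobs can run concurrently, which is what makes the approach a natural fit for a high-performance computing environment.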


Subjects
Neoplasms , Neural Networks, Computer , Computing Methodologies , Humans , Information Storage and Retrieval , Machine Learning
14.
PLoS One ; 15(5): e0232840, 2020.
Article in English | MEDLINE | ID: mdl-32396579

ABSTRACT

Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks: site, subsite, laterality, histology, behavior, and grade. We expect that with minimal modifications, our add-on can be applied towards a wide range of other clinical text-based tasks.


Subjects
Electronic Health Records/classification , Neoplasms/pathology , Histological Techniques , Humans , Natural Language Processing , SEER Program
15.
BMC Med Res Methodol ; 20(1): 108, 2020 05 07.
Article in English | MEDLINE | ID: mdl-32381039

ABSTRACT

BACKGROUND: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. METHODS: In this paper, we evaluate three classes of synthetic data generation approaches; probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. RESULTS: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. CONCLUSIONS: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.


Subjects
Machine Learning , Neoplasms , Humans , Neoplasms/diagnosis , Neoplasms/epidemiology , Neural Networks, Computer
16.
J Natl Cancer Inst Monogr ; 2020(55): 82-88, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32412070

ABSTRACT

BACKGROUND: Chemotherapy information in the population-based cancer registries is underascertained and lacks detail. We conducted a pilot study in the Georgia SEER Cancer Registry (GCR) to investigate the feasibility of supplementing chemotherapy information using billing claims from six private oncology practices (OP). METHODS: To assess cancer patients' representativeness from OP, we compared individuals with invasive first primary cancers diagnosed during 2013-2015 in the GCR (cohort 1) with those who had at least one OP claim in the 12 months after diagnosis (cohort 2). To assess completeness of OP claims to capture chemotherapy (yes or no), we further restricted cohort 2 to patients ages 65 years and older enrolled in fee-for-service Medicare Part A and B from the diagnosis date through 12 months follow-up or to the date of death. With Medicare data serving as the gold standard, sensitivity, specificity, and kappa statistics for the receipt of chemotherapy per OP claims were calculated by demographic and clinical characteristics. RESULTS: Cancer patients seeking care in the OP included in our analysis were not representative of the underlying patient population in the GCR. The practices underrepresented minorities and the uninsured while overrepresenting females, persons with high socioeconomic status, patients residing outside the metropolitan Atlanta area, and persons with advanced-stage disease. The ability of practice claims to identify chemotherapy receipt was moderate (76.1% sensitivity) but varied by demographic and clinical characteristics (76.1-83.0%). CONCLUSIONS: Given the limited ability of OP claims to identify chemotherapy receipt, we suggest analyzing these data for hypothesis generation, but inference should be limited to this patient cohort.
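The agreement measures named in the methods (sensitivity, specificity, and the kappa statistic against a gold standard) all follow from a 2x2 table of claims-based versus gold-standard chemotherapy status. The sketch below uses the standard definitions, with Cohen's kappa assumed as the kappa statistic; the function name and the counts in the usage note are invented for illustration and are not the study's data.

```python
def agreement_stats(tp, fp, fn, tn):
    """Sensitivity, specificity, and Cohen's kappa from a 2x2 table.

    tp: both sources say chemotherapy was received
    fp: test source says yes, gold standard says no
    fn: test source says no, gold standard says yes
    tn: both sources say no
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of each source.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return sensitivity, specificity, kappa
```

With illustrative counts of 40 true positives, 10 false positives, 10 false negatives, and 40 true negatives, sensitivity and specificity are both 0.8 and kappa is 0.6, since half of the observed agreement would be expected by chance alone.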


Subjects
Fees and Charges , Medicare , Neoplasms , SEER Program , Aged , Female , Georgia , Humans , Male , Middle Aged , Neoplasms/drug therapy , Pilot Projects , United States
17.
J Registry Manag ; 47(2): 37-47, 2020.
Article in English | MEDLINE | ID: mdl-35363670

ABSTRACT

BACKGROUND: The Social Security Administration Service to Epidemiological Researchers (SSA-SER) can help central cancer registries meet the contractual follow-up requirements of the Surveillance, Epidemiology, and End Results (SEER) Program and improve survival estimate accuracy. We evaluated the impact of first-time SSA-SER linkage on follow-up rates and survival estimates for 2 SEER registries. METHODS: In May 2019, cancer registries in Idaho (Cancer Data Registry of Idaho [CDRI]) and New York (New York State Cancer Registry [NYSCR]) used results from an SSA-SER linkage to update date of last contact and vital status for patients with a SEER-reportable tumor diagnosed during 2000-2016. We compared follow-up completeness through 2017 between pre-SSA-SER linkage and post-SSA-SER linkage data. Among individuals with a first primary tumor diagnosed during 2009-2015, we calculated 60-month age-standardized all sites and site-specific relative survival ratio (RSR) estimates via the presumed alive method using pre-SSA linkage data, and survival time calculated from last known date of contact using post-SSA linkage data. RESULTS: SSA-SER linkage improved overall follow-up completeness from 79.0% to 97.4% and 55.7% to 92.6% for CDRI and NYSCR, respectively. Follow-up completeness improved most for laboratory-only reported tumors, in situ tumors, melanomas of the skin, prostate cancers, and benign and borderline brain and other central nervous system tumors. Post-SSA linkage RSRs differed from pre-SSA presumed alive RSRs by an average of -0.47% and -2.16% for Idaho and New York, respectively. CONCLUSIONS: SSA-SER linkage greatly and efficiently improved follow-up completeness for the 2 participating registries and revealed small differences in survival estimates by method. Use of the SSA-SER by all US registries would standardize and improve US survival estimates.

18.
J Am Med Inform Assoc ; 27(1): 89-98, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31710668

ABSTRACT

OBJECTIVE: We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency. MATERIALS AND METHODS: Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95,231 pathology documents (71,223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC). RESULTS: MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports, respectively, across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN. CONCLUSIONS: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while keeping training and inference times similar to those of a single task-specific model.
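
The "correctly classified across all 5 tasks" figures above are much lower than typical single-task accuracies, which is what one would expect if a report only counts as correct when every task is predicted correctly. The sketch below illustrates that reading. It is an assumption about how the metric is defined, not the authors' evaluation code, and the task keys, labels, and toy reports are hypothetical.

```python
TASKS = ("site", "laterality", "behavior", "histology", "grade")


def per_task_accuracy(gold, pred):
    """Fraction of reports predicted correctly, computed separately per task."""
    n = len(gold)
    return {
        task: sum(g[task] == p[task] for g, p in zip(gold, pred)) / n
        for task in TASKS
    }


def all_tasks_accuracy(gold, pred):
    """Fraction of reports for which every one of the tasks is correct."""
    hits = sum(
        all(g[task] == p[task] for task in TASKS) for g, p in zip(gold, pred)
    )
    return hits / len(gold)


def report(site, laterality, behavior, histology, grade):
    """Build one labeled report as a dict keyed by task name."""
    return dict(zip(TASKS, (site, laterality, behavior, histology, grade)))


# Toy data (hypothetical labels): two reports each have one wrong task.
gold = [
    report("C50", "left", "3", "8500", "2"),
    report("C34", "right", "3", "8140", "3"),
    report("C61", "none", "3", "8140", "2"),
    report("C18", "none", "2", "8210", "1"),
]
pred = [
    report("C50", "left", "3", "8500", "2"),
    report("C34", "right", "3", "8140", "2"),   # grade wrong
    report("C61", "none", "3", "8140", "2"),
    report("C18", "none", "3", "8210", "1"),   # behavior wrong
]

by_task = per_task_accuracy(gold, pred)
joint = all_tasks_accuracy(gold, pred)
```

In this toy set no single task falls below 75% accuracy, yet only half of the reports are correct on all five tasks at once, so the joint figure is always at most the lowest per-task accuracy.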


Subjects
Information Storage and Retrieval/methods , Machine Learning , Natural Language Processing , Neoplasms/pathology , Neural Networks, Computer , Registries , Humans , Neoplasms/classification , Support Vector Machine
19.
Artif Intell Med ; 101: 101726, 2019 11.
Article in English | MEDLINE | ID: mdl-31813492

ABSTRACT

We introduce a deep learning architecture, hierarchical self-attention networks (HiSANs), designed for classifying pathology reports and show how its unique architecture leads to a new state-of-the-art in accuracy, faster training, and clear interpretability. We evaluate performance on a corpus of 374,899 pathology reports obtained from the National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) program. Each pathology report is associated with five clinical classification tasks - site, laterality, behavior, histology, and grade. We compare the performance of the HiSAN against other machine learning and deep learning approaches commonly used on medical text data - Naive Bayes, logistic regression, convolutional neural networks, and hierarchical attention networks (the previous state-of-the-art). We show that HiSANs are superior to other machine learning and deep learning text classifiers in both accuracy and macro F-score across all five classification tasks. Compared to the previous state-of-the-art, hierarchical attention networks, HiSANs not only are an order of magnitude faster to train, but also achieve about 1% better relative accuracy and 5% better relative macro F-score.
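
The abstract reports both accuracy and macro F-score. The two can diverge sharply on imbalanced label sets, because macro F-score averages the per-class F1 values with equal weight regardless of class frequency. The following is a minimal sketch of both metrics in plain Python. It is not the authors' evaluation code, and the labels are toy values chosen for illustration.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data: a classifier that always predicts the majority class.
gold = ["a"] * 8 + ["b"] * 2
pred = ["a"] * 10

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Here the majority-class predictor reaches 80% accuracy but a macro F-score of only about 0.44, which is why macro F-score is the more informative number for rare sites or histologies.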


Subjects
Neoplasms/pathology , Deep Learning , Humans , Natural Language Processing , Neoplasms/classification , Neural Networks, Computer
20.
Cancer ; 123(4): 697-703, 2017 02 15.
Article in English | MEDLINE | ID: mdl-27783399

ABSTRACT

BACKGROUND: Researchers have used prostate-specific antigen (PSA) values collected by central cancer registries to evaluate tumors for potential aggressive clinical disease. An independent study collecting PSA values suggested a high error rate (18%) related to implied decimal points. To evaluate the error rate in the Surveillance, Epidemiology, and End Results (SEER) program, a comprehensive review of PSA values recorded across all SEER registries was performed. METHODS: Consolidated PSA values for eligible prostate cancer cases in SEER registries were reviewed and compared with text documentation from abstracted records. Four types of classification errors were identified: implied decimal point errors, abstraction or coding implementation errors, nonsignificant errors, and changes related to "unknown" values. RESULTS: A total of 50,277 prostate cancer cases diagnosed in 2012 were reviewed. Approximately 94.15% of cases did not have meaningful changes (85.85% correct, 5.58% with a nonsignificant change of <1 ng/mL, and 2.80% with no clinical change). Approximately 5.70% of cases had meaningful changes (1.93% due to implied decimal point errors, 1.54% due to abstract or coding errors, and 2.23% due to errors related to unknown categories). Only 419 of the original 50,277 cases (0.83%) resulted in a change in disease stage due to a corrected PSA value. CONCLUSIONS: The implied decimal error rate was only 1.93% of all cases in the current validation study, with a meaningful error rate of 5.81%. The reasons for the lower error rate in SEER are likely due to ongoing and rigorous quality control and visual editing processes by the central registries. The SEER program currently is reviewing and correcting PSA values back to 2004 and will re-release these data in the public use research file. Cancer 2017;123:697-703. © 2016 American Cancer Society.
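
To make the error categories named in this abstract concrete, the sketch below sorts a recorded PSA value against a corrected value taken from text documentation. The decision rules are assumptions made for illustration, not the SEER review protocol: a difference under 1 ng/mL is treated as nonsignificant, a ratio of 10, 100, or 1000 between the two values is treated as an implied decimal point error, a missing value on either side is treated as an unknown-related change, and anything else is treated as an abstraction or coding error. The "no clinical change" group is not modeled. The function name, category strings, and example values are hypothetical.

```python
import math
from typing import Optional


def classify_psa_change(recorded: Optional[float], corrected: Optional[float]) -> str:
    """Assign one illustrative category to a recorded/corrected PSA pair (ng/mL)."""
    if recorded is None and corrected is None:
        return "correct"
    if recorded is None or corrected is None:
        return "unknown_related"
    if recorded == corrected:
        return "correct"
    if abs(recorded - corrected) < 1.0:
        return "nonsignificant"
    low, high = sorted((recorded, corrected))
    if low > 0:
        ratio = high / low
        # A misplaced decimal point shifts the value by a power of ten.
        if any(math.isclose(ratio, 10.0 ** k, rel_tol=1e-6) for k in (1, 2, 3)):
            return "implied_decimal"
    return "abstraction_or_coding"


# Hypothetical (recorded, corrected) pairs.
examples = [
    (45.0, 4.5),    # decimal point dropped
    (4.5, 4.5),     # unchanged
    (4.5, 5.2),     # differs by less than 1 ng/mL
    (12.0, 30.0),   # unrelated values
    (None, 4.5),    # value was recorded as unknown
]
labels = [classify_psa_change(r, c) for r, c in examples]
```

A rule set like this only flags candidates. In the study itself the categories were assigned by comparing consolidated values with text documentation from abstracted records.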


Subjects
Predictive Value of Tests , Prostate-Specific Antigen/blood , Prostatic Neoplasms/epidemiology , SEER Program , Humans , Male , Neoplasm Staging , Prostatic Neoplasms/blood , Prostatic Neoplasms/pathology
...