ABSTRACT
BACKGROUND: Public Data Commons (PDC) have been highlighted in the scientific literature for their capacity to collect and harmonize big data. In contrast, local data commons (LDC), located within an institution or organization, have been underrepresented in the scientific literature, even though they are a critical part of research infrastructure. Being closest to the sources of data, LDCs can collect and maintain the most up-to-date, high-quality data within an organization. As data providers, LDCs face many challenges in both collecting and standardizing data; moreover, as consumers of PDC data, they face problems of data harmonization stemming from the monolithic harmonization pipeline designs commonly adopted by many PDCs. Unfortunately, existing guidelines and resources for building and maintaining data commons focus exclusively on PDCs and provide very little information on LDCs. RESULTS: This article focuses on four important observations. First, three types of LDC service models are defined based on their roles and requirements. These can be used as guidelines for building new LDCs or enhancing the services of existing ones. Second, seven core services of LDCs are discussed: cohort identification and facilitation of genomic sequencing, management of molecular reports and associated infrastructure, quality control, data harmonization, data integration, data sharing, and data access control. Third, in place of the commonly developed monolithic systems, we propose a new data-sharing method for data harmonization that combines divide-and-conquer and bottom-up approaches. Finally, an end-to-end LDC implementation is introduced with real-world examples. CONCLUSIONS: Although LDCs are an optimal place to identify and address data quality issues, they have traditionally been relegated to the role of passive data providers for much larger PDCs. Indeed, many LDCs limit their functions to routine data storage and transmission tasks because of a lack of information on how to design, develop, and improve their services with limited resources. We hope that this work will be a first small step in raising awareness among LDCs of their expanded utility and in publicizing the importance of LDCs to a wider audience.
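As a minimal sketch of the proposed divide-and-conquer, bottom-up harmonization (the source names and field mappings below are hypothetical; the article does not prescribe an implementation), each local source is harmonized by its own mapper, and the per-source results are then merged upward into a common record:

```python
# Hypothetical sketch of divide-and-conquer, bottom-up harmonization:
# each source is mapped to the common model independently ("divide"),
# then per-source results are merged upward by patient ("bottom-up").

def harmonize_pathology(record):
    # Source-specific mapping to a common model (fields are illustrative).
    return {"patient_id": record["mrn"], "site": record["icdo3_topo"]}

def harmonize_genomics(record):
    return {"patient_id": record["subject"], "variant": record["hgvs"]}

MAPPERS = {"pathology": harmonize_pathology, "genomics": harmonize_genomics}

def build_commons(sources):
    """Harmonize each source independently, then merge by patient_id."""
    commons = {}
    for name, records in sources.items():
        for rec in records:
            row = MAPPERS[name](rec)
            commons.setdefault(row["patient_id"], {}).update(row)
    return commons
```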
Subject(s)
Big Data , Information Dissemination , Developing Countries , Humans
ABSTRACT
BACKGROUND: The Kentucky Cancer Registry (KCR) is a central cancer registry for the state of Kentucky that receives data about incident cancer cases from all healthcare facilities in the state within 6 months of diagnosis. Like all other U.S. and Canadian cancer registries, KCR uses a data dictionary provided by the North American Association of Central Cancer Registries (NAACCR) for standardized data entry. The NAACCR data dictionary is not an ontological system. Mapping between the NAACCR data dictionary and the National Cancer Institute (NCI) Thesaurus (NCIt) will facilitate the enrichment, dissemination, and utilization of cancer registry data. We introduce a web-based system, called the Interactive Mapping Interface (IMI), for creating mappings from data dictionaries to ontologies, in particular from NAACCR to NCIt. METHOD: IMI has been designed as a general approach with three components: (1) an ontology library; (2) a mapping interface; and (3) a recommendation engine. The ontology library provides a list of ontologies as targets for building mappings. The mapping interface consists of six modules: project management, mapping dashboard, access control, logs and comments, hierarchical visualization, and result review and export. The built-in recommendation engine automatically identifies a list of candidate concepts to facilitate the mapping process. RESULTS: We report the architecture design and interface features of IMI. To validate our approach, we implemented an IMI prototype and pilot-tested its features by mapping a sample set of NAACCR data elements to NCIt concepts. Of 301 NAACCR data elements, 47 have been mapped to NCIt concepts. Five branches of the hierarchical tree have been identified from these mapped concepts for visual inspection. CONCLUSIONS: IMI provides an interactive, web-based interface for building mappings from data dictionaries to ontologies. Although the scope of our pilot testing is limited, our results demonstrate the feasibility of using IMI for semantic enrichment of cancer registry data by mapping NAACCR data elements to NCIt concepts.
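The recommendation engine can be illustrated with a simple lexical-similarity ranker; the concept codes, labels, and scoring below are illustrative assumptions, not IMI's actual engine:

```python
# Illustrative sketch of a mapping-recommendation engine: rank candidate
# NCIt concepts by lexical similarity to a NAACCR data element name.
# The codes and labels here are placeholders, not real NCIt content.
from difflib import SequenceMatcher

ncit_concepts = {
    "C0001": "Primary Tumor Site",
    "C0002": "Laterality",
    "C0003": "Histologic Grade",
}

def recommend(naaccr_item, top_k=3):
    """Return the top-k candidate concepts, best match first."""
    scored = [
        (SequenceMatcher(None, naaccr_item.lower(), label.lower()).ratio(),
         code, label)
        for code, label in ncit_concepts.items()
    ]
    return sorted(scored, reverse=True)[:top_k]

print(recommend("Primary Site"))  # "Primary Tumor Site" should rank first
```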
Subject(s)
Biological Ontologies , Neoplasms , Canada/epidemiology , Humans , Internet , Neoplasms/diagnosis , Neoplasms/epidemiology , Registries , Vocabulary, Controlled
ABSTRACT
OBJECTIVE: We study the performance of machine learning (ML) methods, including neural networks (NNs), in extracting mutational test results from pathology reports collected by cancer registries. Given the lack of hand-labeled datasets for mutational test result extraction, we focus on the particular use case of extracting epidermal growth factor receptor (EGFR) mutation results in non-small cell lung cancers. We explore the generalization of NNs across different registries, where our goals are twofold: (1) to assess how well models trained on one registry's data port to test data from a different registry and (2) to assess whether and to what extent such models can be improved using state-of-the-art neural domain adaptation techniques under different assumptions about what is available (labeled vs unlabeled data) at the target registry site. MATERIALS AND METHODS: We collected data from two registries: the Kentucky Cancer Registry (KCR) and the Fred Hutchinson Cancer Research Center (FH) Cancer Surveillance System. We combine NNs with adversarial domain adaptation to improve cross-registry performance, and we compare against other classifiers in the standard supervised classification, unsupervised domain adaptation, and supervised domain adaptation scenarios. RESULTS: The performance of ML methods varied between registries. For extracting positive results, the basic convolutional neural network (CNN) had an F1 of 71.5% on the KCR dataset and 95.7% on the FH dataset. For the KCR dataset, CNN F1 results were low when the model was trained on FH data (positive F1: 23%). Using our proposed adversarial CNN, without any labeled target data, we match the F1 of models trained directly on each target registry's data. The adversarial CNN F1 improved when trained on FH and applied to the KCR dataset (positive F1: 70.8%). We found similar improvements when training on KCR and testing on FH reports (positive F1: 45% to 96%). CONCLUSION: Adversarial domain adaptation improves the performance of NNs applied to pathology reports. In the unsupervised domain adaptation setting, we match the performance of models trained directly on the target registry's data by using the source registry's labeled data and unlabeled examples from the target registry.
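One standard realization of adversarial domain adaptation uses a gradient reversal layer (Ganin et al.); the PyTorch sketch below uses illustrative dimensions and layers, not the paper's exact architecture:

```python
# Sketch of an adversarial CNN: a shared encoder feeds a label classifier
# and, through a gradient reversal layer, a domain classifier, pushing the
# encoder toward registry-invariant features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # reverse and scale gradients

class DACNN(nn.Module):
    def __init__(self, vocab=5000, emb=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, 128, kernel_size=3)
        self.label_head = nn.Linear(128, n_classes)  # mutation result classifier
        self.domain_head = nn.Linear(128, 2)         # source vs target registry

    def forward(self, tokens, lam=1.0):
        h = self.conv(self.emb(tokens).transpose(1, 2)).max(dim=2).values
        y = self.label_head(h)
        d = self.domain_head(GradReverse.apply(h, lam))
        return y, d  # train with label loss + domain loss; reversal does the rest
```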
Subject(s)
Machine Learning , Mutation , Neoplasms/genetics , Neoplasms/pathology , Registries/statistics & numerical data , Carcinoma, Non-Small-Cell Lung/genetics , Carcinoma, Non-Small-Cell Lung/pathology , Computational Biology , Data Mining , Deep Learning , ErbB Receptors/genetics , Humans , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Neural Networks, Computer
ABSTRACT
CONTEXT: Patient misperceptions are a strong barrier to early palliative care discussions and referrals during advanced lung cancer treatment. OBJECTIVES: We developed and tested the acceptability of a web-based, patient-facing palliative care education and screening tool intended for use in a planned multilevel intervention (i.e., patient, clinician, and system-level targets). METHODS: We elicited feedback from advanced lung cancer patients (n = 6), oncology and palliative care clinicians (n = 4), and a clinic administrator (n = 1) on the perceived relevance of the intervention. We then tested the prototype of the patient-facing tool for patient acceptability and preliminary effects on patient palliative care knowledge and motivation. RESULTS: Partners agreed that the intervention (clinician palliative care education and an electronic health record-integrated patient tool) is relevant, and their feedback informed development of the patient prototype. Advanced-stage lung cancer patients (n = 20; age 60 ± 9.8; 40% male; 70% with a technical degree or less) reviewed and rated the prototype on a five-point scale for acceptability (4.48 ± 0.55), appropriateness (4.37 ± 0.62), and feasibility (4.43 ± 0.59). After using the prototype, 75% were interested in using palliative care and 80% were more motivated to talk to their oncologist about it. Of patients who had or were at risk of having misperceptions about palliative care (e.g., conflating it with hospice), 100% no longer held the misperceptions after using the prototype. CONCLUSION: The palliative care education and screening tool is acceptable to patients and may address misperceptions and motivate palliative care discussions during treatment.
Subject(s)
Hospice Care , Hospices , Lung Neoplasms , Neoplasms , Humans , Male , Middle Aged , Aged , Female , Palliative Care , Lung Neoplasms/therapy , Referral and Consultation , Neoplasms/therapy
ABSTRACT
Objective: The manual extraction of case details from patient records for cancer surveillance efforts is a resource-intensive task. Natural Language Processing (NLP) techniques have been proposed for automating the identification of key details in clinical notes. Our goal was to develop NLP application programming interfaces (APIs) for integration into cancer registry data abstraction tools in a computer-assisted abstraction setting. Methods: We used cancer registry manual abstraction processes to guide the design of DeepPhe-CR, a web-based NLP service API. The coding of key variables was done through NLP methods validated using established workflows. A container-based implementation including the NLP methods was developed. Existing registry data abstraction software was modified to include results from DeepPhe-CR. An initial usability study with data registrars provided early validation of the feasibility of the DeepPhe-CR tools. Results: API calls support submission of single documents and summarization of cases across multiple documents. The container-based implementation uses a REST router to handle requests and a graph database to store results. NLP modules extract topography, histology, behavior, laterality, and grade at 0.79-1.00 F1 across common and rare cancer types (breast, prostate, lung, colorectal, ovary, and pediatric brain) on data from two cancer registries. Usability study participants were able to use the tool effectively and expressed interest in adopting it. Discussion: Our DeepPhe-CR system provides a flexible architecture for building cancer-specific NLP tools directly into registrar workflows in a computer-assisted abstraction setting. Improving user interactions in client tools may be needed to realize the potential of these approaches. DeepPhe-CR: https://deepphe.github.io/.
ABSTRACT
PURPOSE: Manual extraction of case details from patient records for cancer surveillance is a resource-intensive task. Natural Language Processing (NLP) techniques have been proposed for automating the identification of key details in clinical notes. Our goal was to develop NLP application programming interfaces (APIs) for integration into cancer registry data abstraction tools in a computer-assisted abstraction setting. METHODS: We used cancer registry manual abstraction processes to guide the design of DeepPhe-CR, a web-based NLP service API. The coding of key variables was performed through NLP methods validated using established workflows. A container-based implementation of the NLP methods and the supporting infrastructure was developed. Existing registry data abstraction software was modified to include results from DeepPhe-CR. An initial usability study with data registrars provided early validation of the feasibility of the DeepPhe-CR tools. RESULTS: API calls support submission of single documents and summarization of cases across one or more documents. The container-based implementation uses a REST router to handle requests and a graph database to store results. NLP modules extract topography, histology, behavior, laterality, and grade at 0.79-1.00 F1 across multiple cancer types (breast, prostate, lung, colorectal, ovary, and pediatric brain) on data from two population-based cancer registries. Usability study participants were able to use the tool effectively and expressed interest in the tool. CONCLUSION: The DeepPhe-CR system provides an architecture for building cancer-specific NLP tools directly into registrar workflows in a computer-assisted abstraction setting. Improved user interactions in client tools may be needed to realize the potential of these approaches.
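A hedged sketch of how a client might call such a document-submission and case-summarization API; the host, paths, and payload fields below are assumptions for illustration, not the documented interface (see https://deepphe.github.io/):

```python
# Hypothetical client for a DeepPhe-CR-style REST service; endpoint names
# and JSON fields are illustrative assumptions only.
import requests

BASE = "http://localhost:8080"  # assumed local container deployment

def submit_document(doc_id, text):
    """Submit a single pathology document for NLP processing."""
    resp = requests.post(f"{BASE}/document", json={"id": doc_id, "text": text})
    resp.raise_for_status()
    return resp.json()  # e.g., extracted topography/histology/behavior

def summarize_case(patient_id):
    """Request a case-level summary across the patient's documents."""
    resp = requests.get(f"{BASE}/summary/{patient_id}")
    resp.raise_for_status()
    return resp.json()
```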
Subject(s)
Natural Language Processing , Neoplasms , Male , Female , Humans , Child , Software , Prostate , Registries , Neoplasms/diagnosis , Neoplasms/therapy
ABSTRACT
In August 2022, the Cancer Informatics for Cancer Centers brought together cancer informatics leaders for its biannual symposium, Precision Medicine Applications in Radiation Oncology, co-chaired by Quynh-Thu Le, MD (Stanford University), and Walter J. Curran, MD (GenesisCare). Over the course of 3 days, presenters discussed a range of topics relevant to radiation oncology and the cancer informatics community more broadly, including biomarker development, decision support algorithms, novel imaging tools, theranostics, and artificial intelligence (AI) for the radiotherapy workflow. Since the symposium, there has been an impressive shift in the promise and potential for integration of AI in clinical care, accelerated in large part by major advances in generative AI. AI is now poised more than ever to revolutionize cancer care. Radiation oncology is a field that uses and generates a large amount of digital data and is therefore likely to be one of the first fields to be transformed by AI. As experts in the collection, management, and analysis of these data, the informatics community will take a leading role in ensuring that radiation oncology is prepared to take full advantage of these technological advances. In this report, we provide highlights from the symposium, which took place in Santa Barbara, California, from August 29 to 31, 2022. We discuss lessons learned from the symposium for data acquisition, management, representation, and sharing, and put these themes into context to prepare radiation oncology for the successful and safe integration of AI and informatics technologies.
Subject(s)
Neoplasms , Radiation Oncology , Humans , Artificial Intelligence , Informatics , Neoplasms/diagnosis , Neoplasms/radiotherapy
ABSTRACT
BACKGROUND: Population-based state cancer registries are an authoritative source for cancer statistics in the United States. They routinely collect a variety of data, including patient demographics, primary tumor site, stage at diagnosis, first course of treatment, and survival, on every cancer case that is reported across all U.S. states and territories. The goal of our project is to enrich NCI's Surveillance, Epidemiology, and End Results (SEER) registry data with high-quality population-based biospecimen data in the form of digital pathology, machine-learning-based classifications, and quantitative histopathology imaging feature sets (referred to here as Pathomics features). MATERIALS AND METHODS: As part of the project, the underlying informatics infrastructure was designed, tested, and implemented through close collaboration with several participating SEER registries to ensure consistency with registry processes, computational scalability, and the ability to support creation of population cohorts that span multiple sites. Utilizing computational imaging algorithms and methods to both generate indices and search for matches makes it possible to reduce inter- and intra-observer inconsistencies and to improve the objectivity with which large image repositories are interrogated. RESULTS: Our team has created and continues to expand a well-curated repository of high-quality digitized pathology images corresponding to subjects whose data are routinely collected by the collaborating registries. Our team has systematically deployed and tested key visual analytic methods to facilitate automated creation of population cohorts for epidemiological studies and tools to support visualization of feature clusters and evaluation of whole-slide images. As part of these efforts, we are developing and optimizing advanced search and matching algorithms to facilitate automated, content-based retrieval of digitized specimens based on their underlying image features and staining characteristics. CONCLUSION: To meet the challenges of this project, we established the analytic pipelines, methods, and workflows to support the expansion and management of a growing repository of high-quality digitized pathology and information-rich population cohorts containing objective imaging and clinical attributes, facilitating studies that seek to discriminate among different subtypes of disease, stratify patient populations, and perform comparisons of tumor characteristics within and across patient cohorts. We have also successfully developed a suite of tools based on a deep-learning method to perform quantitative characterizations of tumor regions, assess infiltrating lymphocyte distributions, and generate objective nuclear feature measurements. As part of these efforts, our team has implemented reliable methods that enable investigators to systematically search through large repositories to automatically retrieve digitized pathology specimens and correlated clinical data based on their computational signatures.
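Content-based retrieval over precomputed image feature vectors can be sketched as a nearest-neighbor search; the feature dimensions, random data, and index choice below are illustrative, not the project's actual pipeline:

```python
# Minimal sketch of content-based retrieval over precomputed whole-slide
# image (Pathomics-style) feature vectors using cosine similarity.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))   # one feature vector per slide
slide_ids = [f"slide_{i}" for i in range(1000)]

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(features)

def retrieve(query_vec):
    """Return the IDs and distances of the most similar slides."""
    dist, idx = index.kneighbors(query_vec.reshape(1, -1))
    return [(slide_ids[i], float(d)) for i, d in zip(idx[0], dist[0])]

print(retrieve(features[0]))  # the query slide itself should rank first
```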
ABSTRACT
Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of this information is stored as unstructured data in pathology reports. Thus, to process the information, we require either automated extraction techniques or manual curation. Moreover, many cancer-related concepts appear infrequently in real-world training datasets, which makes automated extraction difficult. This study introduces a novel technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to a 4% absolute improvement for histology codes.
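A minimal sketch of hierarchical regularization under these assumptions: each fine-grained code's weight vector is pulled toward its parent's, so rare codes borrow strength from related classes (the parent map and architecture are illustrative, not the paper's exact formulation; the full model is multi-task, with one such head per task):

```python
# Sketch: a classifier head with a hierarchical penalty that ties each
# ICD-O-3 child code's weights to its parent's weights.
import torch
import torch.nn as nn

class HierClassifier(nn.Module):
    def __init__(self, feat_dim, n_codes, parent_of):
        super().__init__()
        self.out = nn.Linear(feat_dim, n_codes)
        self.parent_of = parent_of  # child class index -> parent class index

    def hier_penalty(self):
        w = self.out.weight  # shape: (n_codes, feat_dim), one row per code
        return sum(((w[c] - w[p]) ** 2).sum()
                   for c, p in self.parent_of.items())

# Training objective (illustrative):
#   loss = cross_entropy(model.out(features), labels)
#          + lambda_h * model.hier_penalty()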
ABSTRACT
Population cancer registries can benefit from Deep Learning (DL) to automatically extract cancer characteristics from the high volume of unstructured pathology text reports they process annually. The success of DL in tackling this and other real-world problems is proportional to the availability of large labeled datasets for model training. Although collaboration among cancer registries is essential to fully exploit the promise of DL, privacy and confidentiality concerns are the main obstacles to data sharing across cancer registries. Moreover, DL for natural language processing (NLP) requires sharing a vocabulary dictionary for the embedding layer, which may contain patient identifiers. Thus, even distributing trained models across cancer registries poses a privacy risk. In this paper, we propose DL NLP model distribution via privacy-preserving transfer learning approaches that avoid sharing sensitive data. These approaches are used to distribute a multitask convolutional neural network (MT-CNN) NLP model among cancer registries. The model is trained to extract six key cancer characteristics (tumor site, subsite, laterality, behavior, histology, and grade) from cancer pathology reports. Using 410,064 pathology documents from two cancer registries, we compare our proposed approach to conventional transfer learning without privacy preservation, single-registry models, and a model trained on centrally hosted data. The results show that the transfer learning approaches, including both data sharing and model distribution, significantly outperform the single-registry model. In addition, the best-performing privacy-preserving model distribution approach achieves statistically indistinguishable average micro- and macro-F1 scores across all extraction tasks (0.823, 0.580) compared with the centralized model (0.827, 0.585).
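One plausible reading of privacy-preserving model distribution (a hedged sketch, not the paper's exact protocol) is to withhold the vocabulary-bearing embedding layer when exporting the model, so the receiving registry attaches an embedding built over its own vocabulary and fine-tunes locally:

```python
# Sketch: ship a trained model WITHOUT its embedding layer, since the
# embedding's vocabulary may encode patient identifiers. The receiving
# registry keeps its own freshly initialized embedding and fine-tunes.
# This is one illustrative approach, not the paper's exact method.
import torch

def export_without_embedding(model, path):
    state = {k: v for k, v in model.state_dict().items()
             if not k.startswith("emb.")}  # drop vocabulary-bearing weights
    torch.save(state, path)

def import_at_registry(model, path):
    state = torch.load(path)
    # strict=False: the embedding stays locally initialized at the receiver.
    model.load_state_dict(state, strict=False)
    return model
```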
ABSTRACT
Pediatric brain and central nervous system tumors (PBCNSTs) are the most common solid tumors and the leading cause of disease-related death in US children. PBCNST incidence rates in Kentucky are significantly higher than in the United States as a whole, and are even higher among Kentucky's Appalachian children. To understand and eventually eliminate such disparities, population-based research is needed to gain a thorough understanding of the epidemiology and etiology of the disease. This multi-institutional, population-based retrospective cohort study is designed to identify factors associated with the high incidence of PBCNST in Kentucky, leveraging the infrastructure provided by the Kentucky Cancer Registry, its Virtual Tissue Repository (VTR), and the National Institutes of Health Gabriella Miller Kids First Data Resource Center (DRC). Spatiotemporal scan statistics have been used to explore geographic patterns of risk measured by standardized incidence ratios (SIRs) with 95% confidence intervals. The VTR is being used to collect biospecimens for the population-based cohort of PBCNST tissues, which are being sequenced by the Center for Data Driven Discovery in Biomedicine (D3b) at the Children's Hospital of Philadelphia (CHOP) with support from the Kids First DRC. After adjusting for demographic factors, we assess the potential relationship of these risk patterns to environmental factors. We identified regions in north-central and eastern Appalachian Kentucky where children experienced a significantly increased risk of developing PBCNST during 1995-2017 (SIR, 1.48; 95% CI, 1.34-1.62). The VTR has been successful in collecting a population-based cohort of 215 PBCNST specimens. Timely establishment of legal agreements for data sharing and tissue acquisition proved challenging, which has been somewhat mitigated by the adoption of national agreement templates. Coronavirus disease 2019 (COVID-19) severely limited the generation of sequencing results due to laboratory shutdowns. However, tissue specimens processed before the shutdown indicated that punches were inferior to scrolls for generating sufficient high-quality material for DNA and RNA extraction. The informatics infrastructures we developed have demonstrated the feasibility of our approach to generating and retrieving molecular results. Our study shows that population-based studies using historical tissue specimens are feasible and practical but require significant investments in technical infrastructure.
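The SIR computation itself is straightforward; below is a sketch with an exact Poisson confidence interval, where the counts are made up for illustration rather than taken from the study:

```python
# Sketch: standardized incidence ratio (SIR = observed / expected) with an
# exact Poisson 95% CI computed from chi-square quantiles.
from scipy.stats import chi2

def sir_with_ci(observed, expected, alpha=0.05):
    sir = observed / expected
    lower = chi2.ppf(alpha / 2, 2 * observed) / (2 * expected)
    upper = chi2.ppf(1 - alpha / 2, 2 * (observed + 1)) / (2 * expected)
    return sir, lower, upper

# Illustrative counts only (not the study's data):
print(sir_with_ci(observed=148, expected=100))  # ~ (1.48, 1.25, 1.74)
```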
Subject(s)
COVID-19 , Central Nervous System Neoplasms , Brain , Central Nervous System Neoplasms/epidemiology , Child , Humans , Incidence , Informatics , Kentucky/epidemiology , Registries , Retrospective Studies , SARS-CoV-2 , United States
ABSTRACT
PURPOSE: To audit and improve the completeness of the hierarchic (or is-a) relations of the National Cancer Institute (NCI) Thesaurus to support its role as a faceted system for querying cancer registry data. METHODS: We performed quality auditing of the 19.01d version of the NCI Thesaurus. Our hybrid auditing method consisted of three main steps: computing nonlattice subgraphs, constructing lexical features for concepts in each subgraph, and performing subsumption reasoning with each subgraph to automatically suggest potentially missing is-a relations. RESULTS: A total of 9,512 nonlattice subgraphs were obtained. Our method identified 925 potentially missing is-a relations in 441 nonlattice subgraphs; 72 of 176 reviewed samples were confirmed as valid missing is-a relations and have been incorporated in the newer versions of the NCI Thesaurus. CONCLUSION: Autosuggested changes resulting from our auditing method can improve the structural organization of the NCI Thesaurus in supporting its new role for faceted query.
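The nonlattice computation can be sketched as follows: a concept pair is a nonlattice pair if its shared descendants contain more than one maximal element (the toy edges below are illustrative, not NCIt content):

```python
# Sketch of nonlattice-pair detection in an is-a hierarchy modeled as a
# DAG with edges pointing parent -> child.
import networkx as nx

def maximal_common_descendants(g, a, b):
    shared = nx.descendants(g, a) & nx.descendants(g, b)
    # A shared descendant is maximal if no other shared descendant is
    # its ancestor (i.e., nothing in the set sits above it).
    return {m for m in shared
            if not any(m in nx.descendants(g, s) for s in shared - {m})}

def is_nonlattice_pair(g, a, b):
    return len(maximal_common_descendants(g, a, b)) > 1

# Toy hierarchy: X and Y form a nonlattice pair because both C and D are
# maximal shared descendants.
g = nx.DiGraph([("X", "C"), ("X", "D"), ("Y", "C"), ("Y", "D")])
print(is_nonlattice_pair(g, "X", "Y"))  # True
```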
Subject(s)
Neoplasms , Vocabulary, Controlled , Humans , National Cancer Institute (U.S.) , Neoplasms/epidemiology , Registries , United States
ABSTRACT
Automated text information extraction from cancer pathology reports is an active area of research to support national cancer surveillance. A well-known challenge is how to develop information extraction tools with robust performance across cancer registries. In this study, we investigated whether transfer learning (TL) with a convolutional neural network (CNN) can facilitate cross-registry knowledge sharing. Specifically, we performed a series of experiments to determine whether a CNN trained on single-registry data is capable of transferring knowledge to another registry, or whether developing a cross-registry knowledge database produces a more effective and generalizable model. Using data from two cancer registries, with primary tumor site and topography as the information extraction task of interest, our study showed that TL results in 6.90% and 17.22% improvements in classification macro F-score over the baseline single-registry models. Detailed analysis illustrated that the observed improvement is most evident in the low-prevalence classes.
ABSTRACT
PURPOSE: SEER registries do not report results of epidermal growth factor receptor (EGFR) and anaplastic lymphoma kinase (ALK) mutation tests. To facilitate population-based research in molecularly defined subgroups of non-small-cell lung cancer (NSCLC), we assessed the validity of natural language processing (NLP) for the ascertainment of EGFR and ALK testing from electronic pathology (e-path) reports of NSCLC cases included in two SEER registries: the Cancer Surveillance System (CSS) and the Kentucky Cancer Registry (KCR). METHODS: We obtained 4,278 e-path reports from 1,634 patients who were diagnosed with stage IV nonsquamous NSCLC from September 1, 2011, to December 31, 2013, included in CSS. We used 855 CSS reports to train NLP systems for the ascertainment of EGFR and ALK test status (reported v not reported) and test results (positive v negative). We assessed sensitivity, specificity, and positive and negative predictive values in an internal validation sample of 3,423 CSS e-path reports and repeated the analysis in an external sample of 1,041 e-path reports from 565 KCR patients. Two oncologists manually reviewed all e-path reports to generate gold-standard data sets. RESULTS: NLP systems yielded internal validity metrics that ranged from 0.95 to 1.00 for EGFR and ALK test status and results in CSS e-path reports. NLP showed high internal accuracy for the ascertainment of EGFR and ALK in CSS patients (F scores of 0.95 and 0.96, respectively). In the external validation analysis, NLP yielded metrics that ranged from 0.02 to 0.96 in KCR reports and F scores of 0.70 and 0.72, respectively, in KCR patients. CONCLUSION: NLP is an internally valid method for the ascertainment of EGFR and ALK test information from e-path reports available in SEER registries, but future work is necessary to increase NLP external validity.
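For illustration, a toy rule-based ascertainer for test status and results; the study used trained NLP systems, so the patterns below are assumptions for exposition, not the systems evaluated:

```python
# Toy sketch of the ascertainment task: detect whether an EGFR/ALK test is
# reported at all, then classify the result. Negation is checked before
# positive cues so "no mutation detected" is not misread as positive.
import re

TEST_PAT = re.compile(r"\b(EGFR|ALK)\b", re.I)
NEG_PAT = re.compile(r"\b(EGFR|ALK)\b[^.]*\b(negative|no mutation|not detected)\b", re.I)
POS_PAT = re.compile(r"\b(EGFR|ALK)\b[^.]*\b(positive|mutation detected|rearrangement)\b", re.I)

def ascertain(report):
    """Return (test status, test result) for a single e-path report."""
    if not TEST_PAT.search(report):
        return "not reported", None
    if NEG_PAT.search(report):
        return "reported", "negative"
    if POS_PAT.search(report):
        return "reported", "positive"
    return "reported", "unknown"

print(ascertain("EGFR exon 19: no mutation detected."))  # reported, negative
```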
Subject(s)
Anaplastic Lymphoma Kinase/genetics , Carcinoma, Non-Small-Cell Lung/diagnosis , Carcinoma, Non-Small-Cell Lung/etiology , Lung Neoplasms/diagnosis , Lung Neoplasms/etiology , Mutation , Natural Language Processing , Adult , Aged , Algorithms , Carcinoma, Non-Small-Cell Lung/epidemiology , DNA Mutational Analysis , ErbB Receptors/genetics , Female , Genetic Testing , Humans , Kentucky/epidemiology , Lung Neoplasms/epidemiology , Machine Learning , Male , Middle Aged , Population Surveillance , Registries , Reproducibility of Results , SEER Program
ABSTRACT
BACKGROUND: Several variables are associated with the likelihood of isocitrate dehydrogenase 1 or 2 (IDH1/2) mutation in gliomas, though no guidelines yet exist for when testing is warranted, especially when an R132H IDH1 immunostain is negative. METHODS: A cohort of 89 patients was used to build IDH1/2 mutation prediction models in World Health Organization grades II-IV gliomas, and an external cohort of 100 patients was used for validation. Logistic regression and backward model selection with the Akaike information criterion were used to develop prediction models. RESULTS: A multivariable model, incorporating patient age, glioblastoma multiforme diagnosis, and prior history of grade II or III glioma, was developed to predict IDH1/2 mutation probability. This model generated an area under the curve (AUC) of 0.934 (95% CI: 0.878, 0.978) in the external validation cohort and 0.941 (95% CI: 0.918, 0.962) in the cohort of The Cancer Genome Atlas. When R132H IDH1 immunostain information was added, AUC increased to 0.986 (95% CI: 0.967, 0.998). This model had an AUC of 0.947 (95% CI: 0.891, 0.995) in predicting whether an R132H IDH1 immunonegative case harbored a less common IDH1 or IDH2 mutation. The models were also 94% accurate in predicting IDH1/2 mutation status in gliomas from The Cancer Genome Atlas. An interactive web-based application for calculating the probability of an IDH1/2 mutation is now available using these models. CONCLUSIONS: We have integrated multiple variables to generate a probability of an IDH1/2 mutation. The associated web-based application can help triage diffuse gliomas that would benefit from mutation testing in both clinical and research settings.
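The underlying calculation is a standard logistic model; in the sketch below the coefficients are placeholders, not the published estimates, so real predictions should come from the authors' web application:

```python
# Sketch of the probability calculation behind such a model. The
# coefficients are hypothetical stand-ins chosen only to show the
# direction of the reported effects (younger age and a prior grade II/III
# glioma raise the probability; a GBM diagnosis lowers it).
import math

def idh_mutation_prob(age, is_gbm, prior_grade2_3,
                      b0=6.0, b_age=-0.08, b_gbm=-2.5, b_prior=2.0):
    logit = b0 + b_age * age + b_gbm * is_gbm + b_prior * prior_grade2_3
    return 1 / (1 + math.exp(-logit))

print(round(idh_mutation_prob(age=35, is_gbm=0, prior_grade2_3=1), 3))
```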
Subject(s)
Biomarkers, Tumor/genetics , Brain Neoplasms/diagnosis , Glioma/diagnosis , Isocitrate Dehydrogenase/genetics , Models, Statistical , Mutation/genetics , Adult , Aged , Aged, 80 and over , Brain Neoplasms/genetics , Cohort Studies , DNA Mutational Analysis , Female , Follow-Up Studies , Glioma/genetics , Humans , Male , Middle Aged , Neoplasm Grading , Prognosis , Validation Studies as Topic , Young Adult
ABSTRACT
Although registry-specific requirements exist, cancer registries primarily identify reportable cases using a combination of particular ICD-O-3 topography and morphology codes assigned to cancer case abstracts, of which free-text pathology reports form a main component. The codes are generally extracted from pathology reports by trained human coders, sometimes with the help of software programs. Here we present results that improve on the state of the art in automatic extraction of 57 generic sites from pathology reports using three representative machine learning algorithms for text classification. We use a dataset of 56,426 reports arising from 35 labs that report to the Kentucky Cancer Registry. Employing unigrams, bigrams, and named entities as features, our methods achieve a class-based micro F-score of 0.90 and a macro F-score of 0.72. To our knowledge, this is the best result on extracting ICD-O-3 codes from pathology reports using a large number of possible codes. Given the large dataset we use (compared with other similar efforts), with reports from 35 different labs, we also expect our final models to generalize better when extracting primary sites from previously unseen reports.
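A minimal sketch of the described setup, with unigram and bigram features and one representative classifier (named-entity features omitted, and toy data in place of the registry corpus):

```python
# Sketch: TF-IDF unigram+bigram features feeding a linear classifier for
# ICD-O-3 primary-site codes, scored with micro and macro F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy stand-ins for pathology report text and primary-site labels.
reports = ["infiltrating ductal carcinoma of the breast",
           "adenocarcinoma of the prostate gland",
           "squamous cell carcinoma, left upper lobe of lung",
           "lobular carcinoma, right breast"]
sites = ["C50", "C61", "C34", "C50"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams
                    LogisticRegression(max_iter=1000))
clf.fit(reports, sites)

pred = clf.predict(reports)
print(f1_score(sites, pred, average="micro"),
      f1_score(sites, pred, average="macro"))
```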