ABSTRACT
BACKGROUND: Applying graph convolutional networks (GCN) to the classification of free-form natural language texts using graph-of-words features (TextGCN) has been studied and confirmed to be an effective means of describing complex natural language texts. However, text classification models based on TextGCN have weaknesses in terms of memory consumption and model dissemination and distribution. In this paper, we present a fast message passing network (FastMPN), implementing a GCN with a message passing architecture that provides versatility and flexibility by allowing trainable node embeddings and edge weights, helping the GCN model find a better solution. We applied the FastMPN model to the task of clinical information extraction from cancer pathology reports, extracting the following six properties: main site, subsite, laterality, histology, behavior, and grade. RESULTS: We evaluated the clinical task performance of the FastMPN models in terms of micro- and macro-averaged F1 scores. A comparison was performed with the multi-task convolutional neural network (MT-CNN) model. Results show that the FastMPN model is equivalent to or better than the MT-CNN. CONCLUSIONS: Our implementation revealed that our FastMPN model, which is based on the PyTorch platform, can train on a large corpus (667,290 training samples) with 202,373 unique words in less than 3 minutes per epoch using one NVIDIA V100 hardware accelerator. Our experiments demonstrated that using this implementation, the clinical task performance scores of information extraction related to tumors from cancer pathology reports were highly competitive.
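The two averaging schemes named in the results can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `f1_scores` and the toy labels are assumptions made for illustration. Micro-averaging pools every decision, so frequent classes dominate; macro-averaging gives each class equal weight, so a poorly handled rare class pulls the score down.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Made-up labels: one frequent class and one rare class.
y_true = ["C50"] * 8 + ["C34"] * 2
y_pred = ["C50"] * 8 + ["C50", "C34"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the single error falls on the rare class, so the macro score is lower than the micro score.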
Subject(s)
Natural Language Processing, Neoplasms, Neural Networks, Computer, Humans, Neoplasms/classification, Data Mining
ABSTRACT
One of the challenges associated with understanding environmental impacts on cancer risk and outcomes is estimating potential exposures of individuals diagnosed with cancer to adverse environmental conditions over the life course. Historically, this has been partly due to the lack of reliable measures of cancer patients' potential environmental exposures before a cancer diagnosis. The emerging sources of cancer-related spatiotemporal environmental data and residential history information, coupled with novel technologies for data extraction and linkage, present an opportunity to integrate these data into the existing cancer surveillance data infrastructure, thereby facilitating more comprehensive assessment of cancer risk and outcomes. In this paper, we performed a landscape analysis of the available environmental data sources that could be linked to historical residential address information of cancer patients' records collected by the National Cancer Institute's Surveillance, Epidemiology, and End Results Program. The objective is to enable researchers to use these data to assess potential exposures at the time of cancer initiation through the time of diagnosis and even after diagnosis. The paper addresses the challenges associated with data collection and completeness at various spatial and temporal scales, as well as opportunities and directions for future research.
Subject(s)
Environmental Exposure, Neoplasms, SEER Program, Humans, SEER Program/statistics & numerical data, Neoplasms/epidemiology, Neoplasms/etiology, Environmental Exposure/adverse effects, United States/epidemiology, Databases, Factual, National Cancer Institute (U.S.), Data Collection/methods, Information Sources
ABSTRACT
The National Cancer Institute and the Department of Energy strategic partnership applies advanced computing and predictive machine learning and deep learning models to automate the capture of information from unstructured clinical text for inclusion in cancer registries. Applications include extraction of key data elements from pathology reports, determination of whether a pathology or radiology report is related to cancer, extraction of relevant biomarker information, and identification of recurrence. With the growing complexity of cancer diagnosis and treatment, capturing essential information with purely manual methods is increasingly difficult. These new methods for applying advanced computational capabilities to automate data extraction represent an opportunity to close critical information gaps and create a nimble, flexible platform on which new information sources, such as genomics, can be added. This will ultimately provide a deeper understanding of the drivers of cancer and outcomes in the population and increase the timeliness of reporting. These advances will enable better understanding of how real-world patients are treated and the outcomes associated with those treatments in the context of our complex medical and social environment.
Subject(s)
Deep Learning, Machine Learning, Neoplasms, Humans, Neoplasms/diagnosis, Neoplasms/epidemiology, United States/epidemiology, Registries, National Cancer Institute (U.S.)
ABSTRACT
Although the Surveillance, Epidemiology, and End Results (SEER) Program has maintained high standards of quality and completeness, the traditional data captured through population-based cancer surveillance are no longer sufficient to understand the impact of cancer and its outcomes. Therefore, in recent years, the SEER Program has expanded the population it covers and enhanced the types of data that are being collected. Traditionally, surveillance systems collected data characterizing the patient and their cancer at the time of diagnosis, as well as limited information on the initial course of therapy. SEER performs active follow-up on cancer patients from diagnosis until death, ascertaining critical information on mortality and survival over time. With the growth of precision oncology and rapid development and dissemination of new diagnostics and treatments, the limited data that registries have traditionally captured around the time of diagnosis, although useful for characterizing the cancer, are insufficient for understanding why similar patients may have different outcomes. The molecular composition of the tumor and genetic factors such as BRCA status affect the patient's treatment response and outcomes. Capturing and stratifying by these critical risk factors are essential if we are to understand differences in outcomes among patients who may be demographically similar, have the same cancer, be diagnosed at the same stage, and receive the same treatment. In addition to the tumor characteristics, it is essential to understand all the therapies that a patient receives over time, not only for the initial treatment period but also if the cancer recurs or progresses. Capturing this subsequent therapy is critical not only for research but also to help patients understand their risk at the time of therapeutic decision making.
This article serves as an introduction and foundation for a JNCI Monograph with specific articles focusing on innovative new methods and processes implemented or under development for the SEER Program. The following sections describe the need to evaluate the SEER Program and provide a summary or introduction of those key enhancements that have been or are in the process of being implemented for SEER.
Subject(s)
Neoplasms, SEER Program, Humans, SEER Program/statistics & numerical data, Neoplasms/therapy, Neoplasms/epidemiology, Neoplasms/diagnosis, United States/epidemiology, Population Surveillance
ABSTRACT
BACKGROUND: The Surveillance, Epidemiology, and End Results (SEER) Program with the National Cancer Institute tested whether population-based cancer registries can serve as honest brokers to acquire tissue and data in the SEER-Linked Virtual Tissue Repository (VTR) Pilot. METHODS: We collected formalin-fixed, paraffin-embedded tissue and clinical data from patients with pancreatic ductal adenocarcinoma (PDAC) and breast cancer (BC) for two studies comparing cancer cases with highly unusual survival (≥5 years for PDAC and ≤30 months for BC) to pair-matched controls with usual survival (≤2 years for PDAC and ≥5 years for BC). Success was defined as the ability for registries to acquire tissue and data on cancer cases with highly unusual outcomes. RESULTS: Of 98 PDAC and 103 BC matched cases eligible for tissue collection, sources of attrition for tissue collection were tissue being unavailable, control paired with failed case, second control that was not requested, tumor necrosis ≥20%, and low tumor cellularity. In total, tissue meeting the study criteria was obtained for 70 (71%) PDAC and 74 (72%) BC matched cases. For patients with tissue received, clinical data completeness ranged from 59% for CA-19-9 after treatment to >95% for margin status, whether radiation therapy and chemotherapy were administered, and comorbidities. CONCLUSIONS: The VTR Pilot demonstrated the feasibility of using SEER cancer registries as honest brokers to provide tissue and clinical data for secondary use in research. Studies using this program should oversample by 45% to 50% to obtain sufficient sample size and targeted population representation and involve subspecialty matter expert pathologists for tissue selection.
Subject(s)
Breast Neoplasms, Carcinoma, Pancreatic Ductal, Pancreatic Neoplasms, SEER Program, Humans, Female, Pilot Projects, Carcinoma, Pancreatic Ductal/therapy, Carcinoma, Pancreatic Ductal/pathology, United States/epidemiology, Male, Breast Neoplasms/therapy, Breast Neoplasms/pathology, Breast Neoplasms/epidemiology, Pancreatic Neoplasms/therapy, Pancreatic Neoplasms/pathology, Pancreatic Neoplasms/epidemiology, Middle Aged, Aged, National Cancer Institute (U.S.), Tissue Banks, Registries, Adult, Case-Control Studies
ABSTRACT
BACKGROUND: Precision medicine has become a mainstay of cancer care in recent years. The National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) Program has been an authoritative source of cancer statistics and data since 1973. However, tumor genomic information has not been adequately captured in the cancer surveillance data, which impedes population-based research on molecular subtypes. To address this, the SEER Program has developed and implemented a centralized process to link SEER registries' tumor cases with genomic test results that are provided by molecular laboratories to the registries. METHODS: Data linkages were carried out following operating procedures for centralized linkages established by the SEER Program. The linkages used Match*Pro, a probabilistic linkage software, and were facilitated by the registries' trusted third party (an honest broker). The SEER registries provide to NCI limited datasets that undergo preliminary evaluation prior to their release to the research community. RESULTS: Recently conducted genomic linkages included OncotypeDX Breast Recurrence Score, OncotypeDX Breast Ductal Carcinoma in Situ, OncotypeDX Genomic Prostate Score, Decipher Prostate Genomic Classifier, DecisionDX Uveal Melanoma, DecisionDX Preferentially Expressed Antigen in Melanoma, DecisionDX Melanoma, and germline test results in Georgia and California SEER registries. CONCLUSIONS: The linkages of cancer cases from SEER registries with genomic test results obtained from molecular laboratories offer an effective approach for data collection in cancer surveillance. By providing de-identified data to the research community, the NCI's SEER Program enables scientists to investigate numerous research inquiries.
Subject(s)
Genomics, Neoplasms, Registries, SEER Program, Humans, SEER Program/statistics & numerical data, United States/epidemiology, Neoplasms/genetics, Neoplasms/epidemiology, Neoplasms/diagnosis, Genomics/methods, Registries/statistics & numerical data, Female, Male, Genetic Testing/methods, Genetic Testing/statistics & numerical data, Medical Record Linkage/methods, National Cancer Institute (U.S.)
ABSTRACT
BACKGROUND: The National Cancer Institute funds many large cohort studies that rely on self-reported cancer data requiring medical record validation. This is labor intensive, costly, and prone to underreporting or misreporting of cancer and disparity-related differential response. US population-based central cancer registries identify incident cancer within their catchment area, yielding all malignant neoplasms and benign brain and central nervous system tumors with standardized data fields. This manuscript describes the development, implementation, and features of a system to facilitate linkage between cohort studies and cancer registries and the release of cancer registry data for matched cohort participants. METHODS: The Virtual Pooled Registry-Cancer Linkage System (VPR-CLS) provides an online system to link cohorts with multiple state cancer registries by 1) securely transmitting a study file to registries, 2) providing an optimized linkage algorithm to generate preliminary match counts, and 3) providing a streamlined process and templated forms for submitting and tracking data requests for cohort participants who matched with registries. RESULTS: In 2022, the VPR-CLS launched with 45 registries, covering 95% of the US state populations and Puerto Rico. Registries have linked with 15 studies having 14,273 to 10.9 million participants. Except in 1 study, linkage sensitivity ranged from 87.0% to 99.9%. Numerous registries have adopted the VPR-CLS templated institutional review board-registry application (n = 39), templated data use agreement (n = 25), and central institutional review board (n = 16). CONCLUSIONS: The VPR-CLS markedly improves ascertainment of cancer outcomes and is the preferred approach for determination of outcomes from cohort studies, postmarketing surveillance, and clinical trials.
Subject(s)
Medical Record Linkage, Neoplasms, Registries, Humans, Registries/statistics & numerical data, Neoplasms/epidemiology, Neoplasms/diagnosis, United States/epidemiology, Medical Record Linkage/methods, Cohort Studies, National Cancer Institute (U.S.)
ABSTRACT
PURPOSE: This study assessed the prevalence of specific major adverse financial events (AFEs), namely bankruptcies, liens, and evictions, before a cancer diagnosis and their association with later-stage cancer at diagnosis. METHODS: Patients age 20-69 years diagnosed with cancer during 2014-2015 were identified from the Seattle, Louisiana, and Georgia SEER population-based cancer registries. Registry data were linked with LexisNexis consumer data to identify patients with a history of court-documented AFEs before cancer diagnosis. The association of AFEs and later-stage cancer diagnoses (stages III/IV) was assessed using separate sex-specific multivariable logistic regression. RESULTS: Among 101,649 patients with cancer linked to LexisNexis data, 36,791 (36.2%) had a major AFE reported before diagnosis. The mean and median timing of the AFE closest to diagnosis were 93 and 77 months, respectively. AFEs were most common among non-Hispanic Black, unmarried, and low-income patients. Individuals with previous AFEs were more likely to be diagnosed with later-stage cancer than individuals with no AFE (males: odds ratio [OR], 1.09 [95% CI, 1.03 to 1.14]; P < .001; females: OR, 1.18 [95% CI, 1.13 to 1.24]; P < .0001) after adjusting for age, race, marital status, income, registry, and cancer type. Associations between AFEs prediagnosis and later-stage disease did not vary by AFE timing. CONCLUSION: One third of newly diagnosed patients with cancer had a major AFE before their diagnosis. Patients with AFEs were more likely to have later-stage diagnosis, even accounting for traditional measures of socioeconomic status that influence the stage at diagnosis. The prevalence of prediagnosis AFEs underscores financial vulnerability of patients with cancer before their diagnosis, before any subsequent financial burden associated with cancer treatment.
Subject(s)
Black or African American, Neoplasms, Adult, Aged, Female, Humans, Male, Middle Aged, Young Adult, Georgia/epidemiology, Neoplasms/diagnosis, Neoplasms/epidemiology, Registries, United States/epidemiology
ABSTRACT
INTRODUCTION: Machine learning algorithms are expected to work side-by-side with humans in decision-making pipelines. Thus, the ability of classifiers to make reliable decisions is of paramount importance. Deep neural networks (DNNs) represent the state-of-the-art models to address real-world classification. Although the strength of activation in DNNs is often correlated with the network's confidence, in-depth analyses are needed to establish whether they are well calibrated. METHOD: In this paper, we demonstrate the use of DNN-based classification tools to benefit cancer registries by automating information extraction of disease at diagnosis and at surgery from electronic text pathology reports from the US National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) population-based cancer registries. In particular, we introduce multiple methods for selective classification to achieve a target level of accuracy on multiple classification tasks while minimizing the rejection amount, that is, the number of electronic pathology reports for which the model's predictions are unreliable. We evaluate the proposed methods by comparing our approach with the current in-house deep learning-based abstaining classifier. RESULTS: Overall, all the proposed selective classification methods effectively allow for achieving the targeted level of accuracy or higher in a trade-off analysis aimed to minimize the rejection rate. On in-distribution validation and holdout test data, with all the proposed methods, we achieve on all tasks the required target level of accuracy with a lower rejection rate than the deep abstaining classifier (DAC). Interpreting the results for the out-of-distribution test data is more complex; nevertheless, in this case as well, the rejection rate from the best among the proposed methods achieving 97% accuracy or higher is lower than the rejection rate based on the DAC.
CONCLUSIONS: We show that although both approaches can flag those samples that should be manually reviewed and labeled by human annotators, the newly proposed methods retain a larger fraction and do so without retraining, thus offering a reduced computational cost compared with the in-house deep learning-based abstaining classifier.
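The trade-off between target accuracy and rejection rate can be sketched with one simple form of selective classification: thresholding a per-report confidence score on held-out data. This is an illustrative assumption, not one of the methods proposed in the paper; `pick_threshold` and the toy scores are hypothetical.

```python
def pick_threshold(confidences, correct, target_accuracy):
    """Find the lowest confidence threshold whose retained set meets the target.

    confidences: model confidence per report (e.g., top softmax probability)
    correct:     1 if the prediction was right, 0 otherwise (validation data)
    Returns (threshold, rejection_rate), or (None, 1.0) if no threshold works.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    best = (None, 1.0)
    hits = 0
    for kept, (conf, ok) in enumerate(ranked, start=1):
        hits += ok
        # Evaluate only where the confidence value changes, so tied
        # reports are always retained or rejected together.
        is_boundary = kept == n or ranked[kept][0] < conf
        if is_boundary and hits / kept >= target_accuracy:
            best = (conf, 1 - kept / n)  # a later match retains more reports
    return best

# Made-up validation scores for six reports.
confidences = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
correct = [1, 1, 1, 0, 1, 0]
threshold, rejection = pick_threshold(confidences, correct, target_accuracy=0.75)
print(threshold, round(rejection, 3))
```

Accuracy on the retained set is not monotone in the threshold, which is why the loop scans every cut point instead of stopping at the first failure. Because the threshold is chosen after training, this kind of post hoc selection needs no retraining.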
Subject(s)
Deep Learning, Humans, Uncertainty, Neural Networks, Computer, Algorithms, Machine Learning
ABSTRACT
INTRODUCTION: Health care procedures including cancer screening and diagnosis were interrupted due to the COVID-19 pandemic. The extent of this impact on cancer care in the United States is not fully understood. We investigated pathology report volume as a reflection of trends in oncology services pre-pandemic and during the pandemic. METHODS: Electronic pathology reports were obtained from 11 U.S. central cancer registries from NCI's SEER Program. The reports were sorted by cancer site and document type using a validated algorithm. Joinpoint regression was used to model temporal trends from January 2018 to February 2020, project expected counts from March 2020 to February 2021, and calculate observed-to-expected (O/E) ratios. Results were stratified by sex, age, cancer site, and report type. RESULTS: During the first 3 months of the pandemic, pathology report volume decreased by 25.5% and 17.4% for biopsy and surgery reports, respectively. The 12-month O/E ratio (March 2020-February 2021) was lowest for women (O/E 0.90) and patients 65 years and older (O/E 0.91) and lower for cancers with screening (melanoma skin, O/E 0.86; breast, O/E 0.88; lung, O/E 0.89; prostate, O/E 0.90; colorectal, O/E 0.91) when compared with all other cancers combined. CONCLUSIONS: These findings indicate a decrease in cancer diagnosis, likely due to the COVID-19 pandemic. This decrease in the number of pathology reports may result in a stage shift causing a subsequent longer-term impact on survival patterns. IMPACT: Investigation on the longer-term impact of the pandemic on pathology services is vital to understand if cancer care delivery levels continue to be affected.
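The observed-to-expected calculation can be sketched as follows. As a loudly labeled simplification, an ordinary least-squares line stands in for the joinpoint regression used in the study, and all counts are invented; `fit_line` and `observed_to_expected` are hypothetical names.

```python
def fit_line(ys):
    """Ordinary least-squares fit y = intercept + slope * t for t = 0..len(ys)-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

def observed_to_expected(baseline, observed):
    """Project the baseline trend forward and compare it with observed counts."""
    intercept, slope = fit_line(baseline)
    start = len(baseline)
    expected = [intercept + slope * (start + i) for i in range(len(observed))]
    return sum(observed) / sum(expected)

# Invented monthly report counts: four baseline months, two pandemic months.
baseline = [100, 102, 104, 106]
observed = [81, 99]
oe_ratio = observed_to_expected(baseline, observed)
print(f"O/E = {oe_ratio:.2f}")
```

A ratio below 1 means fewer reports were observed than the pre-pandemic trend would predict.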
Subject(s)
COVID-19, Melanoma, Male, Humans, Female, United States/epidemiology, SEER Program, Pandemics, Incidence, COVID-19/epidemiology, Registries
ABSTRACT
Data-driven basic, translational, and clinical research has resulted in improved outcomes for children, adolescents, and young adults (AYAs) with pediatric cancers. However, challenges in sharing data between institutions, particularly in research, prevent addressing substantial unmet needs in children and AYA patients diagnosed with certain pediatric cancers. Systematically collecting and sharing data from every child and AYA can enable greater understanding of pediatric cancers, improve survivorship, and accelerate development of new and more effective therapies. To accomplish this goal, the Childhood Cancer Data Initiative (CCDI) was launched in 2019 at the National Cancer Institute. CCDI is a collaborative community endeavor supported by a 10-year, $50-million (in US dollars) annual federal investment. CCDI aims to learn from every patient diagnosed with a pediatric cancer by designing and building a data ecosystem that facilitates data collection, sharing, and analysis for researchers, clinicians, and patients across the cancer community. For example, CCDI's Molecular Characterization Initiative provides comprehensive clinical molecular characterization for children and AYAs with newly diagnosed cancers. Through these efforts, the CCDI strives to provide clinical benefit to patients and improvements in diagnosis and care through data-focused research support and to build expandable, sustainable data resources and workflows to advance research well past the planned 10 years of the initiative. Importantly, if CCDI demonstrates the success of this model for pediatric cancers, similar approaches can be applied to adults, transforming both clinical research and treatment to improve outcomes for all patients with cancer.
Subject(s)
Neoplasms, Adolescent, United States/epidemiology, Humans, Child, Young Adult, Neoplasms/therapy, Ecosystem, Data Collection, National Cancer Institute (U.S.)
ABSTRACT
This retrospective observational study aimed to gain a better understanding of the protective duration of prior SARS-CoV-2 infection against reinfection. The objectives were two-fold: to assess the durability of immunity to SARS-CoV-2 reinfection among initially unvaccinated individuals with previous SARS-CoV-2 infection, and to evaluate the crude SARS-CoV-2 reinfection rate and associated risk factors. During the pandemic period from February 29, 2020, through April 30, 2021, 144,678,382 individuals with SARS-CoV-2 molecular diagnostic or antibody test results were studied. Rates of reinfection among index-positive individuals were compared to rates of infection among index-negative individuals. Factors associated with reinfection were evaluated using multivariable logistic regression. For both objectives, the outcome was a subsequent positive molecular diagnostic test result. Consistent with prior findings, the risk of reinfection among index-positive individuals was 87% lower than the risk of infection among index-negative individuals. The duration of protection against reinfection was stable over the median 5 months and up to 1-year follow-up interval. Factors associated with an increased reinfection risk included older age, comorbid immunologic conditions, and living in congregate care settings; healthcare workers had a decreased reinfection risk. This large US population-based study suggests that infection-induced immunity is durable for variants circulating pre-Delta predominance.
Subject(s)
COVID-19, SARS-CoV-2, Humans, Reinfection/epidemiology, COVID-19/epidemiology, Antibodies, Health Personnel
ABSTRACT
Objective: We aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports. Materials and Methods: We consider the text classification problem that involves 5 individual tasks. The baseline model consists of a multitask convolutional neural network (MtCNN), and the implemented ensemble (teacher) consists of 1000 MtCNNs. We performed knowledge transfer by training a single model (student) with soft labels derived through the aggregation of ensemble predictions. We evaluate performance based on accuracy and abstention rates by using softmax thresholding. Results: The student model outperforms the baseline MtCNN in terms of abstention rates and accuracy, thereby allowing the model to be used with a larger volume of documents when deployed. The highest boost was observed for subsite and histology, for which the student model classified an additional 1.81% reports for subsite and 3.33% reports for histology. Discussion: Ensemble predictions provide a useful strategy for quantifying the uncertainty inherent in labeled data and thereby enable the construction of soft labels with estimated probabilities for multiple classes for a given document. Training models with the derived soft labels reduces model confidence in difficult-to-classify documents, thereby leading to a reduction in the number of highly confident wrong predictions. Conclusions: Ensemble model distillation is a simple tool to reduce model overconfidence in problems with extreme class imbalance and noisy datasets. These methods can facilitate the deployment of deep learning models in high-risk domains with low computational resources where minimizing inference time is required.
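The distillation steps described above can be sketched in a few lines. The abstract does not say how the ensemble predictions were aggregated, so simple averaging is an assumption here, and the helper names (`soft_labels`, `soft_cross_entropy`, `accept`) and probabilities are hypothetical.

```python
import math

def soft_labels(ensemble_probs):
    """Average per-class probabilities over ensemble members for one document."""
    n_models = len(ensemble_probs)
    n_classes = len(ensemble_probs[0])
    return [sum(p[c] for p in ensemble_probs) / n_models for c in range(n_classes)]

def soft_cross_entropy(student_probs, target_probs, eps=1e-12):
    """Cross-entropy of student predictions against a soft target distribution."""
    return -sum(t * math.log(s + eps) for s, t in zip(student_probs, target_probs))

def accept(probs, threshold):
    """Softmax thresholding: return the top class, or None to abstain."""
    top = max(range(len(probs)), key=probs.__getitem__)
    return top if probs[top] >= threshold else None

# Three made-up teacher predictions that disagree on an ambiguous report.
teachers = [[0.9, 0.1, 0.0], [0.6, 0.4, 0.0], [0.3, 0.7, 0.0]]
target = soft_labels(teachers)

# An overconfident student is penalized more than one that matches the target.
loss_matched = soft_cross_entropy(target, target)
loss_overconfident = soft_cross_entropy([0.98, 0.01, 0.01], target)
print(target, round(loss_matched, 3), round(loss_overconfident, 3))
print(accept(target, threshold=0.7))
```

With a 0.7 threshold the averaged prediction abstains on this report, which is the intended behavior for a difficult-to-classify document.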
ABSTRACT
Follow-up of US cohort members for incident cancer is time-consuming, is costly, and often results in underascertainment when the traditional methods of self-reporting and/or medical record validation are used. We conducted one of the first large-scale investigations to assess the feasibility, methods, and benefits of linking participants in the US Radiologic Technologists (USRT) Study (n = 146,022) with the majority of US state or regional cancer registries. Follow-up of this cohort has relied primarily on questionnaires (mailed approximately every 10 years) and linkage with the National Death Index. We compared the level of agreement and completeness of questionnaire/death-certificate-based information with that of registry-based (43 registries) incident cancer follow-up in the USRT cohort. Using registry-identified first primary cancers from 1999-2012 as the gold standard, the overall sensitivity was 46.5% for self-reports only and 63.0% for both self-reports and death certificates. Among the 37.0% false-negative reports, 27.8% were due to dropout, while 9.2% were due to misreporting. The USRT cancer reporting patterns differed by cancer type. Our study indicates that linkage to state cancer registries would greatly improve completeness and accuracy of cancer follow-up in comparison with questionnaire self-reporting. These findings support ongoing development of a national US virtual pooled registry with which to streamline cohort linkages.
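The reported percentages fit together as a simple sensitivity calculation. The sketch below uses a hypothetical denominator of 1,000 registry-confirmed cancers chosen only to reproduce the published percentages; the counts themselves are not from the study.

```python
def sensitivity(true_positives, false_negatives):
    """Share of gold-standard cases that the follow-up method captured."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts per 1,000 registry-identified first primary cancers.
captured = 630       # found by self-report and/or death certificate (63.0%)
lost_dropout = 278   # missed because the participant dropped out (27.8%)
misreported = 92     # missed because of misreporting (9.2%)

sens = sensitivity(captured, lost_dropout + misreported)
print(f"sensitivity = {sens:.1%}")
```

The two false-negative sources add up to the 37.0% of cases missed, the complement of the 63.0% sensitivity.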
Subject(s)
Death Certificates, Neoplasms, Humans, Cohort Studies, Self Report, Incidence, Neoplasms/epidemiology, Registries
ABSTRACT
Objectives: The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, there has been no development of machine learning models for the ICCC classification. We developed deep learning-based information extraction models from cancer pathology reports based on the ICD-O-3 coding standard. In this article, we describe extending the models to perform ICCC classification. Materials and Methods: We developed 2 models, ICD-O-3 classification and ICCC recoding (Model 1) and direct ICCC classification (Model 2), and 4 scenarios subject to the training sample size. We evaluated these models with a corpus consisting of 29,206 reports with age at diagnosis between 0 and 19 from 6 state cancer registries. Results: Our findings suggest that the direct ICCC classification (Model 2) is substantially better than reusing the ICD-O-3 classification model (Model 1). Applying the uncertainty quantification mechanism to assess the confidence of the algorithm in assigning a code demonstrated that the model achieved a micro-F1 score of 0.987 while abstaining (not sufficiently confident to assign a code) on only 14.8% of ambiguous pathology reports. Conclusions: Our experimental results suggest that the machine learning-based automatic information extraction from childhood cancer pathology reports in the ICCC is a reliable means of supplementing human annotators at state cancer registries by reading and abstracting the majority of the childhood cancer pathology reports accurately and reliably.
ABSTRACT
IMPORTANCE: Better understanding of the protective duration of prior SARS-CoV-2 infection against reinfection is needed. OBJECTIVE: Primary: To assess the durability of immunity to SARS-CoV-2 reinfection among initially unvaccinated individuals with previous SARS-CoV-2 infection. Secondary: Evaluate the crude SARS-CoV-2 reinfection rate and associated characteristics. DESIGN AND SETTING: Retrospective observational study of HealthVerity data among 144,678,382 individuals, during the pandemic era through April 2021. PARTICIPANTS: Individuals studied had SARS-CoV-2 molecular diagnostic or antibody index test results from February 29 through December 9, 2020, with ≥365 days of pre-index continuous closed medical enrollment, claims, or electronic health record activity. MAIN OUTCOMES AND MEASURES: Rates of reinfection among index-positive individuals were compared to rates of infection among index-negative individuals. Factors associated with reinfection were evaluated using multivariable logistic regression. For both objectives, the outcome was a subsequent positive molecular diagnostic test result. RESULTS: Among 22,786,982 individuals with index SARS-CoV-2 laboratory test data (2,023,341 index positive), the crude rate of reinfection during follow-up was significantly lower (9.89/1,000 person-years) than that of primary infection (78.39/1,000 person-years). Consistent with prior findings, the risk of reinfection among index-positive individuals was 87% lower than the risk of infection among index-negative individuals (hazard ratio, 0.13; 95% CI, 0.13, 0.13). The cumulative incidence of reinfection among index-positive individuals and infection among index-negative individuals was 0.85% (95% CI: 0.82%, 0.88%) and 6.2% (95% CI: 6.1%, 6.3%), respectively, over follow-up of 375 days. The duration of protection against reinfection was stable over the median 5 months and up to 1-year follow-up interval.
Factors associated with an increased reinfection risk included older age, comorbid immunologic conditions, and living in congregate care settings; healthcare workers had a decreased reinfection risk. CONCLUSIONS AND RELEVANCE: This large US population-based study demonstrates that SARS-CoV-2 reinfection is uncommon among individuals with laboratory evidence of a previous infection. Protection from SARS-CoV-2 reinfection is stable up to one year. Reinfection risk was primarily associated with age 85+ years, comorbid immunologic conditions and living in congregate care settings; healthcare workers demonstrated a decreased reinfection risk. These findings suggest that infection-induced immunity is durable for variants circulating prior to Delta. KEY POINTS: Question: How long does prior SARS-CoV-2 infection provide protection against SARS-CoV-2 reinfection? Finding: Among >22 million individuals tested February 2020 through April 2021, the relative risk of reinfection among those with prior infection was 87% lower than the risk of infection among individuals without prior infection. This protection was durable for up to a year. Factors associated with increased likelihood of reinfection included older age (85+ years), comorbid immunologic conditions, and living in congregate care settings; healthcare workers had lower risk. Meaning: Prior SARS-CoV-2 infection provides a durable, high relative degree of protection against reinfection.
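The headline figures can be cross-checked with simple arithmetic on the values reported in this abstract. The sketch only restates those numbers. The crude rate ratio is unadjusted, so it need not equal the modeled hazard ratio exactly.

```python
# Values as reported in the abstract.
reinfection_rate = 9.89   # reinfections per 1,000 person-years (index positive)
primary_rate = 78.39      # infections per 1,000 person-years (index negative)
hazard_ratio = 0.13       # modeled hazard ratio, index positive vs negative

# Unadjusted comparison of the two crude rates.
crude_rate_ratio = reinfection_rate / primary_rate

# "87% lower risk" is the complement of the hazard ratio.
relative_reduction = 1 - hazard_ratio

print(f"crude rate ratio   = {crude_rate_ratio:.3f}")
print(f"relative reduction = {relative_reduction:.0%}")
```

The crude ratio of about 0.126 is close to the reported hazard ratio of 0.13, so the unadjusted rates and the modeled estimate tell the same story.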
ABSTRACT
Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominates. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e., keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
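A minimal sketch of a per-batch loss with contributions from both raw data and keywords is given below. The abstract does not say how the two contributions are combined, so the weighted sum, the `keyword_weight` parameter, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def batch_loss(data_batch, keyword_batch, keyword_weight=1.0):
    """Mean cross-entropy over documents plus a weighted mean over keywords.

    Each batch is a list of (predicted_probabilities, true_label) pairs.
    The weighted-sum form is an assumption, not taken from the paper.
    """
    data_term = sum(cross_entropy(p, y) for p, y in data_batch) / len(data_batch)
    keyword_term = sum(cross_entropy(p, y) for p, y in keyword_batch) / len(keyword_batch)
    return data_term + keyword_weight * keyword_term


# Hypothetical model outputs: two documents of a common class (label 0)
# and one keyword belonging to a rare class (label 1).
data_batch = [([0.9, 0.1], 0), ([0.8, 0.2], 0)]
keyword_batch = [([0.5, 0.5], 1)]

loss = batch_loss(data_batch, keyword_batch)
data_only = batch_loss(data_batch, keyword_batch, keyword_weight=0.0)
print(f"combined loss: {loss:.4f}, data-only loss: {data_only:.4f}")
```

With this form, a keyword that the model misclassifies raises the batch loss even when every full document in the batch belongs to a common class, which is one way the rare classes could receive a training signal in every batch.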
Subject(s)
Reproducibility of Results, Data Collection, Humans
ABSTRACT
Generating evidence on the use, effectiveness, and safety of new cancer therapies is a priority for researchers, health care providers, payers, and regulators given the rapid pace of change in cancer diagnosis and treatments. The use of real-world data (RWD) is integral to understanding the utilization patterns and outcomes of these new treatments among patients with cancer who are treated in clinical practice and community settings. An initial step in the use of RWD is careful study design to assess the suitability of an RWD source. This pivotal process can be guided by using a conceptual model that encourages predesign conceptualization. The primary types of RWD included are electronic health records, administrative claims data, cancer registries, and specialty data providers and networks. Careful consideration of each data type is necessary because they are collected for a specific purpose, capturing a set of data elements within a certain population for that purpose, and they vary by population coverage and longitudinality. In this review, the authors provide a high-level assessment of the strengths and limitations of each data category to inform data source selection appropriate to the study question. Overall, the development and accessibility of RWD sources for cancer research are rapidly increasing, and the use of these data requires careful consideration of composition and utility to assess important questions in understanding the use and effectiveness of new therapies.
Subject(s)
Information Storage and Retrieval, Medical Oncology, Electronic Health Records, Humans, Registries, Research Design
ABSTRACT
The National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results (SEER) program is continuously exploring opportunities to augment its already extensive collection of data, enhance the quality of reported cancer information, and contribute to more comprehensive analyses of cancer burden. This manuscript describes a recent linkage of the LexisNexis longitudinal residential history data with 11 SEER registries and provides estimates of the interstate mobility of SEER cancer patients. To identify mobility from one state to another, we used state postal abbreviations to generate state-level residential histories. From these histories, we determined how often cancer patients moved from state to state. The results in this paper provide information on the linkage with LexisNexis data and useful information on state-to-state residential mobility patterns of a large portion of US cancer patients for the most recent 1-, 2-, 3-, 4-, and 5-year periods. We show that mobility patterns vary by geographic area, race/ethnicity, and age, and that cancer patients tend to move less than the general population.
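One way to picture the state-level step is sketched below: given a patient's addresses reduced to state postal abbreviations in chronological order, count how often consecutive entries differ. The function name, the example history, and the oldest-first ordering are hypothetical; the actual linkage and mobility definitions used with the LexisNexis data are not described in the abstract.

```python
def count_state_moves(states):
    """Count changes between consecutive state postal abbreviations."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical residential history, oldest address first.
history = ["GA", "GA", "FL", "FL", "GA"]
moves = count_state_moves(history)
print(f"interstate moves: {moves}")
```

Moves within the same state (the repeated "GA" and "FL" entries here) do not count, so only the GA to FL and FL to GA transitions are tallied.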
Subject(s)
Neoplasms, Humans, United States/epidemiology, Neoplasms/epidemiology, Registries, Population Dynamics, Ethnicity, SEER Program
ABSTRACT
In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
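The contrast between micro and macro F1 that the abstract relies on can be illustrated with a small, self-contained sketch. The labels and predictions below are invented for illustration, and `f1_scores` is a hypothetical helper written from the standard single-label definitions, not code from the study.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts first, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Invented example: nine reports of a common cancer type, one of a rare type,
# and a model that always predicts the common type.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1: {micro:.3f}, macro F1: {macro:.3f}")
```

Because the rare class contributes an F1 of zero to the macro average but only one error to the pooled counts, a model that ignores rare classes can still score well on micro F1, which is why the two metrics can favor different methods.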