Search | BVS IEC

Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

De Angeli, Kevin; Gao, Shang; Danciu, Ioana; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; Penberthy, Lynne; Tourassi, Georgia D; Yoon, Hong-Jun.

J Biomed Inform ; 125: 103957, 2022 01.

Article in English | MEDLINE | ID: mdl-34823030

ABSTRACT

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.

Subjects

Natural Language Processing , Neoplasms , Electronic Health Records , Humans , Machine Learning , Neural Networks, Computer
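
The abstract above contrasts macro F1 (where rare cancer types count as much as common ones) with micro F1 (dominated by the most frequent classes). A minimal sketch of that distinction follows; the function name `f1_scores` and the toy labels are invented for illustration and are not taken from the paper.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro F1 pools counts over all classes; macro F1 averages per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy data (invented): nine reports of a common class, one of a rare class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10  # a model that never predicts the rare class
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the model looks strong on micro F1 while macro F1 is roughly halved by the missed rare class, which is the kind of gap the abstract attributes to class imbalance.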

Tuberculosis and HIV co-infection, California, USA, 1993-2008.

Metcalfe, John Z; Porco, Travis C; Westenhouse, Janice; Damesyn, Mark; Facer, Matt; Hill, Julia; Xia, Qiang; Watt, James P; Hopewell, Philip C; Flood, Jennifer.

Emerg Infect Dis ; 19(3): 400-6, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23745218

ABSTRACT

To understand the epidemiology of tuberculosis (TB) and HIV co-infection in California, we cross-matched incident TB cases reported to state surveillance systems during 1993-2008 with cases in the state HIV/AIDS registry. Of 57,527 TB case-patients, 3,904 (7%) had known HIV infection. TB rates for persons with HIV declined from 437 to 126 cases/100,000 persons during 1993-2008; rates were highest for Hispanics (225/100,000) and Blacks (148/100,000). Patients co-infected with TB-HIV during 2001-2008 were significantly more likely than those infected before highly active antiretroviral therapy became available to be foreign born, Hispanic, or Asian/Pacific Islander and to have pyrazinamide-monoresistant TB. Death rates decreased after highly active antiretroviral therapy became available but remained twice that for TB patients without HIV infection and higher for women. In California, HIV-associated TB has concentrated among persons from low- and middle-income countries who often acquire HIV infection in the peri-immigration period.

Subjects

Coinfection/epidemiology , HIV Infections/epidemiology , Tuberculosis, Pulmonary/epidemiology , Adolescent , Adult , Age Distribution , Antiretroviral Therapy, Highly Active , California/epidemiology , Coinfection/drug therapy , Epidemiological Monitoring , Female , HIV Infections/drug therapy , Humans , Incidence , Male , Middle Aged , Prevalence , Retrospective Studies , Young Adult

Optimal vocabulary selection approaches for privacy-preserving deep NLP model training for information extraction and cancer epidemiology.

Yoon, Hong-Jun; Stanley, Christopher; Christian, J Blair; Klasky, Hilda B; Blanchard, Andrew E; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen M; Wiggins, Charles; Damesyn, Mark; Coyle, Linda; Tourassi, Georgia D.

Cancer Biomark ; 33(2): 185-198, 2022.

Article in English | MEDLINE | ID: mdl-35213361

ABSTRACT

BACKGROUND: With the use of artificial intelligence and machine learning techniques for biomedical informatics, security and privacy concerns over the data and subject identities have also become an important issue and essential research topic. Without intentional safeguards, machine learning models may find patterns and features to improve task performance that are associated with private personal information. OBJECTIVE: The privacy vulnerability of deep learning models for information extraction from medical textual content needs to be quantified since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of the deep learning models for natural language processing and explore a proper way of securing patients' information to mitigate confidentiality breaches. METHODS: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, where the data for training the model are from multiple state population-based cancer registries. This study proposes the following schemes to collect vocabularies from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments. RESULTS: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.

Subjects

Confidentiality , Deep Learning , Information Storage and Retrieval/methods , Natural Language Processing , Neoplasms/epidemiology , Artificial Intelligence , Deep Learning/standards , Humans , Neoplasms/pathology , Registries
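
Scheme (a) in the abstract above, keeping only words that appear in multiple registries, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `shared_vocabulary`, the whitespace tokenization, the `min_registries` threshold, and the sample reports are all hypothetical.

```python
from collections import defaultdict

def shared_vocabulary(reports_by_registry, min_registries=2):
    """Keep words that occur in reports from at least `min_registries` registries.

    Assumption: simple lowercase whitespace tokenization; the abstract does not
    specify the tokenizer or the exact threshold.
    """
    seen_in = defaultdict(set)  # word -> set of registries where it occurs
    for registry, reports in reports_by_registry.items():
        for text in reports:
            for word in text.lower().split():
                seen_in[word].add(registry)
    return {w for w, regs in seen_in.items() if len(regs) >= min_registries}

# Invented sample reports from two hypothetical registries.
reports = {
    "registry_a": ["invasive ductal carcinoma", "patient smith biopsy"],
    "registry_b": ["ductal carcinoma in situ", "biopsy of left breast"],
}
vocab = shared_vocabulary(reports)
print(sorted(vocab))
```

Words confined to a single registry, such as a surname in one report, fall out of the vocabulary, which reflects the intuition that registry-specific tokens are more likely to carry identifying information.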

Next Generation of Central Cancer Registries.

Wormeli, Paul; Mazreku, Jenna; Pine, Jeremy; Damesyn, Mark.

JCO Clin Cancer Inform ; 5: 288-294, 2021 03.

Article in English | MEDLINE | ID: mdl-33760641

ABSTRACT

For central cancer registries to become a more significant public health resource, they must evolve to capture more timely, accurate, and extensive data. Key stakeholders have called for a faster time to deliver work products, data extensions such as social determinants of health, and more relevant information for cancer control programs at the local level. The proposed model consists of near real-time reporting stages to replace the current time and labor-intensive efforts to populate a complete cancer case abstract on the basis of the 12- and 24-month data submission timelines. The first stage collects a cancer diagnosis minimum data set sufficient to describe population incidence and prevalence, which is then followed by a second stage capturing subsequent case updates and treatment data. A third stage procures targeted information in response to identified research projects' needs. The model also provides for further supplemental reports as may be defined to gather additional data. All stages leverage electronic health records' widespread development and the many emerging standards for data content, including national policies related to healthcare and technical standards for interoperability, such as the Fast Healthcare Interoperability Resources specifications to automate and accelerate reporting to central cancer registries. The emergence of application programming interfaces that allow for more interoperability among systems would be leveraged, leading to more efficient information sharing. Adopting this model will expedite cancer data availability to improve cancer control while supporting data integrity and flexibility in data items. It presents a long-term and feasible solution that addresses the extensive burden and unsustainable manual data collection requirements placed on Certified Tumor Registrars at disease reporting entities nationally.

Subjects

Data Management , Neoplasms , Data Collection , Electronic Health Records , Humans , Neoplasms/diagnosis , Neoplasms/epidemiology , Neoplasms/therapy , Registries

Infant Mortality: Development of a Proposed Update to the Dollfus Classification of Infant Deaths.

Nakamura, Ann M; Dove, Melanie S; Minnal, Archana; Damesyn, Mark; Curtis, Michael P.

Public Health Rep ; 130(6): 632-42, 2015.

Article in English | MEDLINE | ID: mdl-26556935

ABSTRACT

OBJECTIVE: Identifying infant deaths with common underlying causes and potential intervention points is critical to infant mortality surveillance and the development of prevention strategies. We constructed an International Classification of Diseases 10th Revision (ICD-10) parallel to the Dollfus cause-of-death classification scheme first published in 1990, which organized infant deaths by etiology and their amenability to prevention efforts. METHODS: Infant death records for 1996, dual-coded to the ICD Ninth Revision (ICD-9) and ICD-10, were obtained from the CDC public-use multiple-cause-of-death file on comparability between ICD-9 and ICD-10. We used the underlying cause of death to group 27,821 infant deaths into the nine categories of the ICD-9-based update to Dollfus' original coding scheme, published by Sowards in 1999. Comparability ratios were computed to measure concordance between ICD versions. RESULTS: The Dollfus classification system updated with ICD-10 codes had limited agreement with the 1999 modified classification system. Although prematurity, congenital malformations, Sudden Infant Death Syndrome, and obstetric conditions were the first through fourth most common causes of infant death under both systems, most comparability ratios were significantly different from one system to the other. CONCLUSION: The Dollfus classification system can be adapted for use with ICD-10 codes to create a comprehensive, etiology-based profile of infant deaths. The potential benefits of using Dollfus logic to guide perinatal mortality reduction strategies, particularly to maternal and child health programs and other initiatives focused on improving infant health, warrant further examination of this method's use in perinatal mortality surveillance.

Subjects

Clinical Coding , Infant Death/etiology , International Classification of Diseases , Cause of Death , Humans , Infant

Estimated HIV Incidence in California, 2006-2009.

Scheer, Susan; Nakelsky, Shoshanna; Bingham, Trista; Damesyn, Mark; Sun, Dan; Chin, Chi-Sheng; Buckman, Anthony; Mark, Karen E.

PLoS One ; 8(2): e55002, 2013.

Article in English | MEDLINE | ID: mdl-23405106

ABSTRACT

INTRODUCTION: Accurate estimates of HIV incidence are crucial for prioritizing, targeting, and evaluating HIV prevention efforts. Using the methodology the CDC used to estimate national HIV incidence, we estimated HIV incidence in Los Angeles County (LAC), San Francisco (SF), and California's remaining counties. METHODS: We estimated new HIV infections in 2006-2009 among adults and adolescents in LAC, SF and the remaining California counties using the Serologic Testing Algorithm for Recent HIV Seroconversion (STARHS). STARHS methodology uses the BED HIV-1 capture enzyme immunoassay to determine recent HIV infections by testing remnant serum from persons newly diagnosed with HIV. A population-based incidence estimate is calculated using HIV testing data from newly diagnosed cases and imputing for persons unaware of their HIV infection. RESULTS: For years 2007-2009, respectively, we estimated new infections in LAC to be 2426 (95% CI 1871-2982), 1669 (CI 1309-2029) and 1898 (CI 1452-2344) (p<0.01); in SF for 2006-2009, 492 (CI 327-657), 490 (CI 335-646), 458 (CI 342-574) and 367 (CI 261-473) (p = 0.14); and in the remaining California counties in 2008-2009, 2526 (CI 1688-3364) and 2993 (CI 2141-3846) respectively. HIV infection rates among men who have sex with men (MSM) in LAC were 100 times higher than other risk populations; the SF MSM rate was 3 to 18 times higher than other demographic groups. In LAC, incidence rates among African-Americans were twice those of whites and Latinos; persons 40 years or older had lower rates of infection than younger persons. DISCUSSION: We report the first HIV incidence estimates for California, highlighting geographic disparities in HIV incidence and confirming national findings that MSM and African-Americans are disproportionately impacted by HIV. HIV incidence estimates can and should be used to target prevention efforts towards populations at highest risk of acquiring new HIV infections, focusing on geographic, racial and risk group disparities.

Subjects

Acquired Immunodeficiency Syndrome/epidemiology , HIV Infections/epidemiology , HIV-1/isolation & purification , Acquired Immunodeficiency Syndrome/diagnosis , Adolescent , Adult , California/epidemiology , Female , HIV Infections/diagnosis , Homosexuality, Male/statistics & numerical data , Humans , Incidence , Male , Sexual Behavior/psychology , Young Adult

Regional differences among HIV patients in care: California medical monitoring project sites, 2007-2008.

Scheer, Susan; Hughes, Alison J; Tejero, Judith; Damesyn, Mark A; Mark, Karen E; Arguello, Tyler M; Wohl, Amy R.

Open AIDS J ; 6: 188-95, 2012.

Article in English | MEDLINE | ID: mdl-23049669

ABSTRACT

INTRODUCTION: The Medical Monitoring Project (MMP) is a national, multi-site population-based supplemental HIV/AIDS surveillance project of persons receiving HIV/AIDS care. We compared California MMP data by region. Demographic characteristics, medical care experiences, HIV treatment, clinical care outcomes, and need for support services are described. METHODS: HIV-infected patients 18 years or older were randomly selected from medical care facilities. In-person structured interviews from 2007-2008 were used to assess sociodemographic characteristics, self-reported clinical outcomes, and need for supportive services. Pearson chi-squared, Fisher's exact and Kruskal-Wallis p-values were calculated to compare regional differences. RESULTS: Between 2007 and 2008, 899 people were interviewed: 329 (37%) in San Francisco (SF), 333 (37%) in Los Angeles (LA) and 237 (26%) in other California counties. Significant regional sociodemographic differences were found. Care received and clinical outcomes for patients in MMP were positive and few regional differences were identified. HIV case management (36%), mental health counseling (35%), and dental services (29%) were the supportive services patients most frequently needed. Unmet needs for supportive services were low overall. Significant differences by region in needed services and unmet needs were identified. DISCUSSION: The majority of MMP respondents reported standard of care CD4 and viral load monitoring, high treatment use, undetectable HIV viral loads and CD4 counts indicative of good immune function and treatment efficacy. Information from MMP can be used by planning councils, policymakers, and HIV care providers to improve access to care and prevention. Identifying regional differences can facilitate sharing of best practices among health jurisdictions.

Lapsed donors: an untapped resource.

Schreiber, George B; Glynn, Simone A; Damesyn, Mark A; Wright, David J; Tu, Yongling; Dodd, Roger Y; Murphy, Edward L.

Transfusion ; 43(1): 17-24, 2003 Jan.

Article in English | MEDLINE | ID: mdl-12519426

ABSTRACT

BACKGROUND: There is a clear need for methods to recruit and retain donors without compromising blood safety. Although prior studies report lower viral prevalence rates in repeat donors than those in first-time donors, it is unknown if this relationship holds after a lapse of several years between donations. STUDY DESIGN AND METHODS: A total of 6.4 million allogeneic donations collected at five US blood centers from 1991 through 1998 were classified by donation history (first-time vs. repeat) and by length of time between donations (lapsed interval length). The prevalence of HCV, HIV, and HBsAg was compared by donation history and lapsed interval length. The relationship between lapsed interval length and donor demographics was explored. RESULTS: Repeat donors who delayed their return for over 5 years were significantly less likely to test positive for a viral infection than were first-time donors. The likelihood of a positive test result appeared to increase steadily with lapsed interval length for HCV and HBsAg, but not for HIV. Younger, less educated, and nonwhite donors were less likely to return than others. CONCLUSIONS: Recruitment of donors who have not returned for several years could be an effective way to increase the blood supply while preserving blood safety. Understanding the relationship of donor demographics to return behavior is important for recruitment efforts.

Subjects

Blood Donors , Adult , Age Factors , Blood Donors/statistics & numerical data , Blood Donors/supply & distribution , HIV Infections/epidemiology , Hepatitis B Surface Antigens/analysis , Hepatitis C/epidemiology , Humans , Middle Aged , Prevalence , Time Factors

Behavioral and infectious disease risks in young blood donors: implications for recruitment.

Damesyn, Mark A; Glynn, Simone A; Schreiber, George B; Ownby, Helen E; Bethel, James; Fridey, Joy; McMullen, Quentin; Garratty, George; Busch, Michael P.

Transfusion ; 43(11): 1596-603, 2003 Nov.

Article in English | MEDLINE | ID: mdl-14617320

ABSTRACT

BACKGROUND: Recruitment of young donors is critical to expand the donor base and sustain the blood supply. Nevertheless, there is concern that younger blood donors may have a higher risk profile than their older counterparts. STUDY DESIGN AND METHODS: The prevalence of behavioral risks associated with transfusion-transmissible viral infections and the incidence of viral markers were compared between younger and older donors. Behavioral risks included unreported deferrable risks (UDRs) and HIV test seeking estimated from anonymous donor surveys administered in 1993 and 1998. The incidence of HIV, HCV, or HBV was estimated from donors giving at five US blood centers between 1996 and 2000. RESULTS: Donors younger than 25 years of age were significantly more likely to report a UDR or HIV test seeking than those 25 years or older. ORs comparing donors 18 to 19 and 20 to 24 years of age to those 25 years or older were 2.0 (95% CI, 1.5-2.6) and 1.5 (95% CI, 1.2-1.9) for UDR and 4.5 (95% CI, 3.0-6.9) and 5.5 (95% CI, 4.2-7.1) for test seeking, respectively. Although incidence estimates did not significantly differ between age groups, HIV incidence appeared to be highest in 18- to 19-year-old donors, whereas HBV incidence was highest in 20- to 24-year-old donors. CONCLUSIONS: Donors younger than 25 years of age appeared to have a higher behavioral risk profile than older donors. The message not to donate when a behavioral risk is present or for obtaining HIV tests needs to be reinforced in younger donors.

Subjects

Blood Donors/psychology , Communicable Diseases/etiology , Risk-Taking , Adult , Age Distribution , Diagnostic Tests, Routine , Female , HIV Infections/diagnosis , HIV Infections/epidemiology , Hepatitis B/epidemiology , Humans , Incidence , Male , Patient Acceptance of Health Care/statistics & numerical data , Personnel Selection , Risk Assessment
