Search | VHL Regional Portal

1.

Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge.

Holste, Gregory; Zhou, Yiliang; Wang, Song; Jaiswal, Ajay; Lin, Mingquan; Zhuge, Sherry; Yang, Yuzhe; Kim, Dongkyun; Nguyen-Mau, Trong-Hieu; Tran, Minh-Triet; Jeong, Jaehyup; Park, Wongi; Ryu, Jongbin; Hong, Feng; Verma, Arsh; Yamagishi, Yosuke; Kim, Changhyun; Seo, Hyeryeong; Kang, Myungjoo; Celi, Leo Anthony; Lu, Zhiyong; Summers, Ronald M; Shih, George; Wang, Zhangyang; Peng, Yifan.

Med Image Anal ; 97: 103224, 2024 May 31.

Article in English | MEDLINE | ID: mdl-38850624

ABSTRACT

Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed" - there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of long-tailed learning in medical image recognition, few have studied the interaction of label imbalance and label co-occurrence posed by long-tailed, multi-label disease classification. To engage with the research community on this emerging topic, we conducted an open challenge, CXR-LT, on long-tailed, multi-label thorax disease classification from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset of over 350,000 CXRs, each labeled with at least one of 26 clinical findings following a long-tailed distribution. We synthesize common themes of top-performing solutions, providing practical recommendations for long-tailed, multi-label medical image classification. Finally, we use these insights to propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.

2.

Error-compensation network for ringing artifact reduction in holographic displays.

Yuan, Ganzhangqin; Zhou, Mi; Peng, Yifan; Chen, Muku; Geng, Zihan.

Opt Lett ; 49(11): 3210-3213, 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38824365

ABSTRACT

Recent advances in learning-based computer-generated holography (CGH) have unlocked novel possibilities for crafting phase-only holograms. However, existing approaches primarily focus on the learning ability of network modules, often neglecting the impact of diffraction propagation models. The resulting ringing artifacts, emanating from the Gibbs phenomenon in the propagation model, can degrade the quality of reconstructed holographic images. To this end, we explore a diffraction propagation error-compensation network that can be easily integrated into existing CGH methods. This network is designed to correct propagation errors by predicting residual values, thereby aligning the diffraction process closely with an ideal state and easing the learning burden of the network. Simulations and optical experiments demonstrate that our method, when applied to state-of-the-art HoloNet and CCNN, achieves PSNRs of up to 32.47 dB and 29.53 dB, respectively, surpassing baseline methods by 3.89 dB and 0.62 dB. Additionally, real-world experiments have confirmed a significant reduction in ringing artifacts. We envision this approach being applied to a variety of CGH algorithms, paving the way for improved holographic displays.

3.

Deep learning with noisy labels in medical prediction problems: a scoping review.

Wei, Yishu; Deng, Yu; Sun, Cong; Lin, Mingquan; Jiang, Hongmei; Peng, Yifan.

J Am Med Inform Assoc ; 31(7): 1596-1607, 2024 Jun 20.

Article in English | MEDLINE | ID: mdl-38814164

ABSTRACT

OBJECTIVES: Medical research faces substantial challenges from noisy labels attributed to factors like inter-expert variability and machine-extracted labels. Despite this, the adoption of label noise management remains limited, and label noise is largely ignored. To this end, there is a critical need to conduct a scoping review focusing on the problem space. This scoping review aims to comprehensively review label noise management in deep learning-based medical prediction problems, which includes label noise detection, label noise handling, and evaluation. Research involving label uncertainty is also included. METHODS: Our scoping review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We searched 4 databases, including PubMed, IEEE Xplore, Google Scholar, and Semantic Scholar. Our search terms include "noisy label AND medical/healthcare/clinical," "uncertainty AND medical/healthcare/clinical," and "noise AND medical/healthcare/clinical." RESULTS: A total of 60 papers met inclusion criteria between 2016 and 2023. A series of practical questions in medical research are investigated. These include the sources of label noise, the impact of label noise, the detection of label noise, label noise handling techniques, and their evaluation. Categorization of both label noise detection methods and handling techniques are provided. DISCUSSION: From a methodological perspective, we observe that the medical community has been up to date with the broader deep-learning community, given that most techniques have been evaluated on medical data. We recommend considering label noise as a standard element in medical research, even if it is not dedicated to handling noisy labels. Initial experiments can start with easy-to-implement methods, such as noise-robust loss functions, weighting, and curriculum learning.

Subject(s)

Deep Learning , Humans , Biomedical Research

4.

Erratum for: Evaluating GPT-4V (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs.

Zhou, Yiliang; Ong, Hanley; Kennedy, Patrick; Wu, Carol C; Kazam, Jacob; Hentel, Keith; Flanders, Adam; Shih, George; Peng, Yifan.

Radiology ; 311(2): e249016, 2024 May.

Article in English | MEDLINE | ID: mdl-38805735

5.

Evaluating GPT-V4 (GPT-4 with Vision) on Detection of Radiologic Findings on Chest Radiographs.

Zhou, Yiliang; Ong, Hanley; Kennedy, Patrick; Wu, Carol C; Kazam, Jacob; Hentel, Keith; Flanders, Adam; Shih, George; Peng, Yifan.

Radiology ; 311(2): e233270, 2024 May.

Article in English | MEDLINE | ID: mdl-38713028

ABSTRACT

Background Generating radiologic findings from chest radiographs is pivotal in medical image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with vision (GPT-4V), has opened new perspectives on the potential for automated image-text pair generation. However, the application of GPT-4V to real-world chest radiography is yet to be thoroughly examined. Purpose To investigate the capability of GPT-4V to generate radiologic findings from real-world chest radiographs. Materials and Methods In this retrospective study, 100 chest radiographs with free-text radiology reports were annotated by a cohort of radiologists, two attending physicians and three residents, to establish a reference standard. Of 100 chest radiographs, 50 were randomly selected from the National Institutes of Health (NIH) chest radiographic data set, and 50 were randomly selected from the Medical Imaging and Data Resource Center (MIDRC). The performance of GPT-4V at detecting imaging findings from each chest radiograph was assessed in the zero-shot setting (where it operates without prior examples) and few-shot setting (where it operates with two examples). Its outcomes were compared with the reference standard with regards to clinical conditions and their corresponding codes in the International Statistical Classification of Diseases, Tenth Revision (ICD-10), including the anatomic location (hereafter, laterality). Results In the zero-shot setting, in the task of detecting ICD-10 codes alone, GPT-4V attained an average positive predictive value (PPV) of 12.3%, average true-positive rate (TPR) of 5.8%, and average F1 score of 7.3% on the NIH data set, and an average PPV of 25.0%, average TPR of 16.8%, and average F1 score of 18.2% on the MIDRC data set. When both the ICD-10 codes and their corresponding laterality were considered, GPT-4V produced an average PPV of 7.8%, average TPR of 3.5%, and average F1 score of 4.5% on the NIH data set, and an average PPV of 10.9%, average TPR of 4.9%, and average F1 score of 6.4% on the MIDRC data set. With few-shot learning, GPT-4V showed improved performance on both data sets. When contrasting zero-shot and few-shot learning, there were improved average TPRs and F1 scores in the few-shot setting, but there was not a substantial increase in the average PPV. Conclusion Although GPT-4V has shown promise in understanding natural images, it had limited effectiveness in interpreting real-world chest radiographs. © RSNA, 2024 Supplemental material is available for this article.

Subject(s)

Radiography, Thoracic , Humans , Radiography, Thoracic/methods , Retrospective Studies , Female , Male , Middle Aged , Radiographic Image Interpretation, Computer-Assisted/methods , Aged , Adult

6.

Full-colour 3D holographic augmented-reality displays with metasurface waveguides.

Gopakumar, Manu; Lee, Gun-Yeal; Choi, Suyeon; Chao, Brian; Peng, Yifan; Kim, Jonghyun; Wetzstein, Gordon.

Nature ; 629(8013): 791-797, 2024 May.

Article in English | MEDLINE | ID: mdl-38720077

ABSTRACT

Emerging spatial computing systems seamlessly superimpose digital information on the physical environment observed by a user, enabling transformative experiences across various domains, such as entertainment, education, communication and training1-3. However, the widespread adoption of augmented-reality (AR) displays has been limited due to the bulky projection optics of their light engines and their inability to accurately portray three-dimensional (3D) depth cues for virtual content, among other factors4,5. Here we introduce a holographic AR system that overcomes these challenges using a unique combination of inverse-designed full-colour metasurface gratings, a compact dispersion-compensating waveguide geometry and artificial-intelligence-driven holography algorithms. These elements are co-designed to eliminate the need for bulky collimation optics between the spatial light modulator and the waveguide and to present vibrant, full-colour, 3D AR content in a compact device form factor. To deliver unprecedented visual quality with our prototype, we develop an innovative image formation model that combines a physically accurate waveguide model with learned components that are automatically calibrated using camera feedback. Our unique co-design of a nanophotonic metasurface waveguide and artificial-intelligence-driven holographic algorithms represents a significant advancement in creating visually compelling 3D AR experiences in a compact wearable device.

7.

Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness.

Zhang, Gongbo; Jin, Qiao; Jered McInerney, Denis; Chen, Yong; Wang, Fei; Cole, Curtis L; Yang, Qian; Wang, Yanshan; Malin, Bradley A; Peleg, Mor; Wallace, Byron C; Lu, Zhiyong; Weng, Chunhua; Peng, Yifan.

J Biomed Inform ; 153: 104640, 2024 May.

Article in English | MEDLINE | ID: mdl-38608915

ABSTRACT

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.

Subject(s)

Artificial Intelligence , Evidence-Based Medicine , Humans , Trust , Natural Language Processing

8.

A survey of recent methods for addressing AI fairness and bias in biomedicine.

Yang, Yifan; Lin, Mingquan; Zhao, Han; Peng, Yifan; Huang, Furong; Lu, Zhiyong.

J Biomed Inform ; 154: 104646, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38677633

ABSTRACT

OBJECTIVES: Artificial intelligence (AI) systems have the potential to revolutionize clinical practices, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender. Such biases can occur before, during, or after the development of AI models, making it critical to understand and address potential biases to enable the accurate and reliable application of AI models in clinical settings. To mitigate bias concerns during model development, we surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV). Then we discussed the methods, such as data perturbation and adversarial learning, that have been applied in the biomedical domain to address bias. METHODS: We performed our literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then filtered the result of 10,041 articles automatically with loose constraints, and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review. Additional articles in the references are also included in this review. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness. RESULTS: The bias of AIs in biomedicine can originate from multiple sources such as insufficient data, sampling bias and the use of health-irrelevant features or race-adjusted algorithms. Existing debiasing methods that focus on algorithms can be categorized into distributional or algorithmic. Distributional methods include data augmentation, data perturbation, data reweighting methods, and federated learning. Algorithmic approaches include unsupervised representation learning, adversarial learning, disentangled representation learning, loss-based methods and causality-based methods.

Subject(s)

Artificial Intelligence , Bias , Natural Language Processing , Humans , Surveys and Questionnaires , Machine Learning , Algorithms

9.

Identifying social determinants of health from clinical narratives: A study of performance, documentation ratio, and potential bias.

Yu, Zehao; Peng, Cheng; Yang, Xi; Dang, Chong; Adekkanattu, Prakash; Gopal Patra, Braja; Peng, Yifan; Pathak, Jyotishman; Wilson, Debbie L; Chang, Ching-Yuan; Lo-Ciganic, Wei-Hsuan; George, Thomas J; Hogan, William R; Guo, Yi; Bian, Jiang; Wu, Yonghui.

J Biomed Inform ; 153: 104642, 2024 May.

Article in English | MEDLINE | ID: mdl-38621641

ABSTRACT

OBJECTIVE: To develop a natural language processing (NLP) package to extract social determinants of health (SDoH) from clinical narratives, examine the bias among race and gender groups, test the generalizability of extracting SDoH for different disease groups, and examine population-level extraction ratio. METHODS: We developed SDoH corpora using clinical notes identified at the University of Florida (UF) Health. We systematically compared 7 transformer-based large language models (LLMs) and developed an open-source package - SODA (i.e., SOcial DeterminAnts) to facilitate SDoH extraction from clinical narratives. We examined the performance and potential bias of SODA for different race and gender groups, tested the generalizability of SODA using two disease domains including cancer and opioid use, and explored strategies for improvement. We applied SODA to extract 19 categories of SDoH from the breast (n = 7,971), lung (n = 11,804), and colorectal cancer (n = 6,240) cohorts to assess patient-level extraction ratio and examine the differences among race and gender groups. RESULTS: We developed an SDoH corpus using 629 clinical notes of cancer patients with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH, and another cross-disease validation corpus using 200 notes from opioid use patients with 4,342 SDoH concepts/attributes. We compared 7 transformer models and the GatorTron model achieved the best mean average strict/lenient F1 scores of 0.9122 and 0.9367 for SDoH concept extraction and 0.9584 and 0.9593 for linking attributes to SDoH concepts. There is a small performance gap (â¼4%) between Males and Females, but a large performance gap (>16 %) among race groups. The performance dropped when we applied the cancer SDoH model to the opioid cohort; fine-tuning using a smaller opioid SDoH corpus improved the performance. The extraction ratio varied in the three cancer cohorts, in which 10 SDoH could be extracted from over 70 % of cancer patients, but 9 SDoH could be extracted from less than 70 % of cancer patients. Individuals from the White and Black groups have a higher extraction ratio than other minority race groups. CONCLUSIONS: Our SODA package achieved good performance in extracting 19 categories of SDoH from clinical narratives. The SODA package with pre-trained transformer models is available at https://github.com/uf-hobi-informatics-lab/SODA_Docker.

Subject(s)

Narration , Natural Language Processing , Social Determinants of Health , Humans , Female , Male , Bias , Electronic Health Records , Documentation/methods , Data Mining/methods

10.

Retrieval augmented scientific claim verification.

Liu, Hao; Soroush, Ali; Nestor, Jordan G; Park, Elizabeth; Idnay, Betina; Fang, Yilu; Pan, Jane; Liao, Stan; Bernard, Marguerite; Peng, Yifan; Weng, Chunhua.

JAMIA Open ; 7(1): ooae021, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38455840

ABSTRACT

Objective: To automate scientific claim verification using PubMed abstracts. Materials and Methods: We developed CliVER, an end-to-end scientific Claim VERification system that leverages retrieval-augmented techniques to automatically retrieve relevant clinical trial abstracts, extract pertinent sentences, and use the PICO framework to support or refute a scientific claim. We also created an ensemble of three state-of-the-art deep learning models to classify rationale of support, refute, and neutral. We then constructed CoVERt, a new COVID VERification dataset comprising 15 PICO-encoded drug claims accompanied by 96 manually selected and labeled clinical trial abstracts that either support or refute each claim. We used CoVERt and SciFact (a public scientific claim verification dataset) to assess CliVER's performance in predicting labels. Finally, we compared CliVER to clinicians in the verification of 19 claims from 6 disease domains, using 189 648 PubMed abstracts extracted from January 2010 to October 2021. Results: In the evaluation of label prediction accuracy on CoVERt, CliVER achieved a notable F1 score of 0.92, highlighting the efficacy of the retrieval-augmented models. The ensemble model outperforms each individual state-of-the-art model by an absolute increase from 3% to 11% in the F1 score. Moreover, when compared with four clinicians, CliVER achieved a precision of 79.0% for abstract retrieval, 67.4% for sentence selection, and 63.2% for label prediction, respectively. Conclusion: CliVER demonstrates its early potential to automate scientific claim verification using retrieval-augmented strategies to harness the wealth of clinical trial abstracts in PubMed. Future studies are warranted to further test its clinical utility.

11.

Decoding Suicide Decedent Profiles and Signs of Suicidal Intent Using Latent Class Analysis.

Xiao, Yunyu; Bi, Kaiwen; Yip, Paul Siu-Fai; Cerel, Julie; Brown, Timothy T; Peng, Yifan; Pathak, Jyotishman; Mann, J John.

JAMA Psychiatry ; 81(6): 595-605, 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38506817

ABSTRACT

Importance: Suicide rates in the US increased by 35.6% from 2001 to 2021. Given that most individuals die on their first attempt, earlier detection and intervention are crucial. Understanding modifiable risk factors is key to effective prevention strategies. Objective: To identify distinct suicide profiles or classes, associated signs of suicidal intent, and patterns of modifiable risks for targeted prevention efforts. Design, Setting, and Participants: This cross-sectional study used data from the 2003-2020 National Violent Death Reporting System Restricted Access Database for 306â¯800 suicide decedents. Statistical analysis was performed from July 2022 to June 2023. Exposures: Suicide decedent profiles were determined using latent class analyses of available data on suicide circumstances, toxicology, and methods. Main Outcomes and Measures: Disclosure of recent intent, suicide note presence, and known psychotropic usage. Results: Among 306â¯800 suicide decedents (mean [SD] age, 46.3 [18.4] years; 239â¯627 males [78.1%] and 67â¯108 females [21.9%]), 5 profiles or classes were identified. The largest class, class 4 (97â¯175 [31.7%]), predominantly faced physical health challenges, followed by polysubstance problems in class 5 (58â¯803 [19.2%]), and crisis, alcohol-related, and intimate partner problems in class 3 (55â¯367 [18.0%]), mental health problems (class 2, 53â¯928 [17.6%]), and comorbid mental health and substance use disorders (class 1, 41â¯527 [13.5%]). Class 4 had the lowest rates of disclosing suicidal intent (13â¯952 [14.4%]) and leaving a suicide note (24â¯351 [25.1%]). Adjusting for covariates, compared with class 1, class 4 had the highest odds of not disclosing suicide intent (odds ratio [OR], 2.58; 95% CI, 2.51-2.66) and not leaving a suicide note (OR, 1.45; 95% CI, 1.41-1.49). Class 4 also had the lowest rates of all known psychiatric illnesses and psychotropic medications among all suicide profiles. Class 4 had more older adults (23â¯794 were aged 55-70 years [24.5%]; 20â¯100 aged ≥71 years [20.7%]), veterans (22â¯220 [22.9%]), widows (8633 [8.9%]), individuals with less than high school education (15â¯690 [16.1%]), and rural residents (23â¯966 [24.7%]). Conclusions and Relevance: This study identified 5 distinct suicide profiles, highlighting a need for tailored prevention strategies. Improving the detection and treatment of coexisting mental health conditions, substance and alcohol use disorders, and physical illnesses is paramount. The implementation of means restriction strategies plays a vital role in reducing suicide risks across most of the profiles, reinforcing the need for a multifaceted approach to suicide prevention.

Subject(s)

Latent Class Analysis , Humans , Male , Female , Middle Aged , Cross-Sectional Studies , Adult , United States/epidemiology , Suicidal Ideation , Aged , Suicide, Attempted/statistics & numerical data , Suicide, Attempted/psychology , Young Adult , Suicide, Completed/statistics & numerical data , Suicide, Completed/psychology , Risk Factors , Suicide/statistics & numerical data , Suicide/psychology , Adolescent , Substance-Related Disorders/epidemiology , Substance-Related Disorders/psychology

12.

A span-based model for extracting overlapping PICO entities from randomized controlled trial publications.

Zhang, Gongbo; Zhou, Yiliang; Hu, Yan; Xu, Hua; Weng, Chunhua; Peng, Yifan.

J Am Med Inform Assoc ; 31(5): 1163-1171, 2024 Apr 19.

Article in English | MEDLINE | ID: mdl-38471120

ABSTRACT

OBJECTIVES: Extracting PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method, PICOX, to extract overlapping PICO entities. MATERIALS AND METHODS: PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then, it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using 1 of the best-performing baselines, EBM-NLP, and 3 more datasets, ie, PICO-Corpus and randomized controlled trial publications on Alzheimer's Disease (AD) or COVID-19, using entity-level precision, recall, and F1 scores. RESULTS: PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (P âª.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. CONCLUSION: PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.

Subject(s)

Alzheimer Disease , COVID-19 , Humans , Natural Language Processing

13.

A survey of recent methods for addressing AI fairness and bias in biomedicine.

Yang, Yifan; Lin, Mingquan; Zhao, Han; Peng, Yifan; Huang, Furong; Lu, Zhiyong.

ArXiv ; 2024 Feb 13.

Article in English | MEDLINE | ID: mdl-38529077

ABSTRACT

Objectives: Artificial intelligence (AI) systems have the potential to revolutionize clinical practices, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender. Such biases can occur before, during, or after the development of AI models, making it critical to understand and address potential biases to enable the accurate and reliable application of AI models in clinical settings. To mitigate bias concerns during model development, we surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV). Then we discussed the methods, such as data perturbation and adversarial learning, that have been applied in the biomedical domain to address bias. Methods: We performed our literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then filtered the result of 10,041 articles automatically with loose constraints, and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review. Additional articles in the references are also included in this review. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness. Results: The bias of AIs in biomedicine can originate from multiple sources such as insufficient data, sampling bias and the use of health-irrelevant features or race-adjusted algorithms. Existing debiasing methods that focus on algorithms can be categorized into distributional or algorithmic. Distributional methods include data augmentation, data perturbation, data reweighting methods, and federated learning. Algorithmic approaches include unsupervised representation learning, adversarial learning, disentangled representation learning, loss-based methods and causality-based methods.

14.

Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine.

Jin, Qiao; Chen, Fangyuan; Zhou, Yiliang; Xu, Ziyang; Cheung, Justin M; Chen, Robert; Summers, Ronald M; Rousseau, Justin F; Ni, Peiyun; Landsman, Marc J; Baxter, Sally L; Al'Aref, Subhi J; Li, Yijia; Chen, Alex; Brejt, Josef A; Chiang, Michael F; Peng, Yifan; Lu, Zhiyong.

ArXiv ; 2024 Apr 22.

Article in English | MEDLINE | ID: mdl-38410646

ABSTRACT

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparatively to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians incorrectly answer, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.

15.

Pan-cancer transcriptomic data of ABI1 transcript variants and molecular constitutive elements identifies novel cancer metastatic and prognostic biomarkers.

Lin, Tingru; Guo, Jingzhu; Peng, Yifan; Li, Mei; Liu, Yulan; Yu, Xin; Wu, Na; Yu, Weidong.

Cancer Biomark ; 39(1): 49-62, 2024.

Article in English | MEDLINE | ID: mdl-37545215

ABSTRACT

BACKGROUND: Abelson interactor 1 (ABI1) is associated with the metastasis and prognosis of many malignancies. The association between ABI1 transcript spliced variants, their molecular constitutive exons and exon-exon junctions (EEJs) in 14 cancer types and clinical outcomes remains unsolved. OBJECTIVE: To identify novel cancer metastatic and prognostic biomarkers from ABI1 total mRNA, TSVs, and molecular constitutive elements. METHODS: Using data from TCGA and TSVdb database, the standard median of ABI1 total mRNA, TSV, exon, and EEJ expression was used as a cut-off value. Kaplan-Meier analysis, Chi-squared test (X2) and Kendall's tau statistic were used to identify novel metastatic and prognostic biomarkers, and Cox regression analysis was performed to screen and identify independent prognostic factors. RESULTS: A total of 35 ABI1-related factors were found to be closely related to the prognosis of eight candidate cancer types. A total of 14 ABI1 TSVs and molecular constitutive elements were identified as novel metastatic and prognostic biomarkers in four cancer types. A total of 13 ABI1 molecular constitutive elements were identified as independent prognostic biomarkers in six cancer types. CONCLUSIONS: In this study, we identified 14 ABI1-related novel metastatic and prognostic markers and 21 independent prognostic factors in total 8 candidate cancer types.

Subject(s)

Neoplasms , Humans , Prognosis , Neoplasms/genetics , Gene Expression Profiling , Biomarkers , RNA, Messenger/genetics , Biomarkers, Tumor/genetics , Cytoskeletal Proteins/genetics , Adaptor Proteins, Signal Transducing/genetics

16.

Confidence score: a data-driven measure for inclusive systematic reviews considering unpublished preprints.

Tong, Jiayi; Luo, Chongliang; Sun, Yifei; Duan, Rui; Saine, M Elle; Lin, Lifeng; Peng, Yifan; Lu, Yiwen; Batra, Anchita; Pan, Anni; Wang, Olivia; Li, Ruowang; Marks-Anglin, Arielle; Yang, Yuchen; Zuo, Xu; Liu, Yulun; Bian, Jiang; Kimmel, Stephen E; Hamilton, Keith; Cuker, Adam; Hubbard, Rebecca A; Xu, Hua; Chen, Yong.

J Am Med Inform Assoc ; 31(4): 809-819, 2024 Apr 03.

Article in English | MEDLINE | ID: mdl-38065694

ABSTRACT

OBJECTIVES: COVID-19, since its emergence in December 2019, has globally impacted research. Over 360 000 COVID-19-related manuscripts have been published on PubMed and preprint servers like medRxiv and bioRxiv, with preprints comprising about 15% of all manuscripts. Yet, the role and impact of preprints on COVID-19 research and evidence synthesis remain uncertain. MATERIALS AND METHODS: We propose a novel data-driven method for assigning weights to individual preprints in systematic reviews and meta-analyses. This weight termed the "confidence score" is obtained using the survival cure model, also known as the survival mixture model, which takes into account the time elapsed between posting and publication of a preprint, as well as metadata such as the number of first 2-week citations, sample size, and study type. RESULTS: Using 146 preprints on COVID-19 therapeutics posted from the beginning of the pandemic through April 30, 2021, we validated the confidence scores, showing an area under the curve of 0.95 (95% CI, 0.92-0.98). Through a use case on the effectiveness of hydroxychloroquine, we demonstrated how these scores can be incorporated practically into meta-analyses to properly weigh preprints. DISCUSSION: It is important to note that our method does not aim to replace existing measures of study quality but rather serves as a supplementary measure that overcomes some limitations of current approaches. CONCLUSION: Our proposed confidence score has the potential to improve systematic reviews of evidence related to COVID-19 and other clinical conditions by providing a data-driven approach to including unpublished manuscripts.

Subject(s)

COVID-19 , Humans , Systematic Reviews as Topic , Research Design , PubMed , Pandemics

17.

Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge.

Holste, Gregory; Zhou, Yiliang; Wang, Song; Jaiswal, Ajay; Lin, Mingquan; Zhuge, Sherry; Yang, Yuzhe; Kim, Dongkyun; Nguyen-Mau, Trong-Hieu; Tran, Minh-Triet; Jeong, Jaehyup; Park, Wongi; Ryu, Jongbin; Hong, Feng; Verma, Arsh; Yamagishi, Yosuke; Kim, Changhyun; Seo, Hyeryeong; Kang, Myungjoo; Celi, Leo Anthony; Lu, Zhiyong; Summers, Ronald M; Shih, George; Wang, Zhangyang; Peng, Yifan.

ArXiv ; 2024 Apr 01.

Article in English | MEDLINE | ID: mdl-37986726

ABSTRACT

Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed" - there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of long-tailed learning in medical image recognition, few have studied the interaction of label imbalance and label co-occurrence posed by long-tailed, multi-label disease classification. To engage with the research community on this emerging topic, we conducted an open challenge, CXR-LT, on long-tailed, multi-label thorax disease classification from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset of over 350,000 CXRs, each labeled with at least one of 26 clinical findings following a long-tailed distribution. We synthesize common themes of top-performing solutions, providing practical recommendations for long-tailed, multi-label medical image classification. Finally, we use these insights to propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.

18.

Adopting and expanding ethical principles for generative artificial intelligence from military to healthcare.

Oniani, David; Hilsman, Jordan; Peng, Yifan; Poropatich, Ronald K; Pamplin, Jeremy C; Legault, Gary L; Wang, Yanshan.

NPJ Digit Med ; 6(1): 225, 2023 Dec 02.

Article in English | MEDLINE | ID: mdl-38042910

ABSTRACT

In 2020, the U.S. Department of Defense officially disclosed a set of ethical principles to guide the use of Artificial Intelligence (AI) technologies on future battlefields. Despite stark differences, there are core similarities between the military and medical service. Warriors on battlefields often face life-altering circumstances that require quick decision-making. Medical providers experience similar challenges in a rapidly changing healthcare environment, such as in the emergency department or during surgery treating a life-threatening condition. Generative AI, an emerging technology designed to efficiently generate valuable information, holds great promise. As computing power becomes more accessible and the abundance of health data, such as electronic health records, electrocardiograms, and medical images, increases, it is inevitable that healthcare will be revolutionized by this technology. Recently, generative AI has garnered a lot of attention in the medical research community, leading to debates about its application in the healthcare sector, mainly due to concerns about transparency and related issues. Meanwhile, questions around the potential exacerbation of health disparities due to modeling biases have raised notable ethical concerns regarding the use of this technology in healthcare. However, the ethical principles for generative AI in healthcare have been understudied. As a result, there are no clear solutions to address ethical concerns, and decision-makers often neglect to consider the significance of ethical principles before implementing generative AI in clinical practice. In an attempt to address these issues, we explore ethical principles from the military perspective and propose the "GREAT PLEA" ethical principles, namely Governability, Reliability, Equity, Accountability, Traceability, Privacy, Lawfulness, Empathy, and Eutonomy, for generative AI in healthcare. Furthermore, we introduce a framework for adopting and expanding these ethical principles in a practical way that has been useful in the military and can be applied to healthcare for generative AI, based on contrasting their ethical concerns and risks. Ultimately, we aim to proactively address the ethical dilemmas and challenges posed by the integration of generative AI into healthcare practice.

19.

Automated classification of lay health articles using natural language processing: a case study on pregnancy health and postpartum depression.

Patra, Braja Gopal; Sun, Zhaoyi; Cheng, Zilin; Kumar, Praneet Kasi Reddy Jagadeesh; Altammami, Abdullah; Liu, Yiyang; Joly, Rochelle; Jedlicka, Caroline; Delgado, Diana; Pathak, Jyotishman; Peng, Yifan; Zhang, Yiye.

Front Psychiatry ; 14: 1258887, 2023.

Article in English | MEDLINE | ID: mdl-38053538

ABSTRACT

Objective: Evidence suggests that high-quality health education and effective communication within the framework of social support hold significant potential in preventing postpartum depression. Yet, developing trustworthy and engaging health education and communication materials requires extensive expertise and substantial resources. In light of this, we propose an innovative approach that involves leveraging natural language processing (NLP) to classify publicly accessible lay articles based on their relevance and subject matter to pregnancy and mental health. Materials and methods: We manually reviewed online lay articles from credible and medically validated sources to create a gold standard corpus. This manual review process categorized the articles based on their pertinence to pregnancy and related subtopics. To streamline and expand the classification procedure for relevance and topics, we employed advanced NLP models such as Random Forest, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer model (gpt-3.5-turbo). Results: The gold standard corpus included 392 pregnancy-related articles. Our manual review process categorized the reading materials according to lifestyle factors associated with postpartum depression: diet, exercise, mental health, and health literacy. A BERT-based model performed best (F1 = 0.974) in an end-to-end classification of relevance and topics. In a two-step approach, given articles already classified as pregnancy-related, gpt-3.5-turbo performed best (F1 = 0.972) in classifying the above topics. Discussion: Utilizing NLP, we can guide patients to high-quality lay reading materials as cost-effective, readily available health education and communication sources. This approach allows us to scale the information delivery specifically to individuals, enhancing the relevance and impact of the materials provided.

20.

Faithful AI in Medicine: A Systematic Review with Large Language Models and Beyond.

Xie, Qianqian; Schenck, Edward J; Yang, He S; Chen, Yong; Peng, Yifan; Wang, Fei.

Res Sq ; 2023 Dec 04.

Article in English | MEDLINE | ID: mdl-38106170

ABSTRACT

Objective: While artificial intelligence (AI), particularly large language models (LLMs), offers significant potential for medicine, it raises critical concerns due to the possibility of generating factually incorrect information, leading to potential long-term risks and ethical issues. This review aims to provide a comprehensive overview of the faithfulness problem in existing research on AI in healthcare and medicine, with a focus on the analysis of the causes of unfaithful results, evaluation metrics, and mitigation methods. Materials and Methods: Using PRISMA methodology, we sourced 5,061 records from five databases (PubMed, Scopus, IEEE Xplore, ACM Digital Library, Google Scholar) published between January 2018 to March 2023. We removed duplicates and screened records based on exclusion criteria. Results: With 40 leaving articles, we conducted a systematic review of recent developments aimed at optimizing and evaluating factuality across a variety of generative medical AI approaches. These include knowledge-grounded LLMs, text-to-text generation, multimodality-to-text generation, and automatic medical fact-checking tasks. Discussion: Current research investigating the factuality problem in medical AI is in its early stages. There are significant challenges related to data resources, backbone models, mitigation methods, and evaluation metrics. Promising opportunities exist for novel faithful medical AI research involving the adaptation of LLMs and prompt engineering. Conclusion: This comprehensive review highlights the need for further research to address the issues of reliability and factuality in medical AI, serving as both a reference and inspiration for future research into the safe, ethical use of AI in medicine and healthcare.

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL