Pesquisa | BVS IEC

1.

AI can help people feel heard, but an AI label diminishes this impact.

Yin, Yidan; Jia, Nan; Wakslak, Cheryl J.

Proc Natl Acad Sci U S A ; 121(14): e2319112121, 2024 Apr 02.

Artigo em Inglês | MEDLINE | ID: mdl-38551835

RESUMO

People want to "feel heard" to perceive that they are understood, validated, and valued. Can AI serve the deeply human function of making others feel heard? Our research addresses two fundamental issues: Can AI generate responses that make human recipients feel heard, and how do human recipients react when they believe the response comes from AI? We conducted an experiment and a follow-up study to disentangle the effects of actual source of a message and the presumed source. We found that AI-generated messages made recipients feel more heard than human-generated messages and that AI was better at detecting emotions. However, recipients felt less heard when they realized that a message came from AI (vs. human). Finally, in a follow-up study where the responses were rated by third-party raters, we found that compared with humans, AI demonstrated superior discipline in offering emotional support, a crucial element in making individuals feel heard, while avoiding excessive practical suggestions, which may be less effective in achieving this goal. Our research underscores the potential and limitations of AI in meeting human psychological needs. These findings suggest that while AI demonstrates enhanced capabilities to provide emotional support, the devaluation of AI responses poses a key challenge for effectively leveraging AI's capabilities.

Assuntos

Emoções , Motivação , Humanos , Seguimentos , Emoções/fisiologia

2.

ifDEEPre: large protein language-based deep learning enables interpretable and fast predictions of enzyme commission numbers.

Tan, Qingxiong; Xiao, Jin; Chen, Jiayang; Wang, Yixuan; Zhang, Zeliang; Zhao, Tiancheng; Li, Yu.

Brief Bioinform ; 25(4)2024 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-38942594

RESUMO

Accurate understanding of the biological functions of enzymes is vital for various tasks in both pathologies and industrial biotechnology. However, the existing methods are usually not fast enough and lack explanations on the prediction results, which severely limits their real-world applications. Following our previous work, DEEPre, we propose a new interpretable and fast version (ifDEEPre) by designing novel self-guided attention and incorporating biological knowledge learned via large protein language models to accurately predict the commission numbers of enzymes and confirm their functions. Novel self-guided attention is designed to optimize the unique contributions of representations, automatically detecting key protein motifs to provide meaningful interpretations. Representations learned from raw protein sequences are strictly screened to improve the running speed of the framework, 50 times faster than DEEPre while requiring 12.89 times smaller storage space. Large language modules are incorporated to learn physical properties from hundreds of millions of proteins, extending biological knowledge of the whole network. Extensive experiments indicate that ifDEEPre outperforms all the current methods, achieving more than 14.22% larger F1-score on the NEW dataset. Furthermore, the trained ifDEEPre models accurately capture multi-level protein biological patterns and infer evolutionary trends of enzymes by taking only raw sequences without label information. Meanwhile, ifDEEPre predicts the evolutionary relationships between different yeast sub-species, which are highly consistent with the ground truth. Case studies indicate that ifDEEPre can detect key amino acid motifs, which have important implications for designing novel enzymes. A web server running ifDEEPre is available at https://proj.cse.cuhk.edu.hk/aihlab/ifdeepre/ to provide convenient services to the public. Meanwhile, ifDEEPre is freely available on GitHub at https://github.com/ml4bio/ifDEEPre/.

Assuntos

Aprendizado Profundo , Enzimas , Enzimas/química , Enzimas/metabolismo , Biologia Computacional/métodos , Software , Proteínas/química , Proteínas/metabolismo , Bases de Dados de Proteínas , Algoritmos

3.

ChIP-GPT: a managed large language model for robust data extraction from biomedical database records.

Cinquin, Olivier.

Brief Bioinform ; 25(2)2024 Jan 22.

Artigo em Inglês | MEDLINE | ID: mdl-38314912

RESUMO

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors-a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.

Assuntos

Medicina , Humanos , Linhagem Celular , Imunoprecipitação da Cromatina , Bases de Dados Factuais , Idioma

4.

'Bingo'-a large language model- and graph neural network-based workflow for the prediction of essential genes from protein data.

Ma, Jiani; Song, Jiangning; Young, Neil D; Chang, Bill C H; Korhonen, Pasi K; Campos, Tulio L; Liu, Hui; Gasser, Robin B.

Brief Bioinform ; 25(1)2023 11 22.

Artigo em Inglês | MEDLINE | ID: mdl-38152979

RESUMO

The identification and characterization of essential genes are central to our understanding of the core biological functions in eukaryotic organisms, and has important implications for the treatment of diseases caused by, for example, cancers and pathogens. Given the major constraints in testing the functions of genes of many organisms in the laboratory, due to the absence of in vitro cultures and/or gene perturbation assays for most metazoan species, there has been a need to develop in silico tools for the accurate prediction or inference of essential genes to underpin systems biological investigations. Major advances in machine learning approaches provide unprecedented opportunities to overcome these limitations and accelerate the discovery of essential genes on a genome-wide scale. Here, we developed and evaluated a large language model- and graph neural network (LLM-GNN)-based approach, called 'Bingo', to predict essential protein-coding genes in the metazoan model organisms Caenorhabditis elegans and Drosophila melanogaster as well as in Mus musculus and Homo sapiens (a HepG2 cell line) by integrating LLM and GNNs with adversarial training. Bingo predicts essential genes under two 'zero-shot' scenarios with transfer learning, showing promise to compensate for a lack of high-quality genomic and proteomic data for non-model organisms. In addition, the attention mechanisms and GNNExplainer were employed to manifest the functional sites and structural domain with most contribution to essentiality. In conclusion, Bingo provides the prospect of being able to accurately infer the essential genes of little- or under-studied organisms of interest, and provides a biological explanation for gene essentiality.

Assuntos

Proteínas de Drosophila , Genes Essenciais , Camundongos , Animais , Proteômica , Drosophila melanogaster/genética , Fluxo de Trabalho , Redes Neurais de Computação , Proteínas/genética , Proteínas dos Microfilamentos/genética , Proteínas de Drosophila/genética

5.

Opportunities and challenges for ChatGPT and large language models in biomedicine and health.

Tian, Shubo; Jin, Qiao; Yeganova, Lana; Lai, Po-Ting; Zhu, Qingqing; Chen, Xiuying; Yang, Yifan; Chen, Qingyu; Kim, Won; Comeau, Donald C; Islamaj, Rezarta; Kapoor, Aadit; Gao, Xin; Lu, Zhiyong.

Brief Bioinform ; 25(1)2023 11 22.

Artigo em Inglês | MEDLINE | ID: mdl-38168838

RESUMO

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of biomedical domain presents unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.

Assuntos

Armazenamento e Recuperação da Informação , Idioma , Humanos , Privacidade , Pesquisadores

6.

RDscan: Extracting RNA-disease relationship from the literature based on pre-training model.

Zhang, Yang; Yang, Yu; Ren, Liping; Ning, Lin; Zou, Quan; Luo, Nanchao; Zhang, Yinghui; Liu, Ruijun.

Methods ; 228: 48-54, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-38789016

RESUMO

With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method developed based on the pre-training and fine-tuning strategy, aimed at automatically extracting RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLM). Initially, we constructed a dedicated RD corpus by manually curating from literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences) for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction, and is poised to emerge as an invaluable tool for researchers dedicated to exploring the intricate interactions between RNA and diseases. Webserver of RDscan is free available at https://cellknowledge.com.cn/RDscan/.

Assuntos

Mineração de Dados , RNA , Mineração de Dados/métodos , RNA/genética , Humanos , Aprendizado de Máquina , Doença/genética , Máquina de Vetores de Suporte , Software

7.

Performance of large language models on benign prostatic hyperplasia frequently asked questions.

Zhang, YuNing; Dong, Yijie; Mei, Zihan; Hou, Yiqing; Wei, Minyan; Yeung, Yat Hin; Xu, Jiale; Hua, Qing; Lai, LiMei; Li, Ning; Xia, ShuJun; Zhou, Chun; Zhou, JianQiao.

Prostate ; 84(9): 807-813, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38558009

RESUMO

BACKGROUND: Benign prostatic hyperplasia (BPH) is a common condition, yet it is challenging for the average BPH patient to find credible and accurate information about BPH. Our goal is to evaluate and compare the accuracy and reproducibility of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and the New Bing Chat in responding to a BPH frequently asked questions (FAQs) questionnaire. METHODS: A total of 45 questions related to BPH were categorized into basic and professional knowledge. Three LLM-ChatGPT-3.5, ChatGPT-4, and New Bing Chat-were utilized to generate responses to these questions. Responses were graded as comprehensive, correct but inadequate, mixed with incorrect/outdated data, or completely incorrect. Reproducibility was assessed by generating two responses for each question. All responses were reviewed and judged by experienced urologists. RESULTS: All three LLMs exhibited high accuracy in generating responses to questions, with accuracy rates ranging from 86.7% to 100%. However, there was no statistically significant difference in response accuracy among the three (p > 0.017 for all comparisons). Additionally, the accuracy of the LLMs' responses to the basic knowledge questions was roughly equivalent to that of the specialized knowledge questions, showing a difference of less than 3.5% (GPT-3.5: 90% vs. 86.7%; GPT-4: 96.7% vs. 95.6%; New Bing: 96.7% vs. 93.3%). Furthermore, all three LLMs demonstrated high reproducibility, with rates ranging from 93.3% to 97.8%. CONCLUSIONS: ChatGPT-3.5, ChatGPT-4, and New Bing Chat offer accurate and reproducible responses to BPH-related questions, establishing them as valuable resources for enhancing health literacy and supporting BPH patients in conjunction with healthcare professionals.

Assuntos

Hiperplasia Prostática , Humanos , Hiperplasia Prostática/diagnóstico , Masculino , Reprodutibilidade dos Testes , Inquéritos e Questionários , Idioma , Educação de Pacientes como Assunto/métodos

8.

ChatGPT4 Outperforms Endoscopists for Determination of Postcolonoscopy Rescreening and Surveillance Recommendations.

Chang, Patrick W; Amini, Maziar M; Davis, Rio O; Nguyen, Denis D; Dodge, Jennifer L; Lee, Helen; Sheibani, Sarah; Phan, Jennifer; Buxbaum, James L; Sahakian, Ara B.

Clin Gastroenterol Hepatol ; 2024 May 09.

Artigo em Inglês | MEDLINE | ID: mdl-38729387

RESUMO

BACKGROUND & AIMS: Large language models including Chat Generative Pretrained Transformers version 4 (ChatGPT4) improve access to artificial intelligence, but their impact on the clinical practice of gastroenterology is undefined. This study compared the accuracy, concordance, and reliability of ChatGPT4 colonoscopy recommendations for colorectal cancer rescreening and surveillance with contemporary guidelines and real-world gastroenterology practice. METHODS: History of present illness, colonoscopy data, and pathology reports from patients undergoing procedures at 2 large academic centers were entered into ChatGPT4 and it was queried for the next recommended colonoscopy follow-up interval. Using the McNemar test and inter-rater reliability, we compared the recommendations made by ChatGPT4 with the actual surveillance interval provided in the endoscopist's procedure report (gastroenterology practice) and the appropriate US Multisociety Task Force (USMSTF) guidance. The latter was generated for each case by an expert panel using the clinical information and guideline documents as reference. RESULTS: Text input of de-identified data into ChatGPT4 from 505 consecutive patients undergoing colonoscopy between January 1 and April 30, 2023, elicited a successful follow-up recommendation in 99.2% of the queries. ChatGPT4 recommendations were in closer agreement with the USMSTF Panel (85.7%) than gastroenterology practice recommendations with the USMSTF Panel (75.4%) (P < .001). Of the 14.3% discordant recommendations between ChatGPT4 and the USMSTF Panel, recommendations were for later screening in 26 (5.1%) and for earlier screening in 44 (8.7%) cases. The inter-rater reliability was good for ChatGPT4 vs USMSTF Panel (Fleiss κ, 0.786; 95% CI, 0.734-0.838; P < .001). CONCLUSIONS: Initial real-world results suggest that ChatGPT4 can define routine colonoscopy screening intervals accurately based on verbatim input of clinical data. Large language models have potential for clinical applications, but further training is needed for broad use.

9.

Evaluating the efficacy of few-shot learning for GPT-4Vision in neurodegenerative disease histopathology: A comparative analysis with convolutional neural network model.

Ono, Daisuke; Dickson, Dennis W; Koga, Shunsuke.

Neuropathol Appl Neurobiol ; 50(4): e12997, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-39010256

RESUMO

AIMS: Recent advances in artificial intelligence, particularly with large language models like GPT-4Vision (GPT-4V)-a derivative feature of ChatGPT-have expanded the potential for medical image interpretation. This study evaluates the accuracy of GPT-4V in image classification tasks of histopathological images and compares its performance with a traditional convolutional neural network (CNN). METHODS: We utilised 1520 images, including haematoxylin and eosin staining and tau immunohistochemistry, from patients with various neurodegenerative diseases, such as Alzheimer's disease (AD), progressive supranuclear palsy (PSP) and corticobasal degeneration (CBD). We assessed GPT-4V's performance using multi-step prompts to determine how textual context influences image interpretation. We also employed few-shot learning to enhance improvements in GPT-4V's diagnostic performance in classifying three specific tau lesions-astrocytic plaques, neuritic plaques and tufted astrocytes-and compared the outcomes with the CNN model YOLOv8. RESULTS: GPT-4V accurately recognised staining techniques and tissue origin but struggled with specific lesion identification. The interpretation of images was notably influenced by the provided textual context, which sometimes led to diagnostic inaccuracies. For instance, when presented with images of the motor cortex, the diagnosis shifted inappropriately from AD to CBD or PSP. However, few-shot learning markedly improved GPT-4V's diagnostic capabilities, enhancing accuracy from 40% in zero-shot learning to 90% with 20-shot learning, matching the performance of YOLOv8, which required 100-shot learning to achieve the same accuracy. CONCLUSIONS: Although GPT-4V faces challenges in independently interpreting histopathological images, few-shot learning significantly improves its performance. This approach is especially promising for neuropathology, where acquiring extensive labelled datasets is often challenging.

Assuntos

Redes Neurais de Computação , Doenças Neurodegenerativas , Humanos , Doenças Neurodegenerativas/patologia , Interpretação de Imagem Assistida por Computador/métodos , Doença de Alzheimer/patologia

10.

Using GPT-4 for LI-RADS feature extraction and categorization with multilingual free-text reports.

Gu, Kyowon; Lee, Jeong Hyun; Shin, Jaeseung; Hwang, Jeong Ah; Min, Ji Hye; Jeong, Woo Kyoung; Lee, Min Woo; Song, Kyoung Doo; Bae, Sung Hwan.

Liver Int ; 44(7): 1578-1587, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-38651924

RESUMO

BACKGROUND AND AIMS: The Liver Imaging Reporting and Data System (LI-RADS) offers a standardized approach for imaging hepatocellular carcinoma. However, the diverse styles and structures of radiology reports complicate automatic data extraction. Large language models hold the potential for structured data extraction from free-text reports. Our objective was to evaluate the performance of Generative Pre-trained Transformer (GPT)-4 in extracting LI-RADS features and categories from free-text liver magnetic resonance imaging (MRI) reports. METHODS: Three radiologists generated 160 fictitious free-text liver MRI reports written in Korean and English, simulating real-world practice. Of these, 20 were used for prompt engineering, and 140 formed the internal test cohort. Seventy-two genuine reports, authored by 17 radiologists were collected and de-identified for the external test cohort. LI-RADS features were extracted using GPT-4, with a Python script calculating categories. Accuracies in each test cohort were compared. RESULTS: On the external test, the accuracy for the extraction of major LI-RADS features, which encompass size, nonrim arterial phase hyperenhancement, nonperipheral 'washout', enhancing 'capsule' and threshold growth, ranged from .92 to .99. For the rest of the LI-RADS features, the accuracy ranged from .86 to .97. For the LI-RADS category, the model showed an accuracy of .85 (95% CI: .76, .93). CONCLUSIONS: GPT-4 shows promise in extracting LI-RADS features, yet further refinement of its prompting strategy and advancements in its neural network architecture are crucial for reliable use in processing complex real-world MRI reports.

Assuntos

Neoplasias Hepáticas , Imageamento por Ressonância Magnética , Humanos , Neoplasias Hepáticas/diagnóstico por imagem , Carcinoma Hepatocelular/diagnóstico por imagem , Processamento de Linguagem Natural , Sistemas de Informação em Radiologia , República da Coreia , Mineração de Dados , Fígado/diagnóstico por imagem

11.

Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings.

Tsai, Chung-You; Hsieh, Shang-Ju; Huang, Hung-Hsiang; Deng, Juinn-Horng; Huang, Yi-You; Cheng, Pai-Yu.

World J Urol ; 42(1): 250, 2024 Apr 23.

Artigo em Inglês | MEDLINE | ID: mdl-38652322

RESUMO

PURPOSE: To compare ChatGPT-4 and ChatGPT-3.5's performance on Taiwan urology board examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty management tactics to minimize score penalties from incorrect responses across 12 urology domains. METHODS: 450 multiple-choice questions from TUBE(2020-2022) were presented to two models. Three urologists assessed correctness and consistency of each response. Accuracy quantifies correct answers; consistency assesses logic and coherence in explanations out of total responses, alongside a penalty reduction experiment with prompt variations. Univariate logistic regression was applied for subgroup comparison. RESULTS: ChatGPT-4 showed strengths in urology, achieved an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%, OR = 2.68, 95% CI [2.05-3.52]). It could have passed the TUBE written exams if solely based on accuracy but failed in the final score due to penalties. ChatGPT-4 displayed a declining accuracy trend over time. Variability in accuracy across 12 urological domains was noted, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. The simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's limitations in its inability to accurately self-assess uncertainty and a tendency towards overconfidence, which may hinder medical decision-making. CONCLUSIONS: ChatGPT-4's high accuracy and consistent explanations in urology board examination demonstrate its potential in medical information processing. However, its limitations in self-assessment and overconfidence necessitate caution in its application, especially for inexperienced users. These insights call for ongoing advancements of urology-specific AI tools.

Assuntos

Avaliação Educacional , Urologia , Taiwan , Avaliação Educacional/métodos , Competência Clínica , Humanos , Conselhos de Especialidade Profissional

12.

GPT-4 performance on querying scientific publications: reproducibility, accuracy, and impact of an instruction sheet.

Tao, Kaiming; Osman, Zachary A; Tzou, Philip L; Rhee, Soo-Yon; Ahluwalia, Vineet; Shafer, Robert W.

BMC Med Res Methodol ; 24(1): 139, 2024 Jun 25.

Artigo em Inglês | MEDLINE | ID: mdl-38918736

RESUMO

BACKGROUND: Large language models (LLMs) that can efficiently screen and identify studies meeting specific criteria would streamline literature reviews. Additionally, those capable of extracting data from publications would enhance knowledge discovery by reducing the burden on human reviewers. METHODS: We created an automated pipeline utilizing OpenAI GPT-4 32 K API version "2023-05-15" to evaluate the accuracy of the LLM GPT-4 responses to queries about published papers on HIV drug resistance (HIVDR) with and without an instruction sheet. The instruction sheet contained specialized knowledge designed to assist a person trying to answer questions about an HIVDR paper. We designed 60 questions pertaining to HIVDR and created markdown versions of 60 published HIVDR papers in PubMed. We presented the 60 papers to GPT-4 in four configurations: (1) all 60 questions simultaneously; (2) all 60 questions simultaneously with the instruction sheet; (3) each of the 60 questions individually; and (4) each of the 60 questions individually with the instruction sheet. RESULTS: GPT-4 achieved a mean accuracy of 86.9% - 24.0% higher than when the answers to papers were permuted. The overall recall and precision were 72.5% and 87.4%, respectively. The standard deviation of three replicates for the 60 questions ranged from 0 to 5.3% with a median of 1.2%. The instruction sheet did not significantly increase GPT-4's accuracy, recall, or precision. GPT-4 was more likely to provide false positive answers when the 60 questions were submitted individually compared to when they were submitted together. CONCLUSIONS: GPT-4 reproducibly answered 3600 questions about 60 papers on HIVDR with moderately high accuracy, recall, and precision. The instruction sheet's failure to improve these metrics suggests that more sophisticated approaches are necessary. Either enhanced prompt engineering or finetuning an open-source model could further improve an LLM's ability to answer questions about highly specialized HIVDR papers.

Assuntos

Infecções por HIV , Humanos , Reprodutibilidade dos Testes , Infecções por HIV/tratamento farmacológico , PubMed , Publicações/estatística & dados numéricos , Publicações/normas , Armazenamento e Recuperação da Informação/métodos , Armazenamento e Recuperação da Informação/normas , Software

13.

Methodological insights into ChatGPT's screening performance in systematic reviews.

Issaiy, Mahbod; Ghanaati, Hossein; Kolahi, Shahriar; Shakiba, Madjid; Jalali, Amir Hossein; Zarei, Diana; Kazemian, Sina; Avanaki, Mahsa Alborzi; Firouznia, Kavous.

BMC Med Res Methodol ; 24(1): 78, 2024 Mar 27.

Artigo em Inglês | MEDLINE | ID: mdl-38539117

RESUMO

BACKGROUND: The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformers (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data. METHODS: A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics such as sensitivity, specificity, positive and negative predictive values (PPV and NPV), workload saving, and others were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals. RESULTS: ChatGPT completed the screening process within an hour, while GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., including records if at least one person includes them). It also exhibited remarkably low false negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV compared to human raters. The average Kappa agreement between ChatGPT and other raters was 0.27. CONCLUSIONS: ChatGPT shows promise in automating the article screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.

Assuntos

Benchmarking , Pesquisa Biomédica , Humanos , Revisões Sistemáticas como Assunto , Simulação por Computador , Consenso

14.

De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning.

Ye, Gavin.

J Comput Aided Mol Des ; 38(1): 20, 2024 Apr 22.

Artigo em Inglês | MEDLINE | ID: mdl-38647700

RESUMO

In recent years, generative machine learning algorithms have been successful in designing innovative drug-like molecules. SMILES is a sequence-like language used in most effective drug design models. Due to data's sequential structure, models such as recurrent neural networks and transformers can design pharmacological compounds with optimized efficacy. Large language models have advanced recently, but their implications on drug design have not yet been explored. Although one study successfully pre-trained a large chemistry model (LCM), its application to specific tasks in drug discovery is unknown. In this study, the drug design task is modeled as a causal language modeling problem. Thus, the procedure of reward modeling, supervised fine-tuning, and proximal policy optimization was used to transfer the LCM to drug design, similar to Open AI's ChatGPT and InstructGPT procedures. By combining the SMILES sequence with chemical descriptors, the novel efficacy evaluation model exceeded its performance compared to previous studies. After proximal policy optimization, the drug design model generated molecules with 99.2% having efficacy pIC50 > 7 towards the amyloid precursor protein, with 100% of the generated molecules being valid and novel. This demonstrated the applicability of LCMs in drug discovery, with benefits including less data consumption while fine-tuning. The applicability of LCMs to drug discovery opens the door for larger studies involving reinforcement-learning with human feedback, where chemists provide feedback to LCMs and generate higher-quality molecules. LCMs' ability to design similar molecules from datasets paves the way for more accessible, non-patented alternatives to drug molecules.

Assuntos

Desenho de Fármacos , Humanos , Aprendizado de Máquina , Descoberta de Drogas/métodos , Algoritmos , Redes Neurais de Computação , Modelos Químicos , Aprendizado de Máquina Supervisionado

15.

A large language model-based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records.

Chiang, Chia-Chun; Luo, Man; Dumkrieger, Gina; Trivedi, Shubham; Chen, Yi-Chieh; Chao, Chieh-Ju; Schwedt, Todd J; Sarker, Abeed; Banerjee, Imon.

Headache ; 64(4): 400-409, 2024 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-38525734

RESUMO

OBJECTIVE: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free-text clinical notes. BACKGROUND: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, due to the variations and inconsistencies in documentation by clinicians, significant challenges exist to accurately extract headache frequency from the electronic health record (EHR) by traditional NLP algorithms. METHODS: This was a retrospective cross-sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) model zero-shot, (3) GPT-2 QA model few-shot training fine-tuned on clinical notes, and (4) GPT-2 generative model few-shot training fine-tuned on clinical notes to generate the answer by considering the context of included text. RESULTS: The mean (standard deviation) headache frequency of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT-2 generative model was the best-performing model with an accuracy of 0.92 (0.91, 0.93, 95% confidence interval [CI]) and R2 score of 0.89 (0.87, 0.90, 95% CI), and all GPT-2-based models outperformed the ClinicalBERT model in terms of exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy of 0.27 (0.26, 0.28), it demonstrated a high R2 score of 0.88 (0.85, 0.89), suggesting the ClinicalBERT model can reasonably predict the headache frequency within a range of ≤ ± 3 days, and the R2 score was higher than the GPT-2 QA zero-shot model or GPT-2 QA model few-shot training fine-tuned model. CONCLUSION: We developed a robust information extraction model based on a state-of-the-art large language model, a GPT-2 generative model that can extract headache frequency from EHR free-text clinical notes with high accuracy and R2 score. It overcame several challenges related to different ways clinicians document headache frequency that were not easily achieved by traditional NLP models. We also showed that GPT-2-based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code with open-source license of community use in GitHub. Additional fine-tuning of the algorithm might be required when applied to different health-care systems for various clinical use cases.

Assuntos

Registros Eletrônicos de Saúde , Processamento de Linguagem Natural , Humanos , Estudos Retrospectivos , Estudos Transversais , Masculino , Feminino , Cefaleia , Adulto , Pessoa de Meia-Idade , Algoritmos

16.

Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis.

Wei, Qiuhong; Yao, Zhengxiong; Cui, Ying; Wei, Bo; Jin, Zhezhen; Xu, Ximing.

J Biomed Inform ; 151: 104620, 2024 03.

Artigo em Inglês | MEDLINE | ID: mdl-38462064

RESUMO

OBJECTIVE: Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and provide direction for future research. METHODS: An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data was extracted on general study characteristics, question sources, conversation processes, assessment metrics, and performance of ChatGPT. An evaluation framework for LLM in medical inquiries was proposed by integrating insights from selected literature. This study is registered with PROSPERO, CRD42023456327. RESULTS: A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56 % (95 % CI: 51 %-60 %, I2 = 87 %) in addressing medical queries. However, the studies varied in question resource, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, version of ChatGPT, and inter-rater consistency. CONCLUSION: This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of the study design and insufficient reporting might affect the results' reliability. Our proposed evaluation framework provides insights for the future study design and transparent reporting of LLM in responding to medical questions.

Assuntos

Inteligência Artificial , Comunicação , Bases de Dados Factuais , Reprodutibilidade dos Testes

17.

Chat-ePRO: Development and pilot study of an electronic patient-reported outcomes system based on ChatGPT.

Chen, Zikang; Wang, Qinchuan; Sun, Yaoqian; Cai, Hailing; Lu, Xudong.

J Biomed Inform ; 154: 104651, 2024 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-38703936

RESUMO

OBJECTIVE: Chatbots have the potential to improve user compliance in electronic Patient-Reported Outcome (ePRO) system. Compared to rule-based chatbots, Large Language Model (LLM) offers advantages such as simplifying the development process and increasing conversational flexibility. However, there is currently a lack of practical applications of LLMs in ePRO systems. Therefore, this study utilized ChatGPT to develop the Chat-ePRO system and designed a pilot study to explore the feasibility of building an ePRO system based on LLM. MATERIALS AND METHODS: This study employed prompt engineering and offline knowledge distillation to design a dialogue algorithm and built the Chat-ePRO system on the WeChat Mini Program platform. In order to compare Chat-ePRO with the form-based ePRO and rule-based chatbot ePRO used in previous studies, we conducted a pilot study applying the three ePRO systems sequentially at the Sir Run Run Shaw Hospital to collect patients' PRO data. RESULT: Chat-ePRO is capable of correctly generating conversation based on PRO forms (success rate: 95.7 %) and accurately extracting the PRO data instantaneously from conversation (Macro-F1: 0.95). The majority of subjective evaluations from doctors (>70 %) suggest that Chat-ePRO is able to comprehend questions and consistently generate responses. Pilot study shows that Chat-ePRO demonstrates higher response rate (9/10, 90 %) and longer interaction time (10.86 s/turn) compared to the other two methods. CONCLUSION: Our study demonstrated the feasibility of utilizing algorithms such as prompt engineering to drive LLM in completing ePRO data collection tasks, and validated that the Chat-ePRO system can effectively enhance patient compliance.

Assuntos

Algoritmos , Medidas de Resultados Relatados pelo Paciente , Projetos Piloto , Humanos , Masculino , Feminino , Registros Eletrônicos de Saúde , Pessoa de Meia-Idade , Adulto

18.

Model tuning or prompt Tuning? a study of large language models for clinical concept and relation extraction.

Peng, Cheng; Yang, Xi; Smith, Kaleb E; Yu, Zehao; Chen, Aokun; Bian, Jiang; Wu, Yonghui.

J Biomed Inform ; 153: 104630, 2024 May.

Artigo em Inglês | MEDLINE | ID: mdl-38548007

RESUMO

OBJECTIVE: To develop soft prompt-based learning architecture for large language models (LLMs), examine prompt-tuning using frozen/unfrozen LLMs, and assess their abilities in transfer learning and few-shot learning. METHODS: We developed a soft prompt-based learning architecture and compared 4 strategies including (1) fine-tuning without prompts; (2) hard-prompting with unfrozen LLMs; (3) soft-prompting with unfrozen LLMs; and (4) soft-prompting with frozen LLMs. We evaluated GatorTron, a clinical LLM with up to 8.9 billion parameters, and compared GatorTron with 4 existing transformer models for clinical concept and relation extraction on 2 benchmark datasets for adverse drug events and social determinants of health (SDoH). We evaluated the few-shot learning ability and generalizability for cross-institution applications. RESULTS AND CONCLUSION: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6 â¼ 3.1 % and 1.2 â¼ 2.9 %, respectively; GatorTron-345 M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming other two models by 0.2 â¼ 2 % and 0.6 â¼ 11.7 %, respectively. When LLMs are frozen, small LLMs have a big gap to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen models. Soft prompting with a frozen GatorTron-8.9B model achieved the best performance for cross-institution evaluation. We demonstrate that (1) machines can learn soft prompts better than hard prompts composed by human, (2) frozen LLMs have good few-shot learning ability and generalizability for cross-institution applications, (3) frozen LLMs reduce computing cost to 2.5 â¼ 6 % of previous methods using unfrozen LLMs, and (4) frozen LLMs require large models (e.g., over several billions of parameters) for good performance.

Assuntos

Processamento de Linguagem Natural , Humanos , Aprendizado de Máquina , Mineração de Dados/métodos , Algoritmos , Determinantes Sociais da Saúde , Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos

19.

Identifying social determinants of health from clinical narratives: A study of performance, documentation ratio, and potential bias.

Yu, Zehao; Peng, Cheng; Yang, Xi; Dang, Chong; Adekkanattu, Prakash; Gopal Patra, Braja; Peng, Yifan; Pathak, Jyotishman; Wilson, Debbie L; Chang, Ching-Yuan; Lo-Ciganic, Wei-Hsuan; George, Thomas J; Hogan, William R; Guo, Yi; Bian, Jiang; Wu, Yonghui.

J Biomed Inform ; 153: 104642, 2024 May.

Artigo em Inglês | MEDLINE | ID: mdl-38621641

RESUMO

OBJECTIVE: To develop a natural language processing (NLP) package to extract social determinants of health (SDoH) from clinical narratives, examine the bias among race and gender groups, test the generalizability of extracting SDoH for different disease groups, and examine population-level extraction ratio. METHODS: We developed SDoH corpora using clinical notes identified at the University of Florida (UF) Health. We systematically compared 7 transformer-based large language models (LLMs) and developed an open-source package - SODA (i.e., SOcial DeterminAnts) to facilitate SDoH extraction from clinical narratives. We examined the performance and potential bias of SODA for different race and gender groups, tested the generalizability of SODA using two disease domains including cancer and opioid use, and explored strategies for improvement. We applied SODA to extract 19 categories of SDoH from the breast (n = 7,971), lung (n = 11,804), and colorectal cancer (n = 6,240) cohorts to assess patient-level extraction ratio and examine the differences among race and gender groups. RESULTS: We developed an SDoH corpus using 629 clinical notes of cancer patients with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH, and another cross-disease validation corpus using 200 notes from opioid use patients with 4,342 SDoH concepts/attributes. We compared 7 transformer models and the GatorTron model achieved the best mean average strict/lenient F1 scores of 0.9122 and 0.9367 for SDoH concept extraction and 0.9584 and 0.9593 for linking attributes to SDoH concepts. There is a small performance gap (â¼4%) between Males and Females, but a large performance gap (>16 %) among race groups. The performance dropped when we applied the cancer SDoH model to the opioid cohort; fine-tuning using a smaller opioid SDoH corpus improved the performance. The extraction ratio varied in the three cancer cohorts, in which 10 SDoH could be extracted from over 70 % of cancer patients, but 9 SDoH could be extracted from less than 70 % of cancer patients. Individuals from the White and Black groups have a higher extraction ratio than other minority race groups. CONCLUSIONS: Our SODA package achieved good performance in extracting 19 categories of SDoH from clinical narratives. The SODA package with pre-trained transformer models is available at https://github.com/uf-hobi-informatics-lab/SODA_Docker.

Assuntos

Narração , Processamento de Linguagem Natural , Determinantes Sociais da Saúde , Humanos , Feminino , Masculino , Viés , Registros Eletrônicos de Saúde , Documentação/métodos , Mineração de Dados/métodos

20.

Automated classification of brain MRI reports using fine-tuned large language models.

Kanzawa, Jun; Yasaka, Koichiro; Fujita, Nana; Fujiwara, Shin; Abe, Osamu.

Neuroradiology ; 2024 Jul 12.

Artigo em Inglês | MEDLINE | ID: mdl-38995393

RESUMO

PURPOSE: This study aimed to investigate the efficacy of fine-tuned large language models (LLM) in classifying brain MRI reports into pretreatment, posttreatment, and nontumor cases. METHODS: This retrospective study included 759, 284, and 164 brain MRI reports for training, validation, and test dataset. Radiologists stratified the reports into three groups: nontumor (group 1), posttreatment tumor (group 2), and pretreatment tumor (group 3) cases. A pretrained Bidirectional Encoder Representations from Transformers Japanese model was fine-tuned using the training dataset and evaluated on the validation dataset. The model which demonstrated the highest accuracy on the validation dataset was selected as the final model. Two additional radiologists were involved in classifying reports in the test datasets for the three groups. The model's performance on test dataset was compared to that of two radiologists. RESULTS: The fine-tuned LLM attained an overall accuracy of 0.970 (95% CI: 0.930-0.990). The model's sensitivity for group 1/2/3 was 1.000/0.864/0.978. The model's specificity for group1/2/3 was 0.991/0.993/0.958. No statistically significant differences were found in terms of accuracy, sensitivity, and specificity between the LLM and human readers (p ≥ 0.371). The LLM completed the classification task approximately 20-26-fold faster than the radiologists. The area under the receiver operating characteristic curve for discriminating groups 2 and 3 from group 1 was 0.994 (95% CI: 0.982-1.000) and for discriminating group 3 from groups 1 and 2 was 0.992 (95% CI: 0.982-1.000). CONCLUSION: Fine-tuned LLM demonstrated a comparable performance with radiologists in classifying brain MRI reports, while requiring substantially less time.

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA