Results 1 - 20 of 67
1.
Genomics Inform ; 22(1): 20, 2024 Oct 31.
Article in English | MEDLINE | ID: mdl-39482758

ABSTRACT

The extraction of biological regulation events has been a key focus in the field of biomedical natural language processing (BioNLP). However, existing methods often encounter challenges such as cascading errors in text mining pipelines and limitations in topic coverage from the selected corpus. Fortunately, the emergence of large language models (LLMs) presents a potential solution due to their robust semantic understanding and extensive knowledge base. To explore this potential, our project at the Biomedical Linked Annotation Hackathon 8 (BLAH 8) investigates the feasibility of using LLMs to extract biological regulation events. Our findings, based on the analysis of rice literature, demonstrate the promising performance of LLMs in this task, while also highlighting several concerns that must be addressed in future LLM-based applications in low-resource topics.
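
A minimal sketch of the kind of LLM-based event extraction this entry describes, assuming the OpenAI v1 Python SDK; the prompt wording, JSON schema, and example sentence are illustrative, not the BLAH 8 project's actual pipeline.

```python
# Illustrative LLM-based extraction of regulation events from rice literature.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SENTENCE = "OsMADS1 positively regulates the expression of OsMADS55 in rice florets."

prompt = (
    "Extract biological regulation events from the sentence below. "
    "Return a JSON list of objects with keys: regulator, regulation_type "
    "(positive/negative/unspecified), and target.\n\n"
    f"Sentence: {SENTENCE}"
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output simplifies downstream parsing
)
print(resp.choices[0].message.content)
```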

2.
Article in English | MEDLINE | ID: mdl-39483325

ABSTRACT

Information processing and retrieval in literature are critical for advancing scientific research and knowledge discovery. The inherent multimodality and diverse literature formats, including text, tables, and figures, present significant challenges in literature information retrieval. This paper introduces LitAI, a novel approach that employs readily available generative AI tools to enhance multimodal information retrieval from literature documents. By integrating tools such as optical character recognition (OCR) with generative AI services, LitAI facilitates the retrieval of text, tables, and figures from PDF documents. We have developed specific prompts that leverage in-context learning and prompt engineering within Generative AI to achieve precise information extraction. Our empirical evaluations, conducted on datasets from the ecological and biological sciences, demonstrate the superiority of our approach over several established baselines including Tesseract-OCR and GPT-4. The implementation of LitAI is accessible at https://github.com/ResponsibleAILab/LitAI.
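
The OCR-plus-generative-AI pattern described here can be sketched as follows; this assumes pytesseract and the OpenAI v1 SDK, and the prompt is a guess at the style of instruction rather than LitAI's actual prompts (which are in the linked repository).

```python
# Run Tesseract OCR on a page image, then ask an LLM to recover the table
# structure from the raw OCR text.
import pytesseract
from PIL import Image
from openai import OpenAI

raw_text = pytesseract.image_to_string(Image.open("page_3.png"))

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "The following OCR output contains a table. "
                   "Reconstruct it as CSV, preserving headers:\n\n" + raw_text,
    }],
)
print(resp.choices[0].message.content)
```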

3.
JMIR Form Res ; 8: e51383, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39353189

ABSTRACT

BACKGROUND: Generative artificial intelligence (AI) and large language models, such as OpenAI's ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general-purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis, without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpreting assessment results strictly following credible, published scoring criteria, has not been thoroughly studied. OBJECTIVE: This exploratory study aims to evaluate and optimize ChatGPT's capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models' interpretation and reporting accuracy through iterative prompt optimization. METHODS: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI's processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool's criteria accurately to clinical vignettes. This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. RESULTS: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models' capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive "Yes" or "No" responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. CONCLUSIONS: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research.


Subjects
Delirium, Humans, Delirium/diagnosis, Surveys and Questionnaires, Artificial Intelligence
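
A sketch of the kind of structured prompt the authors describe: binary Yes/No scoring and a mandated tabular response. The item text and wording are placeholders, not the actual Sour Seven Questionnaire.

```python
# Illustrative prompt skeleton for administering a standardized questionnaire
# to a clinical vignette, mirroring the constraints reported in the study.
SYSTEM_PROMPT = """You are assisting with a delirium screening questionnaire.
For each item, answer only "Yes" or "No" based strictly on the vignette.
Report results as a table with columns: Item | Response | Points.
Then report the total score and the recommended action per the published
scoring criteria. Do not infer symptoms that are not described."""

vignette = "The patient's daughter reports he has been drowsy since yesterday..."
items = ["Item 1: Has the patient been more drowsy than usual?"]  # placeholder item

user_prompt = f"Vignette: {vignette}\n\n" + "\n".join(items)
```
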
4.
JMIR Med Educ ; 10: e56128, 2024 Oct 08.
Article in English | MEDLINE | ID: mdl-39378442

ABSTRACT

Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI.


Subjects
Artificial Intelligence, Certification, Educational Measurement, Family Practice, Specialty Boards, Family Practice/education, Humans, Educational Measurement/methods, United States, Clinical Competence/standards
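
The paired comparison reported in this study (a McNemar test on the two versions' answers) can be reproduced with statsmodels; the 2x2 counts below are invented for illustration and are not the study's data.

```python
# McNemar's test compares two classifiers on the same questions by looking
# only at the discordant pairs (questions where exactly one version is right).
from statsmodels.stats.contingency_tables import mcnemar

# Rows: Custom Robot correct/incorrect; columns: Regular version correct/incorrect.
table = [[255, 11],   # both correct / only Custom Robot correct
         [7,   27]]   # only Regular correct / both incorrect
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p={result.pvalue:.3f}")
```
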
5.
Sci Rep ; 14(1): 24202, 2024 Oct 15.
Article in English | MEDLINE | ID: mdl-39406794

ABSTRACT

In the domain of natural language processing, the rise of Large Language Models and Generative AI represents a noteworthy transition, enabling machines to understand and generate text resembling that produced by humans. This research conducts a thorough examination of this transformative technology, with a focus on its influence on machine translation. The study explores the translation landscape between English and Indic languages, including Hindi, Kannada, Malayalam, Tamil, and Telugu. To address this, the Large Language Model BLOOMZ-3b, developed primarily for text generation, is utilized. Multiple prompt engineering techniques for machine translation are explored. The study further examines fine-tuning the BLOOMZ-3b model with a Parameter Efficient Fine-Tuning technique called Low Rank Adaptation, aiming to reduce computational complexity. By combining innovative prompting approaches with fine-tuning of the BLOOMZ-3b model, the work contributes to the continued development of machine translation beyond the traditional boundaries of language processing. In this regard, the research not only sheds light on the intricacy of translation problems but also sets a precedent for optimizing and adapting large language models to diverse languages, advancing Artificial Intelligence and Natural Language Processing at large.
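
A sketch of the Low Rank Adaptation setup this entry describes, using the Hugging Face peft library; the hyperparameters are common defaults, not the paper's reported configuration.

```python
# Wrap BLOOMZ-3b with LoRA adapters so only small low-rank update matrices
# are trained, reducing the computational cost of fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-3b")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-3b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```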

6.
Front Med (Lausanne) ; 11: 1460553, 2024.
Article in English | MEDLINE | ID: mdl-39478827

ABSTRACT

Background: The large language model GPT-4-1106-preview supports a context window of up to 128,000 tokens, which has enhanced its capability to process vast quantities of text. This model can perform efficient and accurate text data mining without the need for retraining, aided by prompt engineering. Method: The research approach includes prompt engineering and text vectorization. In this study, prompt engineering is applied to assist ChatGPT in text mining. Subsequently, the mined results are vectorized and incorporated into a local knowledge base. After cleansing 306 medical papers, data extraction was performed using ChatGPT. Following a validation and filtering process, 241 medical case data entries were obtained, leading to the construction of a local medical knowledge base. Additionally, drawing upon the Langchain framework and utilizing the local knowledge base in conjunction with ChatGPT, we successfully developed a fast and reliable chatbot. This chatbot is capable of providing recommended diagnostic and treatment information for various diseases. Results: The performance of the designed ChatGPT model, which was enhanced by data from the local knowledge base, exceeded that of the original model by 7.90% on a set of medical questions. Conclusion: ChatGPT, assisted by prompt engineering, demonstrates effective data mining capabilities for large-scale medical texts. In the future, we plan to incorporate a richer array of medical case data, expand the scale of the knowledge base, and enhance ChatGPT's performance in the medical field.
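
The retrieval-augmented pattern the authors build with Langchain can be shown in plain Python: embed the knowledge-base entries, retrieve the nearest one for a question, and prepend it to the ChatGPT prompt. Model names and the toy knowledge base are illustrative, not the study's.

```python
# Minimal retrieval-augmented QA over a local case knowledge base.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

cases = ["Case: 54-year-old with chest pain ...", "Case: 8-year-old with rash ..."]
case_vecs = embed(cases)

question = "What work-up is recommended for acute chest pain?"
q_vec = embed([question])[0]
sims = case_vecs @ q_vec / (np.linalg.norm(case_vecs, axis=1) * np.linalg.norm(q_vec))
context = cases[int(np.argmax(sims))]  # top-1 retrieval; real systems take top-k

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Using this case as context:\n{context}\n\nQ: {question}"}],
)
print(resp.choices[0].message.content)
```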

7.
JMIR AI ; 3: e52974, 2024 Oct 15.
Article in English | MEDLINE | ID: mdl-39405108

ABSTRACT

BACKGROUND: Brief message interventions have demonstrated immense promise in health care, yet the development of these messages has suffered from a dearth of transparency and a scarcity of publicly accessible data sets. Moreover, the researcher-driven content creation process has raised resource allocation issues, necessitating a more efficient and transparent approach to content development. OBJECTIVE: This research sets out to address the challenges of content development for SMS interventions by showcasing the use of generative artificial intelligence (AI) as a tool for content creation, transparently explaining the prompt design and content generation process, and providing the largest publicly available data set of brief messages and source code for future replication of our process. METHODS: Leveraging the pretrained large language model GPT-3.5 (OpenAI), we generate a collection of messages in the context of medication adherence for individuals with type 2 diabetes using evidence-derived behavior change techniques identified in a prior systematic review. We create an attributed prompt designed to adhere to content (readability and tone) and SMS (character count and encoder type) standards while encouraging message variability to reflect differences in behavior change techniques. RESULTS: We deliver the most extensive repository of brief messages for a singular health care intervention and the first library of messages crafted with generative AI. In total, our method yields a data set comprising 1150 messages, with 89.91% (n=1034) meeting character length requirements and 80.7% (n=928) meeting readability requirements. Furthermore, our analysis reveals that all messages exhibit diversity comparable to an existing publicly available data set created under the same theoretical framework for a similar setting. CONCLUSIONS: This research provides a novel approach to content creation for health care interventions using state-of-the-art generative AI tools. Future research is needed to assess the generated content for ethical, safety, and research standards, as well as to determine whether the intervention is successful in improving the target behaviors.
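
A sketch of the post-generation checks this entry implies, keeping only messages that fit a single SMS segment and meet a readability target; the thresholds are assumptions, and textstat stands in for whatever readability measure the authors used.

```python
# Filter generated brief messages against SMS and readability standards.
import textstat

def valid_sms(message, max_chars=160, max_grade=8.0):
    fits = len(message) <= max_chars                        # single-segment SMS
    readable = textstat.flesch_kincaid_grade(message) <= max_grade
    return fits and readable

generated = [
    "Taking your medicine at the same time each day makes it easier to remember.",
    "Adherence to your pharmacotherapeutic regimen is strongly correlated with ...",
]
kept = [m for m in generated if valid_sms(m)]
print(f"{len(kept)}/{len(generated)} messages passed")
```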

8.
Article in English | MEDLINE | ID: mdl-39287713

ABSTRACT

PURPOSE: In order to produce a surgical gesture recognition system that can support a wide variety of procedures, either a very large annotated dataset must be acquired, or fitted models must generalize to new labels (so-called zero-shot capability). In this paper we investigate the feasibility of the latter option. METHODS: Leveraging the bridge-prompt framework, we prompt-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can utilize extensive outside video data such as text, and also makes use of label metadata and weakly supervised contrastive losses. RESULTS: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature extractor training schema. CONCLUSION: Bridge-prompt and similar pre-trained, prompt-tuned video encoder models provide strong visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without any task (gesture)-specific retraining, makes them invaluable.
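
The zero-shot mechanism underlying this approach can be illustrated with a plain pre-trained CLIP model from Hugging Face: unseen gesture labels are scored against a video frame without retraining. Bridge-prompt adds learned prompt tuning on top of this; the gesture descriptions and file name here are invented.

```python
# Zero-shot classification of a surgical video frame against text labels.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gestures = ["pushing the needle through tissue", "pulling the suture", "tying a knot"]
frame = Image.open("frame_0412.png")  # one video frame

inputs = processor(text=gestures, images=frame, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(gestures[probs.argmax().item()])  # unseen labels need no retraining
```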

9.
JMIR Infodemiology ; 4: e60678, 2024 Sep 26.
Article in English | MEDLINE | ID: mdl-39326035

ABSTRACT

BACKGROUND: During the COVID-19 pandemic, the rapid spread of misinformation on social media created significant public health challenges. Large language models (LLMs), pretrained on extensive textual data, have shown potential in detecting misinformation, but their performance can be influenced by factors such as prompt engineering (ie, modifying LLM requests to assess changes in output). One form of prompt engineering is role-playing, where, upon request, OpenAI's ChatGPT imitates specific social roles or identities. This research examines how ChatGPT's accuracy in detecting COVID-19-related misinformation is affected when it is assigned social identities in the request prompt. Understanding how LLMs respond to different identity cues can inform messaging campaigns, ensuring effective use in public health communications. OBJECTIVE: This study investigates the impact of role-playing prompts on ChatGPT's accuracy in detecting misinformation. This study also assesses differences in performance when misinformation is explicitly stated versus implied, based on contextual knowledge, and examines the reasoning given by ChatGPT for classification decisions. METHODS: Overall, 36 real-world tweets about COVID-19 collected in September 2021 were categorized into misinformation, sentiment (opinions aligned vs unaligned with public health guidelines), corrections, and neutral reporting. ChatGPT was tested with prompts incorporating different combinations of multiple social identities (ie, political beliefs, education levels, locality, religiosity, and personality traits), resulting in 51,840 runs. Two control conditions were used to compare results: prompts with no identities and those including only political identity. RESULTS: The findings reveal that including social identities in prompts reduces average detection accuracy, with a notable drop from 68.1% (SD 41.2%; no identities) to 29.3% (SD 31.6%; all identities included). Prompts with only political identity resulted in the lowest accuracy (19.2%, SD 29.2%). ChatGPT was also able to distinguish sentiments expressing opinions not aligned with public health guidelines from misinformation making declarative statements. There were no consistent differences in performance between explicit and implicit misinformation requiring contextual knowledge. While the findings show that the inclusion of identities decreased detection accuracy, it remains uncertain whether ChatGPT adopts views aligned with social identities: when assigned a conservative identity, ChatGPT identified misinformation with nearly the same accuracy as it did when assigned a liberal identity. While political identity was mentioned most frequently in ChatGPT's explanations for its classification decisions, the rationales for classifications were inconsistent across study conditions, and contradictory explanations were provided in some instances. CONCLUSIONS: These results indicate that ChatGPT's ability to classify misinformation is negatively impacted when role-playing social identities, highlighting the complexity of integrating human biases and perspectives in LLMs. This points to the need for human oversight in the use of LLMs for misinformation detection. Further research is needed to understand how LLMs weigh social identities in prompt-based tasks and explore their application in different cultural contexts.


Subjects
COVID-19, Communication, Role Playing, Social Media, Humans, Pandemics, SARS-CoV-2, Public Health
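
A sketch of how a prompt grid like this study's can be assembled: take the Cartesian product of identity dimensions and prepend each combination to the classification request. The identity values and tweet text are placeholders, not the study's materials.

```python
# Enumerate role-playing prompt variants from identity dimensions.
from itertools import product

politics = ["liberal", "conservative"]
education = ["high-school educated", "college educated"]
locality = ["urban", "rural"]

tweet = "Vitamin X cures COVID-19."  # placeholder tweet text
for pol, edu, loc in product(politics, education, locality):
    prompt = (
        f"You are a {pol}, {edu} person living in a {loc} area. "
        "Classify this tweet as misinformation, opinion, correction, "
        f"or neutral reporting:\n\n{tweet}"
    )
    # send `prompt` to ChatGPT and log the label for each identity combination
```
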
11.
Eur J Obstet Gynecol Reprod Biol ; 302: 238-241, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39326228

ABSTRACT

In line with the digital transformation trend in medical training, students may resort to artificial intelligence (AI) for learning. This study assessed the interaction between obstetrics residents and ChatGPT during clinically oriented summative evaluations related to acute hepatic steatosis of pregnancy, and their self-reported competencies in information technology (IT) and AI. The participants in this semi-qualitative observational study were 14 obstetrics residents from two university hospitals. Students' queries were categorized into three distinct types: third-party enquiries; search-engine-style queries; and GPT-centric prompts. Responses were compared against a standardized answer produced by ChatGPT with a Delphi-developed expert prompt. Data analysis employed descriptive statistics and correlation analysis to explore the relationship between AI/IT skills and response accuracy. The study participants showed moderate IT proficiency but low AI proficiency. Interaction with ChatGPT regarding clinical signs of acute hepatic steatosis gravidarum revealed a preference for third-party questioning, resulting in only 21% accurate responses due to misinterpretation of medical acronyms. No correlation was found between AI response accuracy and the residents' self-assessed IT or AI skills, with most expressing dissatisfaction with their AI training. This study underlines the discrepancy between perceived and actual AI proficiency, highlighted by clinically inaccurate yet plausible AI responses - a manifestation of the 'stochastic parrot' phenomenon. These findings advocate for the inclusion of structured AI literacy programmes in medical education, focusing on prompt engineering. These academic skills are essential to exploit AI's potential in obstetrics and gynaecology. The ultimate aim is to optimize patient care in AI-augmented health care, and prevent misleading and unsafe knowledge acquisition.


Subjects
Artificial Intelligence, Internship and Residency, Obstetrics, Humans, Obstetrics/education, Female, Pregnancy, Clinical Competence, Adult
12.
Clin Imaging ; 115: 110276, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39288636

ABSTRACT

Large Language Models (LLMs) like ChatGPT-4 hold significant promise in medical applications, especially in the field of radiology. While previous studies have shown the promise of ChatGPT-4 in text-based scenarios, its performance on image-based questions remains suboptimal. This study investigates the impact of prompt engineering on ChatGPT-4's accuracy on the 2022 American College of Radiology In-Training Test Questions for Diagnostic Radiology Residents, which include textual and visual questions. Four personas were created, each with unique prompts, and evaluated using ChatGPT-4. Results indicate that encouraging prompts and those disclaiming responsibility led to higher overall accuracy (number of questions answered correctly) compared to other personas. Personas that threaten the LLM with legal action or mounting clinical responsibility not only scored lower, but also refrained from answering questions at a higher rate. These findings highlight the importance of prompt context in optimizing LLM responses and the need for further research to integrate AI responsibly into medical practice.


Subjects
Internship and Residency, Radiology, Radiology/education, Humans, Educational Measurement/methods, Clinical Competence, United States, Education, Medical, Graduate
13.
J Med Internet Res ; 26: e60501, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39255030

ABSTRACT

BACKGROUND: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its ability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and technical language. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in extracting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. METHODS: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data elements were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). RESULTS: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, the PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each key item of prompt engineering-specific information reported across papers and find that many studies neglect to explicitly mention it, posing a challenge for advancing prompt engineering research. CONCLUSIONS: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.


Subjects
Natural Language Processing, Humans, Medical Informatics/methods
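
For readers unfamiliar with the techniques this review surveys, chain-of-thought (its most frequent prompt design technique) simply asks the model to reason step by step before committing to an answer. A hypothetical clinical example, contrasted with a direct prompt:

```python
# Direct prompt: the model answers immediately.
direct_prompt = (
    "Does this discharge note indicate the patient is on anticoagulants? "
    "Answer Yes or No.\n\nNote: ..."
)

# Chain-of-thought prompt: intermediate reasoning steps are requested first.
cot_prompt = (
    "Does this discharge note indicate the patient is on anticoagulants?\n"
    "First, list each medication mentioned in the note. Then state whether "
    "each is an anticoagulant. Finally, answer Yes or No.\n\nNote: ..."
)
```
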
14.
J Plast Reconstr Aesthet Surg ; 98: 158-160, 2024 Sep 05.
Article in English | MEDLINE | ID: mdl-39255523

ABSTRACT

This study assesses ChatGPT's (GPT-3.5) performance on the 2021 ASPS Plastic Surgery In-Service Examination using prompt modifications and Retrieval Augmented Generation (RAG). ChatGPT was instructed to act as a "resident," "attending," or "medical student," and RAG utilized a curated vector database for context. Results showed no significant improvement, with the "resident" prompt yielding the highest accuracy at 54%, and RAG failing to enhance performance, with accuracy remaining at 54.3%. Despite appropriate reasoning when correct, ChatGPT's overall performance fell in the 10th percentile, indicating the need for fine-tuning and more sophisticated approaches to improve AI's utility in complex medical tasks.

15.
J Clin Med ; 13(17)2024 Aug 28.
Article in English | MEDLINE | ID: mdl-39274316

ABSTRACT

Large Language Models (LLMs) have the potential to revolutionize clinical medicine by enhancing healthcare access, diagnosis, surgical planning, and education. However, their utilization requires careful prompt engineering to mitigate challenges like hallucinations and biases. Proper utilization of LLMs involves understanding foundational concepts such as tokenization, embeddings, and attention mechanisms, alongside strategic prompting techniques to ensure accurate outputs. For innovative healthcare solutions, it is essential to maintain ongoing collaboration between AI technology and medical professionals. Ethical considerations, including data security and bias mitigation, are critical to their application. By leveraging LLMs as supplementary resources in research and education, we can enhance learning and support knowledge-based inquiries, ultimately advancing the quality and accessibility of medical care. Continued research and development are necessary to fully realize the potential of LLMs in transforming healthcare.

16.
Res Sq ; 2024 Aug 29.
Article in English | MEDLINE | ID: mdl-39257988

ABSTRACT

Background: The growing demand for genomic testing and limited access to experts necessitate innovative service models. While chatbots have shown promise in supporting genomic services like pre-test counseling, their use in returning positive genetic results, especially using the more recent large language models (LLMs) remains unexplored. Objective: This study reports the prompt engineering process and intrinsic evaluation of the LLM component of a chatbot designed to support returning positive population-wide genomic screening results. Methods: We used a three-step prompt engineering process, including Retrieval-Augmented Generation (RAG) and few-shot techniques to develop an open-response chatbot. This was then evaluated using two hypothetical scenarios, with experts rating its performance using a 5-point Likert scale across eight criteria: tone, clarity, program accuracy, domain accuracy, robustness, efficiency, boundaries, and usability. Results: The chatbot achieved an overall score of 3.88 out of 5 across all criteria and scenarios. The highest ratings were in Tone (4.25), Usability (4.25), and Boundary management (4.0), followed by Efficiency (3.88), Clarity and Robustness (3.81), and Domain Accuracy (3.63). The lowest-rated criterion was Program Accuracy, which scored 3.25. Discussion: The LLM handled open-ended queries and maintained boundaries, while the lower Program Accuracy rating indicates areas for improvement. Future work will focus on refining prompts, expanding evaluations, and exploring optimal hybrid chatbot designs that integrate LLM components with rule-based chatbot components to enhance genomic service delivery.
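
A sketch of the few-shot element of such a prompt: worked example exchanges are placed ahead of the live query so the model imitates their tone and boundary behavior. The content below is invented, not the study's prompt.

```python
# Few-shot message assembly for an open-response genomic-results chatbot.
messages = [
    {"role": "system",
     "content": "You explain positive genomic screening results supportively. "
                "Do not give medical advice; refer clinical questions to a "
                "genetic counselor."},
    # few-shot demonstrations of the desired tone and boundaries
    {"role": "user", "content": "Does this result mean I will get cancer?"},
    {"role": "assistant",
     "content": "No. It means your risk is higher than average, which is why "
                "we recommend discussing a screening plan with your care team."},
    # the live query follows the demonstrations
    {"role": "user", "content": "Should my sister get tested too?"},
]
```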

17.
Nucl Med Mol Imaging ; 58(6): 323-331, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39308492

ABSTRACT

The rapid advancements in natural language processing, particularly with the development of Generative Pre-trained Transformer (GPT) models, have opened up new avenues for researchers across various domains. This review article explores the potential of GPT as a research tool, focusing on the core functionalities, key features, and real-world applications of the GPT-4 model. We delve into the concept of prompt engineering, a crucial technique for effectively utilizing GPT, and provide guidelines for designing optimal prompts. Through case studies, we demonstrate how GPT can be applied at various stages of the research process, including literature review, data analysis, and manuscript preparation. The utilization of GPT is expected to enhance research efficiency, stimulate creative thinking, facilitate interdisciplinary collaboration, and increase the impact of research findings. However, it is essential to view GPT as a complementary tool rather than a substitute for human expertise, keeping in mind its limitations and ethical considerations. As GPT continues to evolve, researchers must develop a deep understanding of this technology and leverage its potential to advance their research endeavors while being mindful of its implications.

18.
J Am Med Inform Assoc ; 31(11): 2660-2667, 2024 Nov 01.
Article in English | MEDLINE | ID: mdl-39178375

ABSTRACT

OBJECTIVES: Patients are increasingly being given direct access to their medical records. However, radiology reports are written for clinicians and typically contain medical jargon, which can be confusing. One solution is for radiologists to provide a "colloquial" version that is accessible to the layperson. Because manually generating these colloquial translations would represent a significant burden for radiologists, a way to automatically produce accurate, accessible patient-facing reports is desired. We propose a novel method to produce colloquial translations of radiology reports by providing specialized prompts to a large language model (LLM). MATERIALS AND METHODS: Our method automatically extracts and defines medical terms and includes their definitions in the LLM prompt. Using our method and a naive strategy, translations were generated at 4 different reading levels for 100 de-identified neuroradiology reports from an academic medical center. Translations were evaluated by a panel of radiologists for accuracy, likability, harm potential, and readability. RESULTS: Our approach translated the Findings and Impression sections at the eighth-grade level with accuracies of 88% and 93%, respectively. Across all grade levels, our approach was 20% more accurate than the baseline method. Overall, translations were more readable than the original reports, as evaluated using standard readability indices. CONCLUSION: We find that our translations at the eighth-grade level strike an optimal balance between accuracy and readability. Notably, this corresponds to nationally recognized recommendations for patient-facing health communication. We believe that using this approach to draft patient-accessible reports will benefit patients without significantly increasing the burden on radiologists.


Subjects
Natural Language Processing, Humans, Electronic Health Records, Radiology Information Systems, Radiology, Comprehension, Terminology as Topic
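
The core mechanism this entry describes, sketched in Python: detect medical terms in the report, look up lay definitions, and inject them into the translation prompt. The two-entry glossary and report text stand in for the authors' automatic term extraction.

```python
# Build a glossary-augmented prompt for colloquial report translation.
GLOSSARY = {
    "infarct": "an area of tissue damaged from loss of blood supply",
    "midline shift": "the brain being pushed off-center by pressure",
}

report = "Findings: Acute infarct in the left MCA territory with midline shift."

found = {t: d for t, d in GLOSSARY.items() if t in report.lower()}
definitions = "\n".join(f"- {t}: {d}" for t, d in found.items())

prompt = (
    "Rewrite this radiology report at an eighth-grade reading level.\n"
    f"Use these definitions for medical terms:\n{definitions}\n\nReport: {report}"
)
```
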
19.
Am J Pharm Educ ; 88(10): 101266, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39153573

ABSTRACT

OBJECTIVE: This study aimed to develop a prompt engineering procedure for test question mapping and then determine the effectiveness of test question mapping using Chat Generative Pre-Trained Transformer (ChatGPT) compared to human faculty mapping. METHODS: We conducted a cross-sectional study to compare ChatGPT and human mapping using a sample of 139 test questions from modules within the Integrated Pharmacotherapeutics course series. The test questions were mapped by 3 faculty members to both module objectives and the Accreditation Council for Pharmacy Education Standards 2016 (Standards 2016) to create the "correct answer". Prompt engineering procedures were created to facilitate mapping with ChatGPT, and ChatGPT mapping results were compared with human mapping. RESULTS: ChatGPT mapped test questions directly to the "correct answer" based on human consensus in 68.0% of cases, and the program matched at least one individual human response in another 20.1% of cases, for a total of 88.1% agreement with human mappers. When humans fully agreed on the mapping decision, ChatGPT was more likely to map correctly. CONCLUSION: This study presents a practical use case with prompt engineering tailored for college assessment or curriculum committees to facilitate efficient mapping of test questions to educational outcomes.


Subjects
Curriculum, Education, Pharmacy, Educational Measurement, Cross-Sectional Studies, Humans, Education, Pharmacy/methods, Educational Measurement/methods, Educational Measurement/standards, Students, Pharmacy, Accreditation/standards
20.
J Med Internet Res ; 26: e52758, 2024 Aug 16.
Article in English | MEDLINE | ID: mdl-39151163

ABSTRACT

BACKGROUND: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. OBJECTIVE: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract-screening process for systematic reviews. Our goal is to develop a screening method that maximizes sensitivity for identifying relevant records. METHODS: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were subsequently judged as included. RESULTS: On each layer, both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both screenings by GPT-3.5 and GPT-4 judged all 9 records used for the meta-analysis as included. After accounting for justifiably excluded records by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria. CONCLUSIONS: Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.


Subjects
Artificial Intelligence, Systematic Reviews as Topic, Information Science, Language, Sensitivity and Specificity
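
A sketch of the 3-layer screening logic: each layer asks one focused question, and a record is excluded at the first failed layer. The layer prompts are paraphrases of the study's three criteria (research design, target patients, interventions and controls), not its actual tailored prompts.

```python
# Sequential 3-layer title/abstract screening with early exclusion.
from openai import OpenAI

client = OpenAI()

LAYERS = [
    "Is this record a randomized controlled trial? Answer Yes or No.",
    "Does it enroll patients with bipolar disorder? Answer Yes or No.",
    "Does it compare the target intervention against a control? Answer Yes or No.",
]

def screen(title_and_abstract: str) -> bool:
    for layer in LAYERS:
        resp = client.chat.completions.create(
            model="gpt-4-0125-preview",
            messages=[{"role": "user",
                       "content": f"{layer}\n\n{title_and_abstract}"}],
            temperature=0,
        )
        if not resp.choices[0].message.content.strip().lower().startswith("yes"):
            return False  # excluded at this layer; later layers never run
    return True  # passed all three layers: include
```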