Results 1 - 20 of 440
1.
Cell ; 175(4): 1045-1058.e16, 2018 Nov 01.
Article in English | MEDLINE | ID: mdl-30388443

ABSTRACT

Protein N-glycosylation is a widespread post-translational modification. The first committed step in this process is catalysed by dolichyl-phosphate N-acetylglucosamine-phosphotransferase DPAGT1 (GPT/E.C. 2.7.8.15). Missense DPAGT1 variants cause congenital myasthenic syndrome and disorders of glycosylation. In addition, naturally-occurring bactericidal nucleoside analogues such as tunicamycin are toxic to eukaryotes due to DPAGT1 inhibition, preventing their clinical use. Our structures of DPAGT1 with the substrate UDP-GlcNAc and tunicamycin reveal substrate binding modes, suggest a mechanism of catalysis, provide an understanding of how mutations modulate activity (thus causing disease) and allow design of non-toxic "lipid-altered" tunicamycins. The structure-tuned activity of these analogues against several bacterial targets allowed the design of potent antibiotics for Mycobacterium tuberculosis, enabling treatment in vitro, in cellulo and in vivo, providing a promising new class of antimicrobial drug.


Subject(s)
Antibiotics, Antitubercular/pharmacology; Congenital Disorders of Glycosylation/metabolism; Enzyme Inhibitors/pharmacology; N-Acetylglucosaminyltransferases/chemistry; Animals; Antibiotics, Antitubercular/chemistry; Binding Sites; Congenital Disorders of Glycosylation/genetics; Enzyme Inhibitors/chemistry; Female; HEK293 Cells; Hep G2 Cells; Humans; Lipid Metabolism; Mice; Molecular Docking Simulation; Mutation; N-Acetylglucosaminyltransferases/antagonists & inhibitors; N-Acetylglucosaminyltransferases/genetics; N-Acetylglucosaminyltransferases/metabolism; Protein Binding; Sf9 Cells; Spodoptera; Tunicamycin/chemistry; Tunicamycin/pharmacology; Uridine Diphosphate Glucuronic Acid/chemistry; Uridine Diphosphate Glucuronic Acid/metabolism
2.
Proc Natl Acad Sci U S A ; 121(34): e2308950121, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39133853

ABSTRACT

The social and behavioral sciences have been increasingly using automated text analysis to measure psychological constructs in text. We explore whether GPT, the large-language model (LLM) underlying the AI chatbot ChatGPT, can be used as a tool for automated psychological text analysis in several languages. Across 15 datasets (n = 47,925 manually annotated tweets and news headlines), we tested whether different versions of GPT (3.5 Turbo, 4, and 4 Turbo) can accurately detect psychological constructs (sentiment, discrete emotions, offensiveness, and moral foundations) across 12 languages. We found that GPT (r = 0.59 to 0.77) performed much better than English-language dictionary analysis (r = 0.20 to 0.30) at detecting psychological constructs as judged by manual annotators. GPT performed nearly as well as, and sometimes better than, several top-performing fine-tuned machine learning models. Moreover, GPT's performance improved across successive versions of the model, particularly for lesser-spoken languages, and became less expensive. Overall, GPT may be superior to many existing methods of automated text analysis, since it achieves relatively high accuracy across many languages, requires no training data, and is easy to use with simple prompts (e.g., "is this text negative?") and little coding experience. We provide sample code and a video tutorial for analyzing text with the GPT application programming interface. We argue that GPT and other LLMs help democratize automated text analysis by making advanced natural language processing capabilities more accessible, and may help facilitate more cross-linguistic research with understudied languages.
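The workflow described above takes very little code to reproduce. Below is a minimal sketch, assuming the OpenAI Python SDK (v1); the model name and prompt wording are illustrative, not the paper's exact setup.

```python
# A minimal sketch of GPT-based text annotation, assuming the OpenAI Python SDK (v1).
# The model name and prompt wording are illustrative, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_negative(text: str) -> str:
    """Ask the model the paper's example question: "is this text negative?"."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,  # deterministic answers for annotation work
        messages=[
            {"role": "system", "content": "You annotate text. Answer with a single word: yes or no."},
            {"role": "user", "content": f"Is this text negative? Text: {text}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(is_negative("The flight was delayed five hours and nobody told us anything."))
```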


Subject(s)
Multilingualism; Humans; Language; Machine Learning; Natural Language Processing; Emotions; Social Media
3.
Brief Bioinform ; 25(2), 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors-a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.
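The pruning-and-iterative-prompting pattern the authors describe can be sketched as follows; the field priorities, prompt template, and `llm` callable are illustrative assumptions, and ChIP-GPT's fine-tuned Llama model and actual prompts are not reproduced here.

```python
# Sketch of record pruning plus one-question-at-a-time prompting (assumed details).

def prune_record(record: dict, max_chars: int = 3000) -> str:
    """Keep the most informative SRA fields first, truncated to fit the context window."""
    priority = ["title", "library_strategy", "sample_attributes", "study_abstract"]
    parts = [f"{field}: {record[field]}" for field in priority if field in record]
    return "\n".join(parts)[:max_chars]

def query(llm, record: dict, questions: list[str]) -> dict:
    """Ask each question separately so answers stay short and easy to parse."""
    context = prune_record(record)
    return {q: llm(f"Record:\n{context}\n\nQuestion: {q}\nAnswer briefly:") for q in questions}

# Usage with a stand-in model callable:
record = {"title": "H3K27ac ChIP-seq in HepG2", "library_strategy": "ChIP-Seq"}
print(query(lambda prompt: "(model output)", record, ["What is the ChIP target?"]))
```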


Subject(s)
Medicine; Humans; Cell Line; Chromatin Immunoprecipitation; Databases, Factual; Language
4.
Brief Bioinform ; 24(4), 2023 Jul 20.
Article in English | MEDLINE | ID: mdl-37204192

ABSTRACT

Accurately predicting the antigen-binding specificity of adaptive immune receptors (AIRs), such as T-cell receptors (TCRs) and B-cell receptors (BCRs), is essential for discovering new immune therapies. However, the diversity of AIR chain sequences limits the accuracy of current prediction methods. This study introduces SC-AIR-BERT, a pre-trained model that learns comprehensive sequence representations of paired AIR chains to improve binding specificity prediction. SC-AIR-BERT first learns the 'language' of AIR sequences through self-supervised pre-training on a large cohort of paired AIR chains from multiple single-cell resources. The model is then fine-tuned with a multilayer perceptron head for binding specificity prediction, employing the K-mer strategy to enhance sequence representation learning. Extensive experiments demonstrate the superior AUC performance of SC-AIR-BERT compared with current methods for TCR- and BCR-binding specificity prediction.
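The K-mer strategy mentioned above splits a receptor sequence into overlapping subwords before tokenization. A minimal sketch, assuming k = 3 and a simple [SEP]-joined paired-chain input; SC-AIR-BERT's real vocabulary and special tokens are not reproduced here.

```python
# Illustrative k-mer tokenization for paired receptor chain sequences (k = 3 assumed).
def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split an amino-acid sequence into overlapping k-mers, e.g. CASSL -> CAS, ASS, SSL."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Paired-chain input: tokenize alpha and beta CDR3 sequences and join with a separator.
alpha, beta = "CAVRDSNYQLIW", "CASSLGQAYEQYF"
tokens = kmer_tokens(alpha) + ["[SEP]"] + kmer_tokens(beta)
print(tokens[:5])  # ['CAV', 'AVR', 'VRD', 'RDS', 'DSN']
```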


Subject(s)
Receptors, Antigen, B-Cell; Receptors, Antigen, T-Cell; Humans; Receptors, Antigen, T-Cell/genetics; Receptors, Antigen, B-Cell/genetics; Neural Networks, Computer; Antibody Specificity
5.
J Infect Dis ; 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39136574

ABSTRACT

BACKGROUND: Surgical site infection (SSI) is a common and costly complication in spinal surgery. Identifying risk factors and preventive strategies is crucial for reducing SSIs. GPT-4 has evolved from a simple text-based tool to a sophisticated multimodal data expert, invaluable for clinicians. This study explored GPT-4's applications in SSI management across various clinical scenarios. METHODS: GPT-4 was employed in various clinical scenarios related to SSIs in spinal surgery. Researchers designed specific questions for GPT-4 to generate tailored responses. Six evaluators assessed these responses for logic and accuracy using a 5-point Likert scale. Inter-rater consistency was measured with Fleiss' kappa, and radar charts visualized GPT-4's performance. RESULTS: The inter-rater consistency, measured by Fleiss' kappa, ranged from 0.62 to 0.83. The overall average scores for logic and accuracy were 24.27±0.4 and 24.46±0.25 on a 5-point Likert scale. Radar charts showed GPT-4's consistently high performance across various criteria. GPT-4 demonstrated high proficiency in creating personalized treatment plans tailored to diverse clinical patient records and offered interactive patient education. It significantly improved SSI management strategies and infection prediction models, and identified emerging research trends. However, it had limitations in fine-tuning antibiotic treatments and customizing patient education materials. CONCLUSIONS: GPT-4 represents a significant advancement in managing SSIs in spinal surgery, promoting patient-centered care and precision medicine. Despite some limitations in antibiotic customization and patient education, GPT-4's continuous learning, attention to data privacy and security, collaboration with healthcare professionals, and patient acceptance of AI recommendations suggest its potential to revolutionize SSI management, requiring further development and clinical integration.
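For readers who want to reproduce the agreement statistic, Fleiss' kappa is available in statsmodels; the sketch below uses made-up Likert ratings (rows = responses, columns = six evaluators), not the study's data.

```python
# Sketch of the inter-rater agreement calculation with statsmodels' Fleiss' kappa.
# The ratings are invented Likert scores for illustration only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [5, 4, 5, 5, 4, 5],
    [3, 3, 4, 3, 3, 3],
    [4, 4, 4, 5, 4, 4],
])
counts, _ = aggregate_raters(ratings)  # per-response counts for each rating category
print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")
```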

6.
BMC Bioinformatics ; 25(1): 225, 2024 Jun 26.
Article in English | MEDLINE | ID: mdl-38926641

ABSTRACT

PURPOSE: Large Language Models (LLMs) like Generative Pre-trained Transformer (GPT) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs can also decode SMILES strings into vector representations. METHOD: We investigate the performance of GPT and LLaMA, compared with models pre-trained on SMILES, at embedding SMILES strings for downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction. RESULTS: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to models pre-trained on SMILES in molecular property prediction tasks and outperform them for DDI prediction. CONCLUSION: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT
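As a rough illustration of the embed-then-classify pipeline, the sketch below embeds SMILES strings with an OpenAI embedding endpoint and fits a scikit-learn classifier; the embedding model, molecules, and labels are placeholders, and the study's actual GPT/LLaMA embedding extraction differs.

```python
# Sketch: embed SMILES strings with an LLM embedding endpoint, then fit a classifier.
# Model name, molecules, and toy labels are assumptions, not the study's pipeline.
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()

def embed_smiles(smiles: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=smiles)
    return resp.data[0].embedding

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
labels = [0, 1, 1, 0]  # toy molecular-property labels
X = [embed_smiles(s) for s in smiles]
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([embed_smiles("CCOC(=O)C")]))
```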


Subject(s)
Cheminformatics; Cheminformatics/methods; Drug Interactions; Molecular Structure
7.
Br J Haematol ; 204(4): 1523-1528, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38070128

ABSTRACT

In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. The evaluation covered not only Generative Pre-trained Transformer 4 (GPT-4) but also other artificial intelligence models: PaLM 2 and Llama-2. Using detailed haematological histories that include clinical, molecular, and donor data, we conducted a triple-blind survey to compare LLMs to haematology residents. We found that residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and limitations in deciphering complex haematological clinical scenarios.


Subject(s)
Artificial Intelligence; Hematopoietic Stem Cell Transplantation; Humans; Language; Tissue Donors
8.
Neuropathol Appl Neurobiol ; 50(4): e12997, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39010256

ABSTRACT

AIMS: Recent advances in artificial intelligence, particularly with large language models like GPT-4Vision (GPT-4V), a derivative feature of ChatGPT, have expanded the potential for medical image interpretation. This study evaluates the accuracy of GPT-4V in image classification tasks of histopathological images and compares its performance with a traditional convolutional neural network (CNN). METHODS: We utilised 1520 images, including haematoxylin and eosin staining and tau immunohistochemistry, from patients with various neurodegenerative diseases, such as Alzheimer's disease (AD), progressive supranuclear palsy (PSP) and corticobasal degeneration (CBD). We assessed GPT-4V's performance using multi-step prompts to determine how textual context influences image interpretation. We also employed few-shot learning to improve GPT-4V's diagnostic performance in classifying three specific tau lesions (astrocytic plaques, neuritic plaques and tufted astrocytes) and compared the outcomes with the CNN model YOLOv8. RESULTS: GPT-4V accurately recognised staining techniques and tissue origin but struggled with specific lesion identification. The interpretation of images was notably influenced by the provided textual context, which sometimes led to diagnostic inaccuracies. For instance, when presented with images of the motor cortex, the diagnosis shifted inappropriately from AD to CBD or PSP. However, few-shot learning markedly improved GPT-4V's diagnostic capabilities, enhancing accuracy from 40% in zero-shot learning to 90% with 20-shot learning, matching the performance of YOLOv8, which required 100-shot learning to achieve the same accuracy. CONCLUSIONS: Although GPT-4V faces challenges in independently interpreting histopathological images, few-shot learning significantly improves its performance. This approach is especially promising for neuropathology, where acquiring extensive labelled datasets is often challenging.
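A few-shot prompt for a vision-capable chat model can be assembled by interleaving labelled example images with the query image. The sketch below follows the OpenAI multimodal message format; file paths, labels, and the model string are placeholders rather than the study's setup.

```python
# Sketch of a few-shot image-classification prompt for a vision-capable chat model.
# File paths, label names, and the model string are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data URL content part."""
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

content = [
    {"type": "text", "text": "Example of a tufted astrocyte:"}, image_part("shot_tufted.jpg"),
    {"type": "text", "text": "Example of an astrocytic plaque:"}, image_part("shot_plaque.jpg"),
    {"type": "text", "text": "Classify this lesion (tufted astrocyte, astrocytic plaque, or neuritic plaque):"},
    image_part("query.jpg"),
]

reply = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": content}])
print(reply.choices[0].message.content)
```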


Subject(s)
Neural Networks, Computer; Neurodegenerative Diseases; Humans; Neurodegenerative Diseases/pathology; Image Interpretation, Computer-Assisted/methods; Alzheimer Disease/pathology
9.
Ann Surg Oncol ; 31(6): 3887-3893, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38472675

ABSTRACT

BACKGROUND: The rise of artificial intelligence (AI) in medicine has revealed the potential of ChatGPT as a pivotal tool in medical diagnosis and treatment. This study assesses the efficacy of ChatGPT versions 3.5 and 4.0 in addressing renal cell carcinoma (RCC) clinical inquiries. Notably, fine-tuning and iterative optimization of the model corrected ChatGPT's limitations in this area. METHODS: In our study, 80 RCC-related clinical questions from urology experts were posed three times to both ChatGPT 3.5 and ChatGPT 4.0, seeking binary (yes/no) responses. We then statistically analyzed the answers. Finally, we fine-tuned the GPT-3.5 Turbo model using these questions, and assessed its training outcomes. RESULTS: We found that the average accuracy rates of answers provided by ChatGPT versions 3.5 and 4.0 were 67.08% and 77.50%, respectively. ChatGPT 4.0 outperformed ChatGPT 3.5, with a higher accuracy rate in responses (p < 0.05). By counting the number of correct responses to the 80 questions, we then found that although ChatGPT 4.0 performed better (p < 0.05), both versions were subject to instability in answering. Finally, by fine-tuning the GPT-3.5 Turbo model, we found that the correct rate of responses to these questions could be stabilized at 93.75%. Iterative optimization of the model can result in 100% response accuracy. CONCLUSION: We compared ChatGPT versions 3.5 and 4.0 in addressing clinical RCC questions, identifying their limitations. By applying the GPT-3.5 Turbo fine-tuned model iterative training method, we enhanced AI strategies in renal oncology. This approach is set to enhance ChatGPT's database and clinical guidance capabilities, optimizing AI in this field.
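The fine-tuning step reported here follows OpenAI's standard workflow: upload a JSONL file of chat-formatted examples, then start a job. A minimal sketch, with a hypothetical training file and example question:

```python
# Sketch of the OpenAI fine-tuning workflow for gpt-3.5-turbo.
# "rcc_qa.jsonl" is a placeholder file of chat-formatted yes/no Q&A lines, e.g.:
# {"messages": [{"role": "user", "content": "Is RCC commonly associated with VHL mutations?"},
#               {"role": "assistant", "content": "Yes"}]}
from openai import OpenAI

client = OpenAI()
training_file = client.files.create(file=open("rcc_qa.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id, job.status)
```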


Subject(s)
Artificial Intelligence; Carcinoma, Renal Cell; Kidney Neoplasms; Humans; Kidney Neoplasms/pathology; Carcinoma, Renal Cell/pathology; Prognosis
10.
Liver Int ; 44(7): 1578-1587, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38651924

ABSTRACT

BACKGROUND AND AIMS: The Liver Imaging Reporting and Data System (LI-RADS) offers a standardized approach for imaging hepatocellular carcinoma. However, the diverse styles and structures of radiology reports complicate automatic data extraction. Large language models hold the potential for structured data extraction from free-text reports. Our objective was to evaluate the performance of Generative Pre-trained Transformer (GPT)-4 in extracting LI-RADS features and categories from free-text liver magnetic resonance imaging (MRI) reports. METHODS: Three radiologists generated 160 fictitious free-text liver MRI reports written in Korean and English, simulating real-world practice. Of these, 20 were used for prompt engineering, and 140 formed the internal test cohort. Seventy-two genuine reports, authored by 17 radiologists, were collected and de-identified for the external test cohort. LI-RADS features were extracted using GPT-4, with a Python script calculating categories. Accuracies in each test cohort were compared. RESULTS: On the external test, the accuracy for the extraction of major LI-RADS features, which encompass size, nonrim arterial phase hyperenhancement, nonperipheral 'washout', enhancing 'capsule' and threshold growth, ranged from .92 to .99. For the rest of the LI-RADS features, the accuracy ranged from .86 to .97. For the LI-RADS category, the model showed an accuracy of .85 (95% CI: .76, .93). CONCLUSIONS: GPT-4 shows promise in extracting LI-RADS features, yet further refinement of its prompting strategy and advancements in its neural network architecture are crucial for reliable use in processing complex real-world MRI reports.
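The two-stage design (LLM extracts features, deterministic code assigns the category) can be sketched as below. The JSON schema and the category logic are a simplified stand-in for the LI-RADS v2018 table, not the study's actual script.

```python
# Two-stage sketch: the LLM returns features as JSON; a deterministic function
# assigns the category. Simplified stand-in for the LI-RADS v2018 criteria.
import json

def assign_category(f: dict) -> str:
    majors = sum([f["washout"], f["capsule"], f["threshold_growth"]])
    if f["aphe"] and f["size_mm"] >= 20 and majors >= 1:
        return "LR-5"
    if f["aphe"] and 10 <= f["size_mm"] < 20 and majors >= 2:
        return "LR-5"
    return "LR-4 or lower (full criteria table needed)"

llm_output = '{"size_mm": 25, "aphe": true, "washout": true, "capsule": false, "threshold_growth": false}'
print(assign_category(json.loads(llm_output)))  # LR-5
```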


Subject(s)
Liver Neoplasms; Magnetic Resonance Imaging; Humans; Liver Neoplasms/diagnostic imaging; Carcinoma, Hepatocellular/diagnostic imaging; Natural Language Processing; Radiology Information Systems; Republic of Korea; Data Mining; Liver/diagnostic imaging
11.
BMC Med Res Methodol ; 24(1): 78, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38539117

ABSTRACT

BACKGROUND: The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformers (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data. METHODS: A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics such as sensitivity, specificity, positive and negative predictive values (PPV and NPV), workload saving, and others were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals. RESULTS: ChatGPT completed the screening process within an hour, while GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., including records if at least one person includes them). It also exhibited remarkably low false negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV compared to human raters. The average Kappa agreement between ChatGPT and other raters was 0.27. CONCLUSIONS: ChatGPT shows promise in automating the article screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.
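The reported metrics can be computed from include/exclude decisions in a few lines; the sketch below uses toy decisions, not the study's 1198 abstracts.

```python
# Sketch of the screening metrics, computed from toy include(1)/exclude(0) decisions.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

truth   = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # consensus ground truth
chatgpt = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # model's screening decisions

tn, fp, fn, tp = confusion_matrix(truth, chatgpt).ravel()
print(f"sensitivity = {tp / (tp + fn):.2f}")  # recall on included records
print(f"specificity = {tn / (tn + fp):.2f}")
print(f"PPV = {tp / (tp + fp):.2f}, NPV = {tn / (tn + fn):.2f}")
print(f"kappa = {cohen_kappa_score(truth, chatgpt):.2f}")
```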


Subject(s)
Benchmarking; Biomedical Research; Humans; Systematic Reviews as Topic; Computer Simulation; Consensus
12.
BMC Med Res Methodol ; 24(1): 139, 2024 Jun 25.
Article in English | MEDLINE | ID: mdl-38918736

ABSTRACT

BACKGROUND: Large language models (LLMs) that can efficiently screen and identify studies meeting specific criteria would streamline literature reviews. Additionally, those capable of extracting data from publications would enhance knowledge discovery by reducing the burden on human reviewers. METHODS: We created an automated pipeline utilizing the OpenAI GPT-4 32K API, version "2023-05-15", to evaluate the accuracy of the LLM GPT-4's responses to queries about published papers on HIV drug resistance (HIVDR), with and without an instruction sheet. The instruction sheet contained specialized knowledge designed to assist a person trying to answer questions about an HIVDR paper. We designed 60 questions pertaining to HIVDR and created markdown versions of 60 published HIVDR papers in PubMed. We presented the 60 papers to GPT-4 in four configurations: (1) all 60 questions simultaneously; (2) all 60 questions simultaneously with the instruction sheet; (3) each of the 60 questions individually; and (4) each of the 60 questions individually with the instruction sheet. RESULTS: GPT-4 achieved a mean accuracy of 86.9%, 24.0% higher than when the answers to papers were permuted. The overall recall and precision were 72.5% and 87.4%, respectively. The standard deviation of three replicates for the 60 questions ranged from 0 to 5.3%, with a median of 1.2%. The instruction sheet did not significantly increase GPT-4's accuracy, recall, or precision. GPT-4 was more likely to provide false positive answers when the 60 questions were submitted individually compared to when they were submitted together. CONCLUSIONS: GPT-4 reproducibly answered 3600 questions about 60 papers on HIVDR with moderately high accuracy, recall, and precision. The instruction sheet's failure to improve these metrics suggests that more sophisticated approaches are necessary. Either enhanced prompt engineering or fine-tuning an open-source model could further improve an LLM's ability to answer questions about highly specialized HIVDR papers.


Subject(s)
HIV Infections; Humans; Reproducibility of Results; HIV Infections/drug therapy; PubMed; Publications/statistics & numerical data; Publications/standards; Information Storage and Retrieval/methods; Information Storage and Retrieval/standards; Software
13.
Pediatr Blood Cancer ; : e31256, 2024 Aug 11.
Article in English | MEDLINE | ID: mdl-39129151

ABSTRACT

In the era of big data, young patients may be overwhelmed by artificial intelligence-based tools, like chatbots. Five clinical experts were asked to evaluate the performance of the most currently used chatbots in providing information on a rare cancer affecting young people, like rhabdomyosarcoma. Generally speaking, despite their high performance in giving general information about the disease, these chatbots were considered by the experts to be inadequate in providing suggestions on cancer treatments and specialized centers, and also lacking in "sensitivity." Efforts are planned by the pediatric oncology community to improve the quality of data used to train these tools.

14.
J Comput Aided Mol Des ; 38(1): 20, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38647700

ABSTRACT

In recent years, generative machine learning algorithms have been successful in designing innovative drug-like molecules. SMILES is a sequence-like language used in most effective drug design models. Because the data have a sequential structure, models such as recurrent neural networks and transformers can design pharmacological compounds with optimized efficacy. Large language models have advanced recently, but their implications for drug design have not yet been explored. Although one study successfully pre-trained a large chemistry model (LCM), its application to specific tasks in drug discovery is unknown. In this study, the drug design task is modeled as a causal language modeling problem. Thus, the procedure of reward modeling, supervised fine-tuning, and proximal policy optimization was used to transfer the LCM to drug design, similar to OpenAI's ChatGPT and InstructGPT procedures. By combining the SMILES sequence with chemical descriptors, the novel efficacy evaluation model surpassed the performance reported in previous studies. After proximal policy optimization, the drug design model generated molecules with 99.2% having efficacy pIC50 > 7 towards the amyloid precursor protein, with 100% of the generated molecules being valid and novel. This demonstrates the applicability of LCMs in drug discovery, with benefits including lower data consumption during fine-tuning. The applicability of LCMs to drug discovery opens the door for larger studies involving reinforcement learning with human feedback, where chemists provide feedback to LCMs and generate higher-quality molecules. LCMs' ability to design similar molecules from datasets paves the way for more accessible, non-patented alternatives to drug molecules.


Subject(s)
Drug Design; Humans; Machine Learning; Drug Discovery/methods; Algorithms; Neural Networks, Computer; Models, Chemical; Supervised Machine Learning
15.
J Gastroenterol Hepatol ; 39(1): 81-106, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37855067

ABSTRACT

BACKGROUND AND AIM: Colonoscopy is commonly used in screening and surveillance for colorectal cancer. Multiple different guidelines provide recommendations on the interval between colonoscopies. This can be challenging for non-specialist healthcare providers to navigate. Large language models like ChatGPT are a potential tool for parsing patient histories and providing advice. However, the standard GPT model is not designed for medical use and can hallucinate. One way to overcome these challenges is to provide contextual information with medical guidelines to help the model respond accurately to queries. Our study compares the standard GPT4 against a contextualized model provided with relevant screening guidelines. We evaluated whether the models could provide correct advice for screening and surveillance intervals for colonoscopy. METHODS: Relevant guidelines pertaining to colorectal cancer screening and surveillance were formulated into a knowledge base for GPT. We tested 62 example case scenarios (three times each) on standard GPT4 and on a contextualized model with the knowledge base. RESULTS: The contextualized GPT4 model outperformed the standard GPT4 in all domains. No high-risk features were missed, and only two cases had hallucination of additional high-risk features. A correct interval to colonoscopy was provided in the majority of cases. Guidelines were appropriately cited in almost all cases. CONCLUSIONS: A contextualized GPT4 model could identify high-risk features and quote appropriate guidelines without significant hallucination. It gave a correct interval to the next colonoscopy in the majority of cases. This provides proof of concept that ChatGPT with appropriate refinement can serve as an accurate physician assistant.
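"Contextualizing" the model here amounts to prepending the curated guideline text to the system prompt. A minimal sketch, assuming the OpenAI SDK and a placeholder guideline file:

```python
# Sketch of the contextualization approach: the relevant guideline text is injected
# into the system prompt so answers cite it. The guideline file is a placeholder,
# not real clinical content.
from openai import OpenAI

client = OpenAI()
GUIDELINES = open("colonoscopy_guidelines.txt").read()  # curated knowledge base

def advise(case: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the guidelines below, and cite the rule you used.\n\n" + GUIDELINES},
            {"role": "user", "content": case},
        ],
    )
    return resp.choices[0].message.content

print(advise("55-year-old, two 8 mm tubular adenomas removed. When is the next colonoscopy?"))
```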


Subject(s)
Colonoscopy; Colorectal Neoplasms; Humans; Colorectal Neoplasms/diagnosis; Colorectal Neoplasms/prevention & control; Colorectal Neoplasms/epidemiology; Risk Factors; Early Detection of Cancer; Hallucinations
16.
J Gastroenterol Hepatol ; 39(8): 1535-1543, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38627920

ABSTRACT

BACKGROUND AND AIM: Effective clinical event classification is essential for clinical research and quality improvement. The validation of artificial intelligence (AI) models like Generative Pre-trained Transformer 4 (GPT-4) for this task and comparison with conventional methods remains unexplored. METHODS: We evaluated the performance of the GPT-4 model for classifying gastrointestinal (GI) bleeding episodes from 200 medical discharge summaries and compared the results with human review and an International Classification of Diseases (ICD) code-based system. The analysis included accuracy, sensitivity, and specificity evaluation, using ground truth determined by physician reviewers. RESULTS: GPT-4 exhibited an accuracy of 94.4% in identifying GI bleeding occurrences, outperforming ICD codes (accuracy 63.5%, P < 0.001). GPT-4's accuracy was either slightly lower or statistically similar to individual human reviewers (Reviewer 1: 98.5%, P < 0.001; Reviewer 2: 90.8%, P = 0.170). For location classification, GPT-4 achieved accuracies of 81.7% and 83.5% for confirmed and probable GI bleeding locations, respectively, with figures that were either slightly lower or comparable with those of human reviewers. GPT-4 was highly efficient, analyzing the dataset in 12.7 min at a cost of 21.2 USD, whereas human reviewers required 8-9 h each. CONCLUSION: Our study indicates GPT-4 offers a reliable, cost-efficient, and faster alternative to current clinical event classification methods, outperforming the conventional ICD coding system and performing comparably to individual expert human reviewers. Its implementation could facilitate more accurate and granular clinical research and quality audits. Future research should explore scalability, prompt and model tuning, and ethical implications of high-performance AI models in clinical data processing.


Subject(s)
Artificial Intelligence; Gastrointestinal Hemorrhage; International Classification of Diseases; Humans; Gastrointestinal Hemorrhage/classification; Gastrointestinal Hemorrhage/etiology; Sensitivity and Specificity
17.
Philos Trans A Math Phys Eng Sci ; 382(2270): 20230254, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-38403056

ABSTRACT

In this paper, we experimentally evaluate the zero-shot performance of GPT-4 against prior generations of GPT on the entire uniform bar examination (UBE), including not only the multiple-choice multistate bar examination (MBE), but also the open-ended multistate essay exam (MEE) and multistate performance test (MPT) components. On the MBE, GPT-4 significantly outperforms both human test-takers and prior models, demonstrating a 26% increase over ChatGPT and beating humans in five of seven subject areas. On the MEE and MPT, which have not previously been evaluated by scholars, GPT-4 scores an average of 4.2/6.0 when compared with much lower scores for ChatGPT. Graded across the UBE components, in the manner in which a human test-taker would be, GPT-4 scores approximately 297 points, significantly in excess of the passing threshold for all UBE jurisdictions. These findings document not just the rapid and remarkable advance of large language model performance generally, but also the potential for such models to support the delivery of legal services in society. This article is part of the theme issue 'A complexity science approach to law and governance'.

18.
J Biomed Inform ; 157: 104706, 2024 Aug 08.
Article in English | MEDLINE | ID: mdl-39121932

ABSTRACT

OBJECTIVE: To develop an Artificial Intelligence (AI)-based anomaly detection model as a complement of an "astute physician" in detecting novel disease cases in a hospital and preventing emerging outbreaks. METHODS: Data included hospitalized patients (n = 120,714) at a safety-net hospital in Massachusetts. A novel Generative Pre-trained Transformer (GPT)-based clinical anomaly detection system was designed and further trained using Empirical Risk Minimization (ERM), which can model a hospitalized patient's Electronic Health Records (EHR) and detect atypical patients. Methods and performance metrics, similar to the ones behind the recent Large Language Models (LLMs), were leveraged to capture the dynamic evolution of the patient's clinical variables and compute an Out-Of-Distribution (OOD) anomaly score. RESULTS: In a completely unsupervised setting, hospitalizations for Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection could have been predicted by our GPT model at the beginning of the COVID-19 pandemic, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 92.2%, using 31 extracted clinical variables and a 3-day detection window. Our GPT achieves individual patient-level anomaly detection and mortality prediction AUC of 78.3% and 94.7%, outperforming traditional linear models by 6.6% and 9%, respectively. Different types of clinical trajectories of a SARS-CoV-2 infection are captured by our model to make interpretable detections, while a trend of over-pessimistic outcome prediction yields a more effective detection pathway. Furthermore, our comprehensive GPT model can potentially assist clinicians with forecasting patient clinical variables and developing personalized treatment plans. CONCLUSION: This study demonstrates that an emerging outbreak can be accurately detected within a hospital, by using a GPT to model patient EHR time sequences and labeling them as anomalous when actual outcomes are not supported by the model. Such a GPT is also a comprehensive model with the functionality of generating future patient clinical variables, which can potentially assist clinicians in developing personalized treatment plans.
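The core idea, scoring a patient trajectory by how improbable the model finds it, can be sketched with an off-the-shelf causal language model: the mean per-token negative log-likelihood serves as the OOD anomaly score. GPT-2 and the text encoding below are stand-ins for the study's custom EHR model, and the expectation (not a guarantee) is that atypical trajectories score higher.

```python
# Sketch of a likelihood-based OOD score. GPT-2 and the textual encoding of clinical
# variables are stand-ins for the study's custom GPT trained on EHR sequences.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def anomaly_score(trajectory: str) -> float:
    """Mean per-token negative log-likelihood of the trajectory under the model."""
    ids = tokenizer(trajectory, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # cross-entropy over shifted tokens
    return loss.item()

typical = "day1 temp 37.0 spo2 98 wbc 7.2 | day2 temp 36.9 spo2 97 wbc 7.0"
atypical = "day1 temp 39.4 spo2 84 wbc 2.1 | day2 temp 40.1 spo2 78 wbc 1.4"
print(anomaly_score(typical), anomaly_score(atypical))
```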

19.
Neuroradiology ; 66(8): 1245-1250, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38705899

ABSTRACT

We compared different LLMs, notably ChatGPT 3.5, GPT-4, and Google Bard, and tested whether their performance differs in subspeciality domains when executing examinations from four different courses of the European Society of Neuroradiology (ESNR): anatomy/embryology, neuro-oncology, head and neck, and pediatrics. Written exams of the ESNR were used as input data, related to anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were introduced to the three LLMs: ChatGPT 3.5, GPT-4, and Google Bard. Statistical analyses included a group-wise Friedman test followed by a pair-wise Wilcoxon test with multiple comparison corrections. Overall, there was a significant difference between the three LLMs (p < 0.0001), with GPT-4 having the highest accuracy (70%), followed by ChatGPT 3.5 (54%) and Google Bard (36%). The pair-wise comparison showed significant differences between ChatGPT 3.5 vs GPT-4 (p < 0.0001), ChatGPT 3.5 vs Bard (p < 0.0023), and GPT-4 vs Bard (p < 0.0001). Analyses per subspeciality showed the highest difference between the best LLM (GPT-4, 70%) and the worst LLM (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT-4, 62% vs Google Bard, 48%). We observed significant differences in the performance of the three LLMs on official exams organized by the ESNR. Overall, GPT-4 performed best and Google Bard worst; this difference varied by subspeciality and was most pronounced in the head and neck subspeciality.
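The test sequence (group-wise Friedman, then pair-wise Wilcoxon) is available in scipy; the sketch below runs it on invented per-question correctness scores (1 = correct), not the exam data.

```python
# Sketch of the statistical comparison on toy per-question scores (1 = correct).
from scipy.stats import friedmanchisquare, wilcoxon

gpt4      = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
chatgpt35 = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
bard      = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

print(friedmanchisquare(gpt4, chatgpt35, bard))  # group-wise difference across the 3 LLMs
print(wilcoxon(gpt4, chatgpt35))                 # one pair-wise comparison (correct p-values
                                                 # would still need multiple-comparison correction)
```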


Subject(s)
Societies, Medical; Humans; Europe; Educational Measurement; Radiology/education; Neuroradiography
20.
Neuroradiology ; 66(1): 73-79, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37994939

ABSTRACT

PURPOSE: The noteworthy performance of Chat Generative Pre-trained Transformer (ChatGPT), an artificial intelligence text generation model based on the GPT-4 architecture, has been demonstrated in various fields; however, its potential applications in neuroradiology remain unexplored. This study aimed to evaluate the diagnostic performance of GPT-4 based ChatGPT in neuroradiology. METHODS: We collected 100 consecutive "Case of the Week" cases from the American Journal of Neuroradiology between October 2021 and September 2023. ChatGPT generated a diagnosis from the patient's medical history and imaging findings for each case. The diagnostic accuracy rate was then determined using the published ground truth. Each case was categorized by anatomical location (brain, spine, and head & neck), and brain cases were further divided into central nervous system (CNS) tumor and non-CNS tumor groups. Fisher's exact test was conducted to compare the accuracy rates among the three anatomical locations, as well as between the CNS tumor and non-CNS tumor groups. RESULTS: ChatGPT achieved a diagnostic accuracy rate of 50% (50/100 cases). There were no significant differences between the accuracy rates of the three anatomical locations (p = 0.89). The accuracy rate was significantly lower for the CNS tumor group compared to the non-CNS tumor group in the brain cases (16% [3/19] vs. 62% [36/58], p < 0.001). CONCLUSION: This study demonstrated the diagnostic performance of ChatGPT in neuroradiology. ChatGPT's diagnostic accuracy varied depending on disease etiologies, and its diagnostic accuracy was significantly lower in CNS tumors compared to non-CNS tumors.
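The CNS vs non-CNS comparison can be reproduced from the counts quoted in the abstract (3/19 vs 36/58 correct) with Fisher's exact test in scipy:

```python
# The CNS-tumor vs non-CNS-tumor comparison, rebuilt from the counts in the abstract.
from scipy.stats import fisher_exact

table = [[3, 19 - 3],    # CNS tumor: correct, incorrect
         [36, 58 - 36]]  # non-CNS tumor: correct, incorrect
odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.4f}")
```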


Subject(s)
Artificial Intelligence; Neoplasms; Humans; Head; Brain; Neck