Results 1 - 20 of 317
2.
Nucleic Acids Res ; 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-38572754

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
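The PubTator 3.0 annotations described above are served over a web API in BioC-JSON format. Below is a minimal Python sketch of what a client might look like; the exact export endpoint path is an assumption to be checked against the PubTator 3.0 API documentation before use, and the response document is mocked rather than fetched.

```python
from urllib.parse import urlencode

# Assumed endpoint path -- verify against the PubTator 3.0 API docs.
PUBTATOR3_EXPORT = ("https://www.ncbi.nlm.nih.gov/research/"
                    "pubtator3-api/publications/export/biocjson")

def export_url(pmids):
    """Build an export URL for a list of PubMed IDs."""
    return PUBTATOR3_EXPORT + "?" + urlencode({"pmids": ",".join(map(str, pmids))})

def extract_annotations(biocjson_doc):
    """Flatten (text, type, identifier) triples from one BioC-JSON document."""
    out = []
    for passage in biocjson_doc.get("passages", []):
        for ann in passage.get("annotations", []):
            infons = ann.get("infons", {})
            out.append((ann.get("text"), infons.get("type"), infons.get("identifier")))
    return out

# Minimal mock of a BioC-JSON response, just to show the shape:
doc = {"passages": [{"annotations": [
    {"text": "BRCA1", "infons": {"type": "Gene", "identifier": "672"}}]}]}
print(export_url([38572754]))
print(extract_annotations(doc))  # [('BRCA1', 'Gene', '672')]
```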

3.
J Biomed Inform ; 153: 104640, 2024 May.
Article in English | MEDLINE | ID: mdl-38608915

ABSTRACT

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.


Subject(s)
Artificial Intelligence , Evidence-Based Medicine , Humans , Trust , Natural Language Processing
4.
J Med Internet Res ; 26: e56655, 2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38630520

ABSTRACT

BACKGROUND: Although patients have easy access to their electronic health records and laboratory test result data through patient portals, laboratory test results are often confusing and hard to understand. Many patients turn to web-based forums or question-and-answer (Q&A) sites to seek advice from their peers. The quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to have their questions answered. OBJECTIVE: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to laboratory test-related questions asked by patients and identify potential issues that can be mitigated using augmentation approaches. METHODS: We collected laboratory test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including Recall-Oriented Understudy for Gisting Evaluation, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Bidirectional Encoder Representations from Transformers Score. We used an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. We performed a manual evaluation with medical experts for all the responses to 7 selected questions on the same 4 aspects. RESULTS: Regarding the similarity of the responses from the 4 other LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca.
Human answers from Yahoo data were scored the lowest and, thus, as the least similar to GPT-4-generated answers. The results of the win rate and medical expert evaluations both showed that GPT-4's responses achieved better scores than all the other LLM responses and the human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from a lack of interpretation in the patient's medical context, incorrect statements, and a lack of references. CONCLUSIONS: By evaluating LLMs in generating responses to patients' laboratory test result-related questions, we found that, compared to the other 4 LLMs and the human answers from a Q&A website, GPT-4's responses were more accurate, helpful, relevant, and safer. There were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.
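The similarity metrics named in the abstract (ROUGE, BLEU, METEOR, BERTScore) all compare a candidate answer against a reference answer. As a hedged illustration of the simplest of these, here is a self-contained ROUGE-1 F1 computation; the example Q&A sentences are invented, and this is not the authors' evaluation pipeline.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE F1: harmonic mean of unigram precision and recall,
    computed on lowercased whitespace tokens."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer and model answer about a lab result:
ref = "your tsh level is slightly above the normal range"
cand = "the tsh level is above normal range"
print(round(rouge1_f1(cand, ref), 3))  # 0.875
```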


Subject(s)
Camelids, New World , Humans , Animals , Benchmarking , Electronic Health Records , Engineering , Language
5.
J Biomed Inform ; 154: 104646, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38677633

ABSTRACT

OBJECTIVES: Artificial intelligence (AI) systems have the potential to revolutionize clinical practices, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender. Such biases can occur before, during, or after the development of AI models, making it critical to understand and address potential biases to enable the accurate and reliable application of AI models in clinical settings. To mitigate bias concerns during model development, we surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV). Then we discussed the methods, such as data perturbation and adversarial learning, that have been applied in the biomedical domain to address bias. METHODS: We performed our literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then filtered the result of 10,041 articles automatically with loose constraints, and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review. Additional articles in the references are also included in this review. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness. RESULTS: The bias of AIs in biomedicine can originate from multiple sources such as insufficient data, sampling bias and the use of health-irrelevant features or race-adjusted algorithms. Existing debiasing methods that focus on algorithms can be categorized into distributional or algorithmic. 
Distributional methods include data augmentation, data perturbation, data reweighting methods, and federated learning. Algorithmic approaches include unsupervised representation learning, adversarial learning, disentangled representation learning, loss-based methods and causality-based methods.
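One of the distributional methods named above, data reweighting, can be sketched as inverse-frequency sample weights over (label, demographic group) cells so that under-represented cells contribute more to the training loss. This is a generic illustration, not a method from any specific surveyed paper; the labels and groups are hypothetical.

```python
from collections import Counter

def reweight(labels, groups):
    """Inverse-frequency sample weights over (label, group) cells,
    normalized so the mean weight is 1.0."""
    cells = list(zip(labels, groups))
    counts = Counter(cells)
    raw = [1.0 / counts[c] for c in cells]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# Hypothetical dataset: group "B" is under-represented for label 1,
# so its sample receives a larger weight.
labels = [1, 1, 1, 0]
groups = ["A", "A", "B", "B"]
print(reweight(labels, groups))
```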


Subject(s)
Artificial Intelligence , Bias , Natural Language Processing , Humans , Surveys and Questionnaires , Machine Learning , Algorithms
6.
Ophthalmology ; 2024 Apr 23.
Article in English | MEDLINE | ID: mdl-38657840

ABSTRACT

PURPOSE: To update the Age-Related Eye Disease Study (AREDS) simplified severity scale for risk of late age-related macular degeneration (AMD), including incorporation of reticular pseudodrusen (RPD), and to perform external validation on the Age-Related Eye Disease Study 2 (AREDS2). DESIGN: Post hoc analysis of 2 clinical trial cohorts: AREDS and AREDS2. PARTICIPANTS: Participants with no late AMD in either eye at baseline in AREDS (n = 2719) and AREDS2 (n = 1472). METHODS: Five-year rates of progression to late AMD were calculated according to levels 0 to 4 on the simplified severity scale after 2 updates: (1) noncentral geographic atrophy (GA) considered part of the outcome, rather than a risk feature, and (2) scale separation according to RPD status (determined by validated deep learning grading of color fundus photographs). MAIN OUTCOME MEASURES: Five-year rate of progression to late AMD (defined as neovascular AMD or any GA). RESULTS: In the AREDS, after the first scale update, the 5-year rates of progression to late AMD for levels 0 to 4 were 0.3%, 4.5%, 12.9%, 32.2%, and 55.6%, respectively. As the final simplified severity scale, the 5-year progression rates for levels 0 to 4 were 0.3%, 4.3%, 11.6%, 26.7%, and 50.0%, respectively, for participants without RPD at baseline and 2.8%, 8.0%, 29.0%, 58.7%, and 72.2%, respectively, for participants with RPD at baseline. In external validation on the AREDS2, for levels 2 to 4, the progression rates were similar: 15.0%, 27.7%, and 45.7% (RPD absent) and 26.2%, 46.0%, and 73.0% (RPD present), respectively. CONCLUSIONS: The AREDS AMD simplified severity scale has been modernized with 2 important updates. The new scale for individuals without RPD has 5-year progression rates of approximately 0.5%, 4%, 12%, 25%, and 50%, such that the rates on the original scale remain accurate. 
The new scale for individuals with RPD has 5-year progression rates of approximately 3%, 8%, 30%, 60%, and 70%, that is, approximately double for most levels. This scale fits updated definitions of late AMD, has increased prognostic accuracy, seems generalizable to similar populations, but remains simple for broad risk categorization. FINANCIAL DISCLOSURE(S): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
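The updated progression rates reported in the abstract lend themselves to a simple lookup keyed on severity level and RPD status. The sketch below encodes only the AREDS-derived rates quoted above; it is an illustration of the scale's structure, not a clinical tool.

```python
# Five-year progression rates (%) to late AMD by simplified severity scale
# level 0-4, from the updated AREDS scale described above, split by RPD status.
RATES = {
    False: [0.3, 4.3, 11.6, 26.7, 50.0],  # RPD absent at baseline
    True:  [2.8, 8.0, 29.0, 58.7, 72.2],  # RPD present at baseline
}

def five_year_risk(level: int, rpd: bool) -> float:
    """Return the 5-year late-AMD progression rate (%) for a scale level."""
    if not 0 <= level <= 4:
        raise ValueError("severity level must be 0-4")
    return RATES[rpd][level]

print(five_year_risk(3, rpd=True))   # 58.7
print(five_year_risk(0, rpd=False))  # 0.3
```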

7.
Sci Total Environ ; 929: 172646, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38653417

ABSTRACT

Agroforestry waste and cow manure pollute the environment, and agroforestry waste in particular is difficult to degrade. Composting is an effective way to dispose of agroforestry waste; however, the low degradation efficiency of lignocellulose in agroforestry waste affects the process of composting humification. This study investigated lignocellulose degradation and composting humification in full-size apple wood and cow manure composting processes by applying different pretreatments (acidic, alkaline, and high-temperature) to the apple wood. Simultaneously, physicochemical characterization and metagenome sequencing were combined to analyze functions annotated with the carbohydrate-active enzymes database (CAZy). In this way, microbial communities were linked to their functions during the composting process, and the lignocellulose degradation mechanism was elaborated. The results showed that the addition of apple wood increased the compost humus (HS) yield, and pretreatment of the apple wood enhanced lignocellulose degradation during composting. In addition, pretreatment improved physicochemical properties of the compost, such as temperature, pH, electrical conductivity (EC), ammonium nitrogen (NH4+), and nitrate nitrogen (NO3-); the acid-treated apple wood compost (AcAWC) achieved the highest temperature of 58.4 °C, effectively promoting nitrification, with NO3- ultimately reaching 0.127 g/kg. In all composts, microbial networks constructed a high proportion of positively correlated connections, and microorganisms promoted the composting process through cooperation. The proportions of glycosyltransferases (GT) and glycoside hydrolases (GH) promoted the separation and degradation of lignocellulose during composting to form HS. Notably, the adverse effects of the alkali-treated apple wood compost on bacteria were greater.
AcAWC showed significant correlations between bacterial and fungal communities and both lignin and hemicellulose, and had more biomarkers associated with lignocellulose degradation and humification. The lignin degradation rate was 24.57 % and the HS yield increased by 27.49 %. Therefore, AcAWC has been confirmed to enhance lignocellulose degradation and promote compost humification by altering the properties of the apple wood and establishing a richer microbial community.


Subject(s)
Composting , Lignin , Malus , Manure , Wood , Lignin/metabolism , Animals , Cattle , Biomass , Humic Substances , Biodegradation, Environmental
8.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38514400

ABSTRACT

MOTIVATION: Large Language Models (LLMs) have the potential to revolutionize the field of Natural Language Processing, excelling not only in text generation and reasoning tasks but also in their capacity for zero/few-shot learning, swiftly adapting to new tasks with minimal fine-tuning. LLMs have also demonstrated great promise in biomedical and healthcare applications. However, when it comes to Named Entity Recognition (NER), particularly within the biomedical domain, LLMs fall short of the effectiveness exhibited by fine-tuned domain-specific models. One key reason is that NER is typically conceptualized as a sequence labeling task, whereas LLMs are optimized for text generation and reasoning tasks. RESULTS: We developed an instruction-based learning paradigm that transforms biomedical NER from a sequence labeling task into a generation task. This paradigm is end-to-end and streamlines the training and evaluation process by automatically repurposing pre-existing biomedical NER datasets. We further developed BioNER-LLaMA using the proposed paradigm with LLaMA-7B as the foundational LLM. We conducted extensive testing of BioNER-LLaMA across three widely recognized biomedical NER datasets, consisting of entities related to diseases, chemicals, and genes. The results revealed that BioNER-LLaMA consistently achieved F1-scores 5% to 30% higher than the few-shot learning capabilities of GPT-4 on datasets with different biomedical entities. We show that a general-domain LLM can match the performance of rigorously fine-tuned PubMedBERT models and PMC-LLaMA, a biomedical-specific language model. Our findings underscore the potential of our proposed paradigm in developing general-domain LLMs that can rival SOTA performance in multi-task, multi-domain scenarios in biomedical and health applications. AVAILABILITY AND IMPLEMENTATION: Datasets and other resources are available at https://github.com/BIDS-Xu-Lab/BioNER-LLaMA.
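The recasting of NER from sequence labeling into generation described above can be illustrated by converting a BIO-tagged example into an instruction/target pair for a generative model. The prompt wording below is hypothetical, not the paper's actual template.

```python
def bio_to_generation(tokens, tags):
    """Convert BIO-tagged tokens into an (instruction, target) pair for a
    generative LLM. The instruction/answer format here is illustrative."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" tag closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    instruction = ("Extract all disease mentions from the sentence below.\n"
                   "Sentence: " + " ".join(tokens))
    target = "; ".join(entities) if entities else "None"
    return instruction, target

toks = ["Metformin", "treats", "type", "2", "diabetes", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O"]
print(bio_to_generation(toks, tags)[1])  # type 2 diabetes
```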


Subject(s)
Camelids, New World , Deep Learning , Animals , Language , Natural Language Processing
9.
ArXiv ; 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38529075

ABSTRACT

Background: Even though patients have easy access to their electronic health records and lab test result data through patient portals, lab results are often confusing and hard to understand. Many patients turn to online forums or question-and-answer (Q&A) sites to seek advice from their peers. However, the quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to get their questions answered. Objective: We aim to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to lab test-related questions asked by patients and to identify potential issues that can be mitigated with augmentation approaches. Methods: We first collected lab test result-related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from four LLMs: GPT-4, Meta LLaMA 2, MedAlpaca, and ORCA_mini. We first assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including ROUGE, BLEU, METEOR, and BERTScore. We also utilized an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. Finally, we performed a manual evaluation with medical experts for all the responses to seven selected questions on the same four aspects. Results: Regarding the similarity of the responses from the four LLMs, with the GPT-4 output used as the reference answer, the responses from LLaMA 2 were the most similar, followed by those from ORCA_mini and MedAlpaca. Human answers from Yahoo data were scored the lowest and thus were the least similar to GPT-4-generated answers.
The results of the win rate and medical expert evaluations both showed that GPT-4's responses achieved better scores than all the other LLM responses and the human responses on all four aspects (relevance, correctness, helpfulness, and safety). However, LLM responses occasionally also suffered from a lack of interpretation in the patient's medical context, incorrect statements, and a lack of references. Conclusions: By evaluating LLMs in generating responses to patients' lab test result-related questions, we find that, compared to the other three LLMs and the human answers from the Q&A website, GPT-4's responses are more accurate, helpful, relevant, and safer. However, there are cases in which GPT-4 responses are inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation.

10.
Comput Med Imaging Graph ; 114: 102363, 2024 06.
Article in English | MEDLINE | ID: mdl-38447381

ABSTRACT

Reliable localization of lymph nodes (LNs) in multi-parametric MRI (mpMRI) studies plays a major role in the assessment of lymphadenopathy and staging of metastatic disease. Radiologists routinely measure the nodal size in order to distinguish benign from malignant nodes, which require subsequent cancer staging. However, identification of lymph nodes is a cumbersome task due to their myriad appearances in mpMRI studies. Multiple sequences are acquired in mpMRI studies, including T2 fat suppressed (T2FS) and diffusion weighted imaging (DWI) sequences among others; consequently, the sizing of LNs is rendered challenging due to the variety of signal intensities in these sequences. Furthermore, radiologists can miss potentially metastatic LNs during a busy clinical day. To lighten these imaging and workflow challenges, we propose a computer-aided detection (CAD) pipeline to detect both benign and malignant LNs in the body for their subsequent measurement. We employed the recently proposed Dynamic Head (DyHead) neural network to detect LNs in mpMRI studies that were acquired using a variety of scanners and exam protocols. The T2FS and DWI series were co-registered, and a selective augmentation technique called Intra-Label LISA (ILL) was used to blend the two volumes with the interpolation factor drawn from a Beta distribution. In this way, ILL diversified the samples that the model encountered during the training phase, while the requirement for both sequences to be present at test time was nullified. Our results showed a mean average precision (mAP) of 53.5% and a sensitivity of ∼78% with ILL at 4 FP/vol. This corresponded to an improvement of ≥10% in mAP and ≥12% in sensitivity at 4 FP/vol (p < 0.05) respectively over current LN detection approaches evaluated on the same dataset.
We also established the out-of-distribution robustness of the DyHead model by training it on data acquired by a Siemens Aera scanner and testing it on data from the Siemens Verio, Siemens Biograph mMR, and Philips Achieva scanners. Our pilot work represents an important first step towards automated detection, segmentation, and classification of lymph nodes in mpMRI.
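The ILL blending step described above (mixing the co-registered T2FS and DWI volumes with a Beta-distributed interpolation factor) resembles intra-study mixup. A minimal sketch follows, assuming a symmetric Beta(alpha, alpha) with alpha = 0.2; the abstract does not state the distribution parameters, so these are labeled assumptions, and volumes are flattened to 1-D lists to keep the example dependency-free.

```python
import random

def intra_label_blend(t2fs, dwi, alpha=0.2, rng=random):
    """Blend two co-registered, same-shape volumes (flattened to 1-D lists)
    with a mixing factor drawn from a Beta distribution, in the spirit of the
    ILL augmentation above. Beta(alpha, alpha) with alpha=0.2 is an
    assumption, not a parameter from the paper."""
    lam = rng.betavariate(alpha, alpha)
    blended = [lam * a + (1.0 - lam) * b for a, b in zip(t2fs, dwi)]
    return blended, lam

random.seed(0)
t2fs = [0.0, 0.0, 0.0, 0.0]  # stand-in for T2FS voxel intensities
dwi = [1.0, 1.0, 1.0, 1.0]   # stand-in for co-registered DWI voxels
blended, lam = intra_label_blend(t2fs, dwi)
# Constant inputs blend to the constant value (1 - lam):
print(all(abs(v - (1.0 - lam)) < 1e-12 for v in blended))  # True
```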


Subject(s)
Multiparametric Magnetic Resonance Imaging , Humans , Lymphatic Metastasis/diagnostic imaging , Lymphatic Metastasis/pathology , Diffusion Magnetic Resonance Imaging/methods , Lymph Nodes/diagnostic imaging , Neoplasm Staging
11.
ArXiv ; 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38529077

ABSTRACT

Objectives: Artificial intelligence (AI) systems have the potential to revolutionize clinical practices, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender. Such biases can occur before, during, or after the development of AI models, making it critical to understand and address potential biases to enable the accurate and reliable application of AI models in clinical settings. To mitigate bias concerns during model development, we surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV). Then we discussed the methods, such as data perturbation and adversarial learning, that have been applied in the biomedical domain to address bias. Methods: We performed our literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then filtered the result of 10,041 articles automatically with loose constraints, and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review. Additional articles in the references are also included in this review. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness. Results: The bias of AIs in biomedicine can originate from multiple sources such as insufficient data, sampling bias and the use of health-irrelevant features or race-adjusted algorithms. Existing debiasing methods that focus on algorithms can be categorized into distributional or algorithmic. 
Distributional methods include data augmentation, data perturbation, data reweighting methods, and federated learning. Algorithmic approaches include unsupervised representation learning, adversarial learning, disentangled representation learning, loss-based methods and causality-based methods.

12.
J Am Chem Soc ; 146(12): 7950-7955, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38483267

ABSTRACT

Single-site catalysts (SSCs) achieve a high catalytic performance through atomically dispersed active sites. A challenge facing the development of SSCs is aggregation of active catalytic species. Reducing the loading of these sites to very low levels is a common strategy to mitigate aggregation and sintering; however, this limits the tools that can be used to characterize the SSCs. Here we report a sintering-resistant SSC with high loading that is achieved by incorporating Anderson-Evans polyoxometalate clusters (POMs, MMo6O24, M = Rh/Pt) within NU-1000, a Zr-based metal-organic framework (MOF). The dual confinement provided by isolating the active site within the POM, then isolating the POMs within the MOF, facilitates the formation of isolated noble metal sites with low coordination numbers via exsolution from the POM during activation. The high loading (up to 3.2 wt %) that can be achieved without sintering allowed the local structure transformation in the POM cluster and the surrounding MOF to be evaluated using in situ X-ray scattering with pair distribution function (PDF) analysis. Notably, the Rh/Pt···Mo distance in the active catalyst is shorter than the M···M bond lengths in the respective bulk metals. Models of the active cluster structure were identified based on the PDF data with complementary computation and X-ray absorption spectroscopy analysis.

13.
ArXiv ; 2024 Jan 19.
Article in English | MEDLINE | ID: mdl-38410657

ABSTRACT

PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases, and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.

14.
ArXiv ; 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38410646

ABSTRACT

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.

15.
EBioMedicine ; 100: 104988, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38306900

ABSTRACT

Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming at helping readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: 1. Evidence-based medicine. 2. Precision medicine and genomics. 3. Searching by meaning, including questions. 4. Finding related articles with literature recommendation. 5. Discovering hidden associations through literature mining. Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.


Subject(s)
Artificial Intelligence , Biomedical Research , Humans , Data Mining , PubMed , Delivery of Health Care
16.
Bioinformatics ; 40(2)2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38341654

ABSTRACT

MOTIVATION: While large language models (LLMs) have been successfully applied to various tasks, they still face challenges with hallucinations. Augmenting LLMs with domain-specific tools such as database utilities can facilitate easier and more precise access to specialized knowledge. In this article, we present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) for answering genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by in-context learning and an augmented decoding algorithm that can detect and execute API calls. RESULTS: Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, largely surpassing retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12). Our further analyses suggest that: First, API demonstrations have good cross-task generalizability and are more useful than documentations for in-context learning; second, GeneGPT can generalize to longer chains of API calls and answer multi-hop questions in GeneHop, a novel dataset introduced in this work; finally, different types of errors are enriched in different tasks, providing valuable insights for future improvements. AVAILABILITY AND IMPLEMENTATION: The GeneGPT code and data are publicly available at https://github.com/ncbi/GeneGPT.
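The NCBI Web APIs that GeneGPT learns to call are the E-utilities. As a hedged sketch of the kind of call such an agent emits, the helper below only builds an esearch URL; the LMP10 query is a made-up example, and no network request is sent.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """URL for an NCBI E-utilities esearch call returning JSON."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{query}"

# The kind of call a GeneGPT-style agent might emit for a question such as
# "what is the official symbol for LMP10?" (illustrative query):
print(esearch_url("gene", "LMP10"))
```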


Subject(s)
Algorithms , Benchmarking , Databases, Factual , Documentation , Language
17.
ArXiv ; 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38410650

ABSTRACT

Large language models like GPT-3.5-turbo and GPT-4 hold promise for healthcare professionals, but they may inadvertently inherit biases during their training, potentially affecting their utility in medical applications. Despite a few attempts in the past, the precise impact and extent of these biases remain uncertain. Through both qualitative and quantitative analyses, we find that these models tend to project higher costs and longer hospitalizations for White populations and exhibit optimistic views in challenging medical scenarios with much higher survival rates. These biases, which mirror real-world healthcare disparities, are evident in the generation of patient backgrounds, the association of specific diseases with certain races, and disparities in treatment recommendations, etc. Our findings underscore the critical need for future research to address and mitigate biases in language models, especially in critical healthcare applications, to ensure fair and accurate outcomes for all patients.

18.
Article in English | MEDLINE | ID: mdl-38281112

ABSTRACT

IMPORTANCE: The study highlights the potential of large language models, specifically GPT-3.5 and GPT-4, in processing complex clinical data and extracting meaningful information with minimal training data. By developing and refining prompt-based strategies, we can significantly enhance the models' performance, making them viable tools for clinical NER tasks and possibly reducing the reliance on extensive annotated datasets. OBJECTIVES: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance. MATERIALS AND METHODS: We evaluated these models on 2 clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) to identify nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT. RESULTS: Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and 0.804 for MTSamples and 0.301 and 0.593 for VAERS, respectively. Additional prompt components consistently improved model performance. When all 4 components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 for MTSamples and 0.676 and 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for the MTSamples dataset and 0.802 for VAERS), they are very promising considering that few training samples are needed.
DISCUSSION: The study's findings suggest a promising direction in leveraging LLMs for clinical NER tasks. However, while the performance of GPT models improved with task-specific prompts, there's a need for further development and refinement. LLMs like GPT-4 show potential in achieving close performance to state-of-the-art models like BioClinicalBERT, but they still require careful prompt engineering and understanding of task-specific knowledge. The study also underscores the importance of evaluation schemas that accurately reflect the capabilities and performance of LLMs in clinical settings. CONCLUSION: While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.
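The four-component prompt framework described above can be sketched as a simple prompt builder. This is a minimal illustration only: the function name, the example strings, and the output format are assumptions, not the authors' actual prompts.

```python
# Illustrative sketch of the four-component clinical NER prompt framework:
# (1) baseline task description and format specification,
# (2) annotation-guideline-based instructions,
# (3) error-analysis-based instructions,
# (4) annotated samples for few-shot learning.
# All prompt text below is a hypothetical stand-in for the paper's prompts.

def build_ner_prompt(note, guidelines="", error_notes="", few_shot=()):
    """Compose a clinical NER prompt from optional components."""
    # (1) Baseline: task description and output format specification.
    parts = [
        "Extract all medical problems, treatments, and tests from the "
        "clinical note below.",
        "Return one entity per line as: <entity text> | <entity type>.",
    ]
    # (2) Annotation-guideline-based instructions.
    if guidelines:
        parts.append("Annotation guidelines:\n" + guidelines)
    # (3) Instructions distilled from error analysis.
    if error_notes:
        parts.append("Common mistakes to avoid:\n" + error_notes)
    # (4) Annotated samples for few-shot learning.
    for sample, annotation in few_shot:
        parts.append(f"Example note:\n{sample}\nExample output:\n{annotation}")
    parts.append("Clinical note:\n" + note)
    return "\n\n".join(parts)

prompt = build_ner_prompt(
    "Patient denies chest pain; started metformin 500 mg.",
    guidelines="Include drug names as treatments.",
    error_notes="Do not extract negated findings as problems.",
    few_shot=[("Aspirin given for headache.",
               "aspirin | treatment\nheadache | problem")],
)
```

The composed prompt would then be sent to the GPT model; the study's results suggest that each added component tends to improve extraction quality over the baseline alone.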

19.
Small ; 20(3): e2305881, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37670528

ABSTRACT

Core-shell metal-organic frameworks (MOF@MOF) are promising materials with sophisticated structures that can not only enhance the properties of MOFs but also endow them with new functions. The growth of isotopic core-shell MOFs is mostly limited to inconvenient stepwise seeding strategies with strict requirements, and to date one-pot synthesis remains a great challenge due to the interference of different components. Through two pairs of isoreticular MOFs, this study reveals that structural incompatibility is a prerequisite for the formation of MOFs@MOFs by one-pot synthesis, as illustrated by PMOF-3@HHU-9. It further unveils that the adaptability of the shell MOF is the more decisive factor for nucleation kinetic control. MOFs with flexible linkers have comparably slower nucleation than MOFs with rigid linkers (forming PMOF-3@NJU-Bai21), and structurally flexible MOFs built from flexible linkers show the slowest nucleation and the greatest adaptability (affording NJU-Bai21@HHU-9). This variation in the degree of adaptability controls the growth sequence and further enables the synthesis of the first triple-layered core-shell MOF (PMOF-3@NJU-Bai21@HHU-9) by one-pot synthesis. The insight gained from this study will aid in the rational design and synthesis of other multi-shelled structures by one-pot synthesis and the further expansion of their applications.

20.
Int J Comput Assist Radiol Surg ; 19(1): 163-170, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37326816

ABSTRACT

PURPOSE: Reliable measurement of lymph nodes (LNs) in multi-parametric MRI (mpMRI) studies of the body plays a major role in the assessment of lymphadenopathy and the staging of metastatic disease. Previous approaches do not adequately exploit the complementary sequences in mpMRI to universally detect and segment lymph nodes, and they have shown fairly limited performance. METHODS: We propose a computer-aided detection and segmentation pipeline that leverages the T2 fat-suppressed (T2FS) and diffusion-weighted imaging (DWI) series from an mpMRI study. The T2FS and DWI series in 38 studies (38 patients) were co-registered and blended together using a selective data augmentation technique, such that traits of both series were visible in the same volume. A Mask R-CNN model was subsequently trained for universal detection and segmentation of 3D LNs. RESULTS: Experiments on 18 test mpMRI studies revealed that the proposed pipeline achieved a precision of [Formula: see text]%, a sensitivity of [Formula: see text]% at 4 false positives (FP) per volume, and a Dice score of [Formula: see text]%. This represented an improvement of [Formula: see text]% in precision, [Formula: see text]% in sensitivity at 4 FP/volume, and [Formula: see text]% in Dice score, respectively, over current approaches evaluated on the same dataset. CONCLUSION: Our pipeline universally detected and segmented both metastatic and non-metastatic nodes in mpMRI studies. At test time, the input data used by the trained model could be either the T2FS series alone or a blend of co-registered T2FS and DWI series. Contrary to prior work, this eliminates the reliance on both the T2FS and DWI series in an mpMRI study.
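The blending step described in the METHODS can be sketched as an intensity-normalized alpha blend of two co-registered volumes. This is a minimal sketch under stated assumptions: the min-max normalization, the fixed alpha weight, and the function names are illustrative, not the paper's exact selective-augmentation recipe.

```python
import numpy as np

# Illustrative sketch of blending co-registered T2FS and DWI volumes so that
# traits of both series appear in a single input volume. The normalization
# scheme and the fixed alpha are assumptions, not the authors' exact method.

def minmax_normalize(vol):
    """Scale a volume to [0, 1] so the two series are intensity-comparable."""
    lo, hi = vol.min(), vol.max()
    return (vol - lo) / (hi - lo + 1e-8)

def blend_series(t2fs, dwi, alpha=0.5):
    """Alpha-blend two co-registered volumes of identical shape."""
    assert t2fs.shape == dwi.shape, "series must be co-registered to a common grid"
    return alpha * minmax_normalize(t2fs) + (1 - alpha) * minmax_normalize(dwi)

# Toy example: two random 3D "volumes" standing in for co-registered series.
rng = np.random.default_rng(0)
t2fs = rng.random((8, 64, 64))
dwi = rng.random((8, 64, 64))
blended = blend_series(t2fs, dwi, alpha=0.6)
```

Because the blended volume lives on the same grid as either input series, a detector trained on such blends can, as the abstract notes, also accept the T2FS series alone at test time.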


Subject(s)
Multiparametric Magnetic Resonance Imaging , Humans , Diffusion Magnetic Resonance Imaging/methods , Lung , Mediastinum , Lymph Nodes/diagnostic imaging , Lymph Nodes/pathology