Results 1 - 10 of 10

1.
Acad Emerg Med ; 2024 Apr 03.
Article in English | MEDLINE | ID: mdl-38567658

ABSTRACT

BACKGROUND: Natural language processing (NLP) tools including recently developed large language models (LLMs) have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms. OBJECTIVES: This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes. METHODS: The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific LLMs to perform the task of named entity recognition: a convolutional neural network-based model (SpaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level. RESULTS: A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. Overall F1 measure for note-level symptom identification weighted by entity frequency was 0.84 for the SpaCy model and 0.88 for the Longformer model. F1 measure for identifying presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the SpaCy model and 0.98 (240/250 correctly classified) for the Longformer model. CONCLUSIONS: The study demonstrated the utility of LLMs and transformer-based models in particular for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom on the note level, with variable performance for individual symptoms.
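To make the named entity recognition setup concrete, the sketch below shows how a spaCy NER model could be trained on annotated note snippets and then used to flag a note as symptom-positive if any entity is found. The UTI_SYMPTOM label, the example sentence, and the character offsets are illustrative placeholders, not the authors' actual annotation schema or training configuration.

```python
# Minimal sketch: train a spaCy NER model on symptom-span annotations and
# apply it at the note level. Labels and data are placeholders.
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("UTI_SYMPTOM")  # e.g., dysuria, urinary frequency, flank pain

# Hypothetical annotated data: (text, {"entities": [(start, end, label)]})
TRAIN_DATA = [
    ("Pt reports dysuria and urinary frequency x3 days.",
     {"entities": [(11, 18, "UTI_SYMPTOM"), (23, 40, "UTI_SYMPTOM")]}),
]

optimizer = nlp.initialize()
for _ in range(20):  # a real model would need far more annotated notes
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

# Note-level decision: a note is symptom-positive if any entity is detected
doc = nlp("Denies fever; complains of dysuria and flank pain.")
print(bool(doc.ents), [(ent.text, ent.label_) for ent in doc.ents])
```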

2.
J Am Coll Emerg Physicians Open ; 5(2): e13133, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38481520

ABSTRACT

Objectives: This study presents a design framework to enhance the accuracy with which large language models (LLMs), such as ChatGPT, can extract insights from clinical notes. We highlight this framework via prompt refinement for the automated determination of HEART (History, ECG, Age, Risk factors, Troponin) risk scores in chest pain evaluation. Methods: We developed a pipeline for LLM prompt testing, employing stochastic repeat testing and quantifying response errors relative to physician assessment. We evaluated the pipeline for automated HEART score determination across a limited set of 24 synthetic clinical notes representing four simulated patients. To assess whether iterative prompt design could improve the LLMs' ability to extract complex clinical concepts and apply rule-based logic to translate them to HEART subscores, we monitored diagnostic performance during prompt iteration. Results: Validation included three iterative rounds of prompt improvement for three HEART subscores with 25 repeat trials, totaling 1200 queries each for GPT-3.5 and GPT-4. For both LLM models, the rate of responses with erroneous, non-numerical subscore answers decreased from the initial to the final prompt design. Accuracy of numerical responses for HEART subscores (discrete 0-2 point scale) improved for GPT-4 from the initial to the final prompt iteration, with mean error decreasing from 0.16 to 0.10 (95% confidence interval: 0.07-0.14) points. Conclusion: We established a framework for iterative prompt design in the clinical space. Although the results indicate potential for integrating LLMs in structured clinical note analysis, translation to real, large-scale clinical data with appropriate data privacy safeguards is needed.
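The repeat-testing idea can be sketched compactly, assuming the OpenAI chat completions client: the same prompt is sent N times, a numeric subscore is parsed from each reply, and both the non-numerical response rate and the mean absolute error against a physician-assigned reference are tracked. The prompt wording, helper names, and reference value are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of stochastic repeat testing for one HEART subscore.
import re
import statistics

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Read the clinical note below and return ONLY the HEART 'History' "
    "subscore as a single integer from 0 to 2.\n\nNOTE:\n{note}"
)

def query_subscore(note: str, model: str = "gpt-4") -> int | None:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    match = re.search(r"\b[0-2]\b", resp.choices[0].message.content)
    return int(match.group()) if match else None  # None = non-numerical reply

def repeat_trial(note: str, physician_score: int, n: int = 25) -> dict:
    answers = [query_subscore(note) for _ in range(n)]
    numeric = [a for a in answers if a is not None]
    return {
        "non_numeric_rate": 1 - len(numeric) / n,
        "mean_abs_error": (
            statistics.mean(abs(a - physician_score) for a in numeric)
            if numeric else None
        ),
    }
```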

3.
J Clin Transl Sci ; 8(1): e53, 2024.
Article in English | MEDLINE | ID: mdl-38544748

ABSTRACT

Background: Incarceration is a significant social determinant of health, contributing to high morbidity, mortality, and racialized health inequities. However, incarceration status is largely invisible to health services research due to inadequate capture in the clinical electronic health record (EHR). This study aims to develop, train, and validate natural language processing (NLP) techniques to more effectively identify incarceration status in the EHR. Methods: The study population consisted of adult patients (≥ 18 years old) who presented to the emergency department between June 2013 and August 2021. The EHR database was filtered for notes containing specific incarceration-related terms, and a random selection of 1000 notes was then annotated for incarceration and further stratified into specific statuses of prior history, recent, and current incarceration. For NLP model development, 80% of the notes were used to train the Longformer-based and RoBERTa-based models. The remaining 20% of the notes underwent analysis with GPT-4. Results: There were 849 unique patients across 989 visits in the 1000 annotated notes. Manual annotation revealed that 559 of 1000 notes (55.9%) contained evidence of incarceration history. ICD-10 codes (sensitivity: 4.8%, specificity: 99.1%, F1-score: 0.09) demonstrated inferior performance to RoBERTa NLP (sensitivity: 78.6%, specificity: 73.3%, F1-score: 0.79), Longformer NLP (sensitivity: 94.6%, specificity: 87.5%, F1-score: 0.93), and GPT-4 (sensitivity: 100%, specificity: 61.1%, F1-score: 0.86). Conclusions: Our advanced NLP models demonstrate a high degree of accuracy in identifying incarceration status from clinical notes. Further research is needed to explore their scaled implementation in population health initiatives and assess their potential to mitigate health disparities through tailored system interventions.
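As a rough illustration of the transformer arm of this comparison, the sketch below fine-tunes a clinical Longformer checkpoint as a binary note classifier (incarceration evidence vs. none) with the Hugging Face Trainer. The checkpoint name, toy dataset, and hyperparameters are assumptions for illustration, not the authors' actual configuration.

```python
# Sketch: fine-tune a Longformer sequence classifier on labeled ED notes.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "yikuan8/Clinical-Longformer"  # example public checkpoint; verify before use
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-in for the 1000 annotated notes (label 1 = incarceration evidence)
notes = Dataset.from_dict({
    "text": ["Patient recently released from prison, presents with chest pain.",
             "No relevant social history documented."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

notes = notes.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="incarceration-clf",
                           per_device_train_batch_size=2,
                           num_train_epochs=3),
    train_dataset=notes,
    tokenizer=tokenizer,  # enables dynamic padding during batching
)
trainer.train()
```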

6.
J Biomed Inform ; 141: 104360, 2023 05.
Article in English | MEDLINE | ID: mdl-37061014

ABSTRACT

Physician progress notes are frequently organized into Subjective, Objective, Assessment, and Plan (SOAP) sections. The Assessment section synthesizes information recorded in the Subjective and Objective sections, and the Plan section documents tests and treatments to narrow the differential diagnosis and manage symptoms. Classifying the relationship between the Assessment and Plan sections has been suggested to provide valuable insight into clinical reasoning. In this work, we use a novel human-in-the-loop pipeline to classify the relationships between the Assessment and Plan sections of SOAP notes as part of the n2c2 2022 Track 3 Challenge. In particular, we use a clinical information model constructed from both the entailment logic expected by the Challenge and the problem-oriented medical record. This information model is used to label named entities as primary and secondary problems/symptoms, events, and complications in all four SOAP sections. We iteratively train separate named entity recognition (NER) models and use them to annotate entities in all notes and sections. We fine-tune a downstream RoBERTa-large model to classify the Assessment-Plan relationship. We evaluate multiple language model architectures, preprocessing parameters, and methods of knowledge integration, achieving a maximum macro-F1 score of 82.31%. Our initial model achieved top-2 performance during the Challenge (macro-F1: 81.52%, competitors' macro-F1 range: 74.54%-82.12%). We improved our model by incorporating post-challenge annotations of the Subjective and Objective sections, outperforming the top model from the Challenge. We also used Shapley additive explanations to investigate the extent of the language model's clinical logic through the lens of our clinical information model. We find that the model often uses shallow heuristics and nonspecific attention when making predictions, suggesting that language model knowledge integration requires further research.


Subjects
Physicians, Humans, Attention, Electronic Health Records, Records, Natural Language Processing
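A minimal sketch of the downstream relation classifier described in this record is shown below: an Assessment snippet and a Plan item are encoded jointly as a sentence pair and the fine-tuned model picks the most probable relation. The checkpoint path and the label set are placeholders; the authors' fine-tuned RoBERTa-large model and the official n2c2 label names may differ.

```python
# Sketch: classify an Assessment-Plan pair with a fine-tuned RoBERTa model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RELATIONS = ["direct", "indirect", "neither", "not relevant"]  # placeholder label set
checkpoint = "path/to/fine-tuned-roberta-large"  # hypothetical local checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(RELATIONS)
)
model.eval()

def classify_relation(assessment: str, plan_item: str) -> str:
    # The tokenizer encodes the pair as one sequence with separator tokens
    inputs = tokenizer(assessment, plan_item, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return RELATIONS[int(logits.argmax(dim=-1))]

print(classify_relation(
    "Acute decompensated heart failure, likely ischemic.",
    "Start IV furosemide; daily weights; trend troponin.",
))
```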
7.
JMIR Med Educ ; 9: e45312, 2023 02 08.
Article in English | MEDLINE | ID: mdl-36753318

ABSTRACT

BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. OBJECTIVE: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. METHODS: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on performance relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared with that of 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. RESULTS: Across the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. Logical justification for ChatGPT's answer selection was present in 100% of outputs for the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. CONCLUSIONS: ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above the 60% threshold on the NBME-Free-Step1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. Taken together, these findings make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning.
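The grading step in this kind of evaluation is simple to sketch: extract the chosen option letter from the model's free-text reply and score it against the answer key. The regular expression and example responses below are illustrative and may not match the study's actual parsing rules.

```python
# Sketch: score free-text multiple-choice answers against an answer key.
import re

def extract_choice(response: str) -> str | None:
    # First standalone letter A-E is treated as the selected option
    match = re.search(r"\b([A-E])\b", response.upper())
    return match.group(1) if match else None

def accuracy(responses: list[str], answer_key: list[str]) -> float:
    correct = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)

# Hypothetical example
responses = ["The best answer is C, because ...", "Answer: B. This patient ..."]
print(accuracy(responses, ["C", "A"]))  # -> 0.5
```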

8.
J Am Med Inform Assoc ; 28(7): 1383-1392, 2021 07 14.
Article in English | MEDLINE | ID: mdl-33822970

ABSTRACT

OBJECTIVE: To derive 7 proposed core electronic health record (EHR) use metrics across 2 healthcare systems with different EHR vendor product installations and to examine factors associated with EHR time. MATERIALS AND METHODS: A cross-sectional analysis of ambulatory physicians' EHR use across the Yale-New Haven and MedStar Health systems was performed for August 2019 using 7 proposed core EHR use metrics normalized to 8 hours of scheduled patient time. RESULTS: Five of the 7 proposed metrics could be measured in a population of nonteaching, exclusively ambulatory physicians. Among 573 physicians (Yale-New Haven N = 290, MedStar N = 283) in the analysis, median EHR-Time8 was 5.23 hours. Gender, additional clinical hours scheduled, and certain medical specialties were associated with EHR-Time8 after adjusting for age and health system on multivariable analysis. For every 8 hours of scheduled patient time, the model predicted the following differences in EHR time (P < .001, unless otherwise indicated): female physicians +0.58 hours; each additional clinical hour scheduled per month -0.01 hours; practicing cardiology -1.30 hours; medical subspecialties -0.89 hours (except gastroenterology, P = .002); neurology/psychiatry -2.60 hours; obstetrics/gynecology -1.88 hours; pediatrics -1.05 hours (P = .001); sports/physical medicine and rehabilitation -3.25 hours; and surgical specialties -3.65 hours. CONCLUSIONS: For every 8 hours of scheduled patient time, ambulatory physicians spend more than 5 hours on the EHR. Physician gender, specialty, and number of clinical hours scheduled are associated with differences in EHR time. While audit logs remain a powerful tool for understanding physician EHR use, additional transparency, granularity, and standardization of vendor-derived EHR use data definitions are still necessary to standardize EHR use measurement.


Subjects
Medicine, Physicians, Child, Cross-Sectional Studies, Electronic Health Records, Feasibility Studies, Female, Humans
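The EHR-Time8 normalization itself is a small calculation: total EHR time is scaled to an 8-hour block of scheduled patient time so physicians with different schedules can be compared. A sketch with made-up numbers follows.

```python
# Sketch of the EHR-Time8 normalization (illustrative numbers only).
def ehr_time8(ehr_hours: float, scheduled_patient_hours: float) -> float:
    """EHR hours per 8 hours of scheduled patient time."""
    return ehr_hours / scheduled_patient_hours * 8

# e.g., 65.4 EHR hours logged against 100 scheduled patient hours in a month
print(round(ehr_time8(65.4, 100.0), 2))  # -> 5.23
```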
9.
Stud Health Technol Inform ; 264: 1590-1591, 2019 Aug 21.
Article in English | MEDLINE | ID: mdl-31438246

ABSTRACT

Epilepsy is a serious neurological disorder that affects nearly 60 million individuals worldwide and is characterized by repeated seizures. Graph-theoretic approaches have been developed to analyze the functional brain networks that underpin the epileptogenic network. We have developed a Web-based application that enables neuroscientists to process high-resolution stereotactic electroencephalogram (SEEG) signal data and compute various kinds of signal coupling measures, through an intuitive user interface, for the study of epilepsy seizure networks. Results of a systematic evaluation of this new application show that it scales with increasing volumes of signal data.


Subjects
Electroencephalography, Epilepsy, Brain, Humans, Internet, Signal Processing, Computer-Assisted
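One example of the kind of signal coupling measure such an application might expose is magnitude-squared coherence between channel pairs, assembled into a coupling matrix. The sketch below uses SciPy on random data; the sampling rate, frequency band, and channel count are assumptions, and the application's actual measures may differ.

```python
# Sketch: pairwise coherence-based coupling matrix for SEEG channels.
import numpy as np
from scipy.signal import coherence

fs = 1000.0                                  # assumed sampling rate (Hz)
signals = np.random.randn(8, 60 * int(fs))   # 8 channels, 60 s of fake data

n_ch = signals.shape[0]
coupling = np.zeros((n_ch, n_ch))
for i in range(n_ch):
    for j in range(i + 1, n_ch):
        f, cxy = coherence(signals[i], signals[j], fs=fs, nperseg=1024)
        band = (f >= 4) & (f <= 30)          # example frequency band of interest
        coupling[i, j] = coupling[j, i] = cxy[band].mean()

print(coupling.round(2))
```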
10.
AMIA Annu Symp Proc ; 2017: 1705-1714, 2017.
Article in English | MEDLINE | ID: mdl-29854241

ABSTRACT

Scientific reproducibility is critical for biomedical research, as it enables us to advance science by building on previous results, helps ensure the success of increasingly expensive drug trials, and allows funding agencies to make informed decisions. However, there is a growing "crisis" of reproducibility, as evidenced by a recent Nature survey of more than 1500 researchers that found that 70% of researchers were not able to replicate results from other research groups and more than 50% were not able to reproduce their own research results. In 2016, the National Institutes of Health (NIH) announced the "Rigor and Reproducibility" guidelines to support reproducibility in biomedical research. A key component of these guidelines is the recording and analysis of "provenance" information, which describes the origin or history of data and plays a central role in ensuring scientific reproducibility. As part of the NIH Big Data to Knowledge (BD2K)-funded data provenance project, we have developed a new informatics framework called Provenance for Clinical and Healthcare Research (ProvCaRe) to extract, model, and analyze provenance information from published literature describing research studies. Using sleep medicine research studies that have made their data available through the National Sleep Research Resource (NSRR), we developed an automated pipeline to identify and extract provenance metadata from published literature, which is made available for analysis in the ProvCaRe knowledgebase. NSRR is the largest repository of sleep data, comprising over 40,000 studies involving 36,000 participants, and we used 75 published articles describing 6 research studies to populate the ProvCaRe knowledgebase. We evaluated the ProvCaRe knowledgebase, which contains 28,474 "provenance triples," using hypothesis-driven queries to identify and rank research studies based on the provenance information extracted from published articles.


Subjects
Biomedical Research/standards, Knowledge Bases, Metadata, Reproducibility of Results, Algorithms, Biological Ontologies, Guidelines as Topic, Health Services Research/standards, Humans, National Institutes of Health (U.S.), Semantics, Sleep, United States
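To illustrate what "provenance triples" can look like, the sketch below models a study, the data it used, and the resulting article with the W3C PROV-O vocabulary in rdflib and runs a simple hypothesis-style query. The identifiers and the ex: namespace are invented for illustration and are not drawn from the ProvCaRe knowledgebase.

```python
# Sketch: toy provenance triples and a hypothesis-driven SPARQL query.
from rdflib import Graph, Namespace
from rdflib.namespace import PROV, RDF

EX = Namespace("http://example.org/provcare/")
g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

study, cohort, article, team = (EX["study-001"], EX["nsrr-cohort"],
                                EX["article-001"], EX["research-team"])

g.add((study, RDF.type, PROV.Activity))      # the research study
g.add((cohort, RDF.type, PROV.Entity))       # the dataset it used
g.add((article, RDF.type, PROV.Entity))      # the published report
g.add((team, RDF.type, PROV.Agent))
g.add((study, PROV.used, cohort))
g.add((article, PROV.wasGeneratedBy, study))
g.add((article, PROV.wasAttributedTo, team))

# Which studies used the cohort of interest?
query = "SELECT ?study WHERE { ?study prov:used ex:nsrr-cohort . }"
for row in g.query(query):
    print(row.study)
```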