Information extraction from medical case reports using OpenAI InstructGPT.

Sciannameo, Veronica; Pagliari, Daniele Jahier; Urru, Sara; Grimaldi, Piercesare; Ocagli, Honoria; Ahsani-Nasab, Sara; Comoretto, Rosanna Irene; Gregori, Dario; Berchialla, Paola

Affiliation

Sciannameo V; Centre for Biostatistics, Epidemiology and Public Health, Department of Clinical and Biological Sciences, University of Turin, Regione Gonzole 10, Orbassano 10043, Italy.
Pagliari DJ; Department of Control and Computer Engineering, Politecnico di Torino, Turin 10129, Italy.
Urru S; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, Padua, Italy.
Grimaldi P; Department of Public Health and Pediatrics, University of Torino, Via Santena 5 bis, Torino 10126, Italy.
Ocagli H; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, Padua, Italy.
Ahsani-Nasab S; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, Padua, Italy.
Comoretto RI; Department of Public Health and Pediatrics, University of Torino, Via Santena 5 bis, Torino 10126, Italy.
Gregori D; Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, Padua, Italy.
Berchialla P; Centre for Biostatistics, Epidemiology and Public Health, Department of Clinical and Biological Sciences, University of Turin, Regione Gonzole 10, Orbassano 10043, Italy. Electronic address: paola.berchialla@unito.it.

Comput Methods Programs Biomed; 255: 108326, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39029416

ABSTRACT

BACKGROUND AND OBJECTIVE:

Researchers commonly use automated solutions such as Natural Language Processing (NLP) systems to extract clinical information from large volumes of unstructured data. However, the poor semantic structure and domain-specific vocabulary of clinical text make it challenging to develop a one-size-fits-all solution. Large Language Models (LLMs), such as OpenAI's Generative Pre-Trained Transformer 3 (GPT-3), offer a promising way to capture and standardize unstructured clinical information. This study evaluated the performance of InstructGPT, a family of models derived from the GPT-3 LLM, in extracting relevant patient information from medical case reports, and discussed the advantages and disadvantages of LLMs versus dedicated NLP methods.

METHODS:

A search of PubMed, Scopus, and Web of Science identified 208 articles related to case reports of foreign body injuries in children. For each patient, a reviewer manually extracted the sex, the age, the object that caused the injury, and the injured body part. These manual extractions formed the gold standard against which the performance of InstructGPT was compared.

RESULTS:

InstructGPT achieved high accuracy in classifying the sex, age, object, and body part involved in the injury: 94%, 82%, 94%, and 89%, respectively. When articles for which InstructGPT could not retrieve any information were excluded, the accuracy for the child's sex and age improved to 97%, and the accuracy for the injured body part improved to 93%. InstructGPT was also able to extract information from non-English language articles.
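The two sets of figures above differ only in the denominator used. The sketch below illustrates that calculation on a handful of invented records; it is not the authors' evaluation code. The field names, the toy data, and the choice to treat a missing extraction as `None` and to drop it per field are all assumptions made for illustration.

```python
FIELDS = ("sex", "age", "object", "body_part")  # hypothetical field names


def accuracy(gold, pred, field, skip_missing=False):
    """Share of records where the extracted value equals the gold value.

    With skip_missing=True, records where nothing was extracted for the
    field (value None) are removed from the denominator. This per-field
    exclusion is an assumption about how the exclusion was applied.
    """
    pairs = [(g[field], p.get(field)) for g, p in zip(gold, pred)]
    if skip_missing:
        pairs = [(g, p) for g, p in pairs if p is not None]
    if not pairs:
        return None  # nothing left to score
    return sum(g == p for g, p in pairs) / len(pairs)


# Invented toy records, NOT data from the study.
gold = [
    {"sex": "M", "age": 3, "object": "coin", "body_part": "esophagus"},
    {"sex": "F", "age": 2, "object": "button battery", "body_part": "nose"},
    {"sex": "M", "age": 5, "object": "magnet", "body_part": "intestine"},
    {"sex": "F", "age": 1, "object": "peanut", "body_part": "bronchus"},
]
pred = [
    {"sex": "M", "age": 3, "object": "coin", "body_part": "esophagus"},
    {"sex": "F", "age": None, "object": "button battery", "body_part": "nose"},
    {"sex": "M", "age": 5, "object": "magnet", "body_part": "stomach"},
    {"sex": None, "age": None, "object": "peanut", "body_part": "bronchus"},
]

for field in FIELDS:
    overall = accuracy(gold, pred, field)
    answered = accuracy(gold, pred, field, skip_missing=True)
    print(f"{field}: all records {overall:.2f}, answered only {answered:.2f}")
```

On this toy data, accuracy rises for fields with unanswered records (sex, age) and stays the same where every miss is a wrong answer rather than an empty one (body part).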

CONCLUSIONS:

The study highlights that LLMs have the potential to eliminate the need for task-specific training (zero-shot extraction). Clinical information can be retrieved from unstructured natural language text, particularly from published scientific literature such as case reports, by using the PDF file of the article directly, without any pre-processing and without requiring any technical expertise in NLP or Machine Learning. The diverse nature of the corpus adds to the strength of the study: it includes articles written in languages other than English, and some articles contain a wide range of clinical details while others lack information.
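As a rough illustration of what zero-shot extraction means in practice, the sketch below builds an instruction prompt with no training examples and parses a structured reply. The abstract does not give the prompt or the output format the authors used, so the prompt wording, the JSON reply format, the field names, and both function names are hypothetical. The sketch also starts from plain text rather than a PDF, and it deliberately leaves out the call to the model.

```python
import json

FIELDS = ("sex", "age", "object", "body_part")  # hypothetical field names


def build_prompt(case_text: str) -> str:
    """Build a zero-shot instruction: a task description and the text, no examples."""
    keys = ", ".join(f'"{f}"' for f in FIELDS)
    return (
        "Read the following medical case report and extract the patient's details.\n"
        f"Answer with a single JSON object with the keys {keys}. "
        "Use null for anything the report does not state.\n\n"
        f"Case report:\n{case_text}\n\nJSON:"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model's reply; any field that is absent or unreadable becomes None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {f: data.get(f) for f in FIELDS}


# Usage with a made-up reply string standing in for the model output.
prompt = build_prompt("A 3-year-old boy swallowed a coin ...")
record = parse_reply('{"sex": "M", "age": 3, "object": "coin", "body_part": null}')
print(record)
```

Mapping unreadable replies to `None` keeps "no information retrieved" distinct from a wrong answer, which matters when accuracy is reported both with and without such cases.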

Subjects

Natural Language Processing; Humans; Information Storage and Retrieval/methods; Child; Algorithms; Male; Data Mining/methods; Female

Keywords

Case reports; Information retrieval; Large language model; Natural language processing

Full text: 1 | Database: MEDLINE | Main subject: Natural Language Processing | Limits: Child / Female / Humans / Male | Language: En | Publication year: 2024 | Document type: Article