Results 1 - 20 of 7,162
1.
Sci Data ; 11(1): 982, 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39251610

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
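
As an illustrative aside (not part of the abstract), entity-level F1 scores of the kind reported above are typically computed by strict comparison of predicted and gold (span, type) annotations; a minimal sketch with invented data:

```python
# Illustrative sketch: strict entity-level precision/recall/F1 as commonly
# used to score NER. Spans and types below are hypothetical examples, not
# data from EnzChemRED.

def entity_f1(gold, pred):
    """Strict match: a prediction counts only if span and type both agree."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [(0, 7, "CHEMICAL"), (12, 20, "PROTEIN"), (30, 38, "CHEMICAL")]
pred = [(0, 7, "CHEMICAL"), (12, 20, "PROTEIN"), (40, 45, "CHEMICAL")]
p, r, f = entity_f1(gold, pred)
```

Here two of three predictions match exactly, giving precision, recall, and F1 of 2/3 each.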


Subjects
Enzymes, Natural Language Processing, Enzymes/chemistry, PubMed, Protein Databases, Knowledge Bases
2.
BMC Med Inform Decis Mak ; 24(Suppl 5): 262, 2024 Sep 17.
Article in English | MEDLINE | ID: mdl-39289714

ABSTRACT

BACKGROUND: Applying graph convolutional networks (GCN) to the classification of free-form natural language texts, leveraged by graph-of-words features (TextGCN), has been studied and confirmed to be an effective means of describing complex natural language texts. However, text classification models based on TextGCN possess weaknesses in terms of memory consumption and model dissemination and distribution. In this paper, we present a fast message passing network (FastMPN), implementing a GCN with a message passing architecture that provides versatility and flexibility by allowing trainable node embeddings and edge weights, helping the GCN model find a better solution. We applied the FastMPN model to the task of clinical information extraction from cancer pathology reports, extracting the following six properties: main site, subsite, laterality, histology, behavior, and grade. RESULTS: We evaluated the clinical task performance of the FastMPN models in terms of micro- and macro-averaged F1 scores. A comparison was performed with the multi-task convolutional neural network (MT-CNN) model. Results show that the FastMPN model is equivalent to or better than the MT-CNN. CONCLUSIONS: Our implementation revealed that our FastMPN model, which is based on the PyTorch platform, can train on a large corpus (667,290 training samples) with 202,373 unique words in less than 3 minutes per epoch using one NVIDIA V100 hardware accelerator. Our experiments demonstrated that, using this implementation, the clinical task performance scores for information extraction related to tumors from cancer pathology reports were highly competitive.
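
To make the message-passing idea concrete, here is a hedged, pure-Python sketch of a single aggregation step over a graph-of-words; node names, feature sizes, and edge weights are invented and this is not the FastMPN code itself:

```python
# Minimal sketch of one message-passing step: each node sums neighbour
# features weighted by (in a real model, trainable) edge weights, plus its
# own features as a self-loop. Illustrates the architecture family only.

def message_pass(features, edges):
    """features: {node: [float]}, edges: {(src, dst): weight}.
    Returns new features: own features plus weighted neighbour messages."""
    dim = len(next(iter(features.values())))
    out = {n: list(v) for n, v in features.items()}  # start with self-loop
    for (src, dst), w in edges.items():
        for i in range(dim):
            out[dst][i] += w * features[src][i]
    return out

feats = {"tumor": [1.0, 0.0], "grade": [0.0, 1.0]}
edges = {("tumor", "grade"): 0.5}
new = message_pass(feats, edges)
```

In a trained network this step is repeated and interleaved with learned transformations; the sketch shows only the aggregation that distinguishes message passing from a fixed-adjacency TextGCN layer.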


Subjects
Natural Language Processing, Neoplasms, Neural Networks (Computer), Humans, Neoplasms/classification, Data Mining
3.
Bioinformatics ; 40(Suppl 2): ii45-ii52, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230709

ABSTRACT

MOTIVATION: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. RESULTS: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%). AVAILABILITY AND IMPLEMENTATION: All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
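
The interplay of a dictionary with corpus-wide and document-specific block lists can be sketched as follows; the dictionary entries, block lists, and function names are invented for illustration and are not the tagger's actual API:

```python
# Sketch of dictionary-based NER with two block lists, in the spirit of the
# approach described above: a corpus-wide list for names that are ambiguous
# everywhere, and a per-document list for names ambiguous only in context.

DICTIONARY = {"p53": "gene", "aspirin": "chemical", "ice": "chemical"}
BLOCKED_GLOBAL = {"ice"}              # too ambiguous in any document
BLOCKED_PER_DOC = {"doc2": {"p53"}}   # ambiguous only in specific documents

def tag(doc_id, text):
    hits = []
    for token in text.lower().split():
        token = token.strip(".,;")
        if token in BLOCKED_GLOBAL or token in BLOCKED_PER_DOC.get(doc_id, set()):
            continue
        if token in DICTIONARY:
            hits.append((token, DICTIONARY[token]))
    return hits

hits1 = tag("doc1", "Aspirin inhibits p53 on ice.")
hits2 = tag("doc2", "p53 here is ambiguous.")
```

In the first document "ice" is suppressed by the global list while "aspirin" and "p53" are tagged; in the second, the document-specific entry blocks "p53".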


Subjects
Deep Learning, Factual Databases, Dictionaries as Topic, Computational Biology/methods, Natural Language Processing, Data Mining/methods
4.
Front Public Health ; 12: 1401240, 2024.
Article in English | MEDLINE | ID: mdl-39281082

ABSTRACT

Aphasia is a language disorder caused by brain injury that often results in difficulties with speech production and comprehension, significantly impacting the affected individuals' lives. Recently, artificial intelligence (AI) has been advancing in medical research. Utilizing machine learning and related technologies, AI develops sophisticated algorithms and predictive models, and can employ tools such as speech recognition and natural language processing to autonomously identify and analyze language deficits in individuals with aphasia. These advancements provide new insights and methods for assessing and treating aphasia. This article explores current AI-supported assessment and treatment approaches for aphasia and highlights key application areas. It aims to uncover how AI can enhance the process of assessment, tailor therapeutic interventions, and track the progress and outcomes of rehabilitation efforts. The article also addresses the current limitations of AI's application in aphasia and discusses prospects for future research.


Assuntos
Afasia , Inteligência Artificial , Humanos , Afasia/reabilitação , Processamento de Linguagem Natural , Aprendizado de Máquina
5.
Radiology ; 312(3): e240885, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39287525

ABSTRACT

Background The specialization and complexity of radiology makes the automatic generation of radiologic impressions (ie, a diagnosis with differential diagnosis and management recommendations) challenging. Purpose To develop a large language model (LLM) that generates impressions based on imaging findings and to evaluate its performance in professional and linguistic dimensions. Materials and Methods Six radiologists recorded imaging examination findings from August 2 to 31, 2023, at Shanghai General Hospital and used the developed LLM before routinely writing report impressions for multiple radiologic modalities (CT, MRI, radiography, mammography) and anatomic sites (cranium and face, neck, chest, upper abdomen, lower abdomen, vessels, bone and joint, spine, breast), making necessary corrections and completing the radiologic impression. A subset was defined to investigate cases where the LLM-generated impressions differed from the final radiologist impressions by excluding identical and highly similar cases. An expert panel scored the LLM-generated impressions on a five-point Likert scale (5 = strongly agree) based on scientific terminology, coherence, specific diagnosis, differential diagnosis, management recommendations, correctness, comprehensiveness, harmlessness, and lack of bias. Results In this retrospective study, an LLM was pretrained using 20 GB of medical and general-purpose text data. The fine-tuning data set comprised 1.5 GB of data, including 800 radiology reports with paired instructions (describing the output task in natural language) and outputs. Test set 2 included data from 3988 patients (median age, 56 years [IQR, 40-68 years]; 2159 male). The median recall, precision, and F1 score of LLM-generated impressions were 0.775 (IQR, 0.56-1), 0.84 (IQR, 0.611-1), and 0.772 (IQR, 0.578-0.957), respectively, using the final impressions as the reference standard. 
In a subset of 1014 patients (median age, 57 years [IQR, 42-69 years]; 528 male), the overall median expert panel score for LLM-generated impressions was 5 (IQR, 5-5), ranging from 4 (IQR, 3-5) to 5 (IQR, 5-5). Conclusion The developed LLM generated radiologic impressions that were professionally and linguistically appropriate for a full spectrum of radiology examinations. © RSNA, 2024 Supplemental material is available for this article.
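
One simple way per-report recall, precision, and F1 scores of this general kind can be computed is by token overlap between a generated and a reference impression; this is a hedged sketch with invented text, not the study's actual metric implementation:

```python
# Illustrative token-overlap precision/recall/F1 between one generated
# impression and its reference. Per-report scores like these can then be
# summarized as medians with IQRs. Example strings are invented.
from collections import Counter

def overlap_prf(reference, generated):
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((ref & gen).values())          # multiset intersection
    p = overlap / sum(gen.values()) if gen else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = overlap_prf("no acute intracranial abnormality",
                      "no acute abnormality")
```

Here every generated token appears in the reference (precision 1.0) but one reference token is missed (recall 0.75), giving F1 of 6/7.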


Subjects
Diagnostic Imaging, Humans, Retrospective Studies, Male, Female, Middle Aged, Adult, Aged, Differential Diagnosis, Diagnostic Imaging/methods, Natural Language Processing
6.
Perm J ; 28(3): 23-36, 2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39219312

ABSTRACT

INTRODUCTION: Rapid identification of individuals developing a psychotic spectrum disorder (PSD) is crucial because untreated psychosis is associated with poor outcomes and decreased treatment response. Lack of recognition of early psychotic symptoms often delays diagnosis, further worsening these outcomes. METHODS: The proposed study is a cross-sectional, retrospective analysis of electronic health record data including clinician documentation and patient-clinician secure messages for patients aged 15-29 years with ≥ 1 primary care encounter between 2017 and 2019 within 2 Kaiser Permanente regions. Patients with new-onset PSD will be distinguished from those without a diagnosis if they have ≥ 1 PSD diagnosis within 12 months following the primary care encounter. The prediction model will be trained using a trisourced natural language processing feature extraction design and validated both within each region separately and in a modified combined sample. DISCUSSION: This proposed model leverages the strengths of the large volume of patient-specific data from an integrated electronic health record with natural language processing to identify patients at elevated chance of developing a PSD. This project carries the potential to reduce the duration of untreated psychosis and thereby improve long-term patient outcomes.


Assuntos
Registros Eletrônicos de Saúde , Aprendizado de Máquina , Processamento de Linguagem Natural , Transtornos Psicóticos , Humanos , Transtornos Psicóticos/diagnóstico , Adolescente , Estudos Retrospectivos , Adulto , Adulto Jovem , Estudos Transversais , Masculino , Feminino , Atenção Primária à Saúde
7.
JCO Clin Cancer Inform ; 8: e2400134, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39265122

ABSTRACT

PURPOSE: Data on end-of-life care (EOLC) quality, assessed through evidence-based quality measures (QMs), are difficult to obtain. Natural language processing (NLP) enables efficient quality measurement and is not yet used for children with serious illness. We sought to validate a pediatric-specific EOLC-QM keyword library and evaluate EOLC-QM attainment among childhood cancer decedents. METHODS: In a single-center cohort of children with cancer who died between 2014 and 2022, we piloted a rule-based NLP approach to examine the content of clinical notes in the last 6 months of life. We identified documented discussions of five EOLC-QMs: goals of care, limitations to life-sustaining treatments (LLST), hospice, palliative care consultation, and preferred location of death. We assessed performance of NLP methods, compared with gold standard manual chart review. We then used NLP to characterize proportions of decedents with documented EOLC-QM discussions and timing of first documentation relative to death. RESULTS: Among 101 decedents, nearly half were minorities (Hispanic/Latinx [24%], non-Hispanic Black/African American [20%]), female (48%), or diagnosed with solid tumors (43%). Through iterative refinement, our keyword library achieved robust performance statistics (for all EOLC-QMs, F1 score = 1.0). Most decedents had documented discussions regarding goals of care (83%), LLST (83%), and hospice (74%). Fewer decedents had documented discussions regarding palliative care consultation (49%) or preferred location of death (36%). For all five EOLC-QMs, first documentation occurred, on average, >30 days before death. CONCLUSION: A high proportion of decedents attained specified EOLC-QMs more than 30 days before death. Our findings indicate that NLP is a feasible approach to measuring quality of care for children with cancer at the end of life and is ripe for multi-center research and quality improvement.
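
The rule-based, keyword-library approach described above can be sketched as a simple matcher: each quality measure owns a small set of phrases, and a note "documents" a measure if any phrase appears. The keywords and measure names below are invented examples, not the validated pediatric library:

```python
# Hedged sketch of rule-based keyword matching for quality measures.
# Real libraries are iteratively refined against manual chart review.
import re

KEYWORDS = {
    "goals_of_care": ["goals of care", "goc discussion"],
    "hospice": ["hospice"],
    "palliative_care": ["palliative care consult"],
}

def measures_documented(note):
    """Return the set of quality measures whose keywords appear in the note."""
    text = note.lower()
    return {qm for qm, kws in KEYWORDS.items()
            if any(re.search(re.escape(k), text) for k in kws)}

found = measures_documented(
    "Family meeting held to review goals of care; hospice referral placed.")
```

`re.escape` keeps multi-word phrases literal; in practice such libraries are validated against a manually reviewed gold standard, as the study did.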


Assuntos
Processamento de Linguagem Natural , Neoplasias , Assistência Terminal , Humanos , Assistência Terminal/normas , Assistência Terminal/métodos , Neoplasias/terapia , Neoplasias/mortalidade , Criança , Feminino , Masculino , Pré-Escolar , Adolescente , Qualidade da Assistência à Saúde , Lactente , Cuidados Paliativos/métodos , Cuidados Paliativos/normas , Recém-Nascido , Registros Eletrônicos de Saúde
8.
Semin Vasc Surg ; 37(3): 314-320, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39277347

ABSTRACT

Natural language processing is a subfield of artificial intelligence that aims to analyze human oral or written language. The development of large language models has brought innovative perspectives in medicine, including the potential use of chatbots and virtual assistants. Nevertheless, the benefits and pitfalls of such technology need to be carefully evaluated before their use in health care. The aim of this narrative review was to provide an overview of potential applications of large language models and artificial intelligence chatbots in the field of vascular surgery, including clinical practice, research, and education. In light of the results, we discuss current limits and future directions.


Assuntos
Inteligência Artificial , Processamento de Linguagem Natural , Procedimentos Cirúrgicos Vasculares , Humanos
9.
J Biomed Semantics ; 15(1): 17, 2024 Sep 15.
Article in English | MEDLINE | ID: mdl-39277770

ABSTRACT

BACKGROUND: Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this, and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline. METHODS: We created 200 synthetic clinic letters based on hospital outpatient consultations with epilepsy specialists. The letters were double annotated by trained clinicians and researchers according to agreed guidelines. We used the annotation tool, Markup, with an epilepsy concept list based on the Unified Medical Language System ontology. All annotations were reviewed, and a gold standard set of annotations was agreed and used to validate the performance of ExECTv2. RESULTS: The overall inter-annotator agreement (IAA) between the two sets of annotations produced a per item F1 score of 0.73. Validating ExECTv2 using the gold standard gave an overall F1 score of 0.87 per item, and 0.90 per letter. CONCLUSION: The synthetic letters, annotations, and annotation guidelines have been made freely available. To our knowledge, this is the first publicly available set of annotated epilepsy clinic letters and guidelines that can be used for NLP researchers with minimum epilepsy knowledge. The IAA results show that clinical text annotation tasks are difficult and require a gold standard to be arranged by researcher consensus. The results for ExECTv2, our automated epilepsy NLP pipeline, extracted detailed epilepsy information from unstructured epilepsy letters with more accuracy than human annotators, further confirming the utility of NLP for clinical and research applications.


Assuntos
Epilepsia , Processamento de Linguagem Natural , Humanos , Curadoria de Dados/métodos
10.
JMIR Infodemiology ; 4: e51156, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39269743

ABSTRACT

BACKGROUND: The growing availability of big data spontaneously generated by social media platforms allows us to leverage natural language processing (NLP) methods as valuable tools to understand the opioid crisis. OBJECTIVE: We aimed to understand how NLP has been applied to Reddit (Reddit Inc) data to study opioid use. METHODS: We systematically searched for peer-reviewed studies and conference abstracts in PubMed, Scopus, PsycINFO, ACL Anthology, IEEE Xplore, and Association for Computing Machinery data repositories up to July 19, 2022. Inclusion criteria were studies investigating opioid use, using NLP techniques to analyze the textual corpora, and using Reddit as the social media data source. We were specifically interested in mapping studies' overarching goals and findings, methodologies and software used, and main limitations. RESULTS: In total, 30 studies were included, which were classified into 4 non-mutually exclusive overarching goal categories: methodological (n=6, 20% of studies), infodemiology (n=22, 73% of studies), infoveillance (n=7, 23% of studies), and pharmacovigilance (n=3, 10% of studies). NLP methods were used to identify content relevant to opioid use among vast quantities of textual data, to establish potential relationships between opioid use patterns or profiles and contextual factors or comorbidities, and to anticipate individuals' transitions between different opioid-related subreddits, likely revealing progression through opioid use stages. Most studies used an embedding technique (12/30, 40%), prediction or classification approach (12/30, 40%), topic modeling (9/30, 30%), and sentiment analysis (6/30, 20%). The most frequently used programming languages were Python (20/30, 67%) and R (2/30, 7%). Among the studies that reported limitations (20/30, 67%), the most cited was the uncertainty regarding whether redditors participating in these forums were representative of people who use opioids (8/20, 40%).
The papers were very recent (28/30, 93%), from 2019 to 2022, with authors from a range of disciplines. CONCLUSIONS: This scoping review identified a wide variety of NLP techniques and applications used to support surveillance and social media interventions addressing the opioid crisis. Despite the clear potential of these methods to enable the identification of opioid-relevant content in Reddit and its analysis, there are limits to the degree of interpretive meaning that they can provide. Moreover, we identified the need for standardized ethical guidelines to govern the use of Reddit data to safeguard the anonymity and privacy of people using these forums.


Assuntos
Processamento de Linguagem Natural , Mídias Sociais , Humanos , Transtornos Relacionados ao Uso de Opioides/epidemiologia , Analgésicos Opioides/efeitos adversos , Analgésicos Opioides/uso terapêutico
11.
BMC Bioinformatics ; 25(1): 301, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39272021

ABSTRACT

Transformer-based large language models (LLMs) are well suited to biological sequence data because of analogies to natural language. Complex relationships can be learned because a concept of "words" can be generated through tokenization. When trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology, and it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful for evaluating the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context; instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where larger sequence context is of less relevance but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
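
Overlapping k-mer tokenization of the kind DNABERT uses can be sketched as a sliding window of size k with stride 1; the helper name is illustrative:

```python
# Sketch of overlapping k-mer tokenization for DNA language models:
# a window of size k slides one base at a time, so adjacent tokens
# share k-1 characters.

def kmer_tokenize(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize("ATGGCA", k=3)
```

Because adjacent tokens share k-1 bases, a token's identity nearly determines its neighbours' prefixes; that is exactly the information leakage the study's non-overlapping next-token benchmark is designed to avoid.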


Assuntos
DNA , Humanos , DNA/química , Genoma Humano , Análise de Sequência de DNA/métodos , Processamento de Linguagem Natural , Biologia Computacional/métodos
12.
Bioinformatics ; 40(9)2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39222004

ABSTRACT

MOTIVATION: Natural language is poised to become a key medium for human-machine interactions in the era of large language models. In the field of biochemistry, tasks such as property prediction and molecule mining are critically important yet technically challenging. Bridging molecular expressions in natural language and chemical language can significantly enhance the interpretability and ease of these tasks. Moreover, it can integrate chemical knowledge from various sources, leading to a deeper understanding of molecules. RESULTS: Recognizing these advantages, we introduce the concept of conversational molecular design, a novel task that uses natural language to describe and edit target molecules. To better accomplish this task, we develop ChatMol, a knowledgeable and versatile generative pretrained model. This model is enhanced by incorporating experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages. Several typical solutions, including large language models (e.g. ChatGPT), are evaluated, demonstrating the difficulty of conversational molecular design and the effectiveness of our knowledge-enhancement approach. Case observations and analysis offer insights and directions for further exploration of natural-language interaction in molecular discovery. AVAILABILITY AND IMPLEMENTATION: Codes and data are provided at https://github.com/Ellenzzn/ChatMol/tree/main.


Assuntos
Processamento de Linguagem Natural , Humanos , Software , Biologia Computacional/métodos
13.
J Biomed Inform ; 157: 104720, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39233209

ABSTRACT

BACKGROUND: In oncology, electronic health records contain textual key information for the diagnosis, staging, and treatment planning of patients with cancer. However, text data processing requires a lot of time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research. Particularly, extracting the information required for the pathological stage from surgical pathology reports can be utilized to update cancer staging according to the latest cancer staging guidelines. OBJECTIVES: This study has two main objectives. The first objective is to evaluate the performance of extracting information from text-based surgical pathology reports and determining pathological stages based on the extracted information using fine-tuned generative language models (GLMs) for patients with lung cancer. The second objective is to determine the feasibility of utilizing relatively small GLMs for information extraction in a resource-constrained computing environment. METHODS: Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard with validation by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs which then were used to extract information required for staging from pathology reports. RESULTS: We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. 
The Deductive Mistral-7B model, which was pre-trained with the deductive dataset, showed the best performance overall, with an exact match ratio of 92.24% in the information extraction problem and an accuracy of 0.9876 (predicting T and N classification concurrently) in classification. CONCLUSION: This study demonstrated that training GLMs with deductive datasets can improve information extraction performance, and GLMs with a relatively small number of parameters at approximately seven billion can achieve high performance in this problem. The proposed GLM-based information extraction method is expected to be useful in clinical decision-making support, lung cancer staging and research.
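
An exact match ratio of the kind quoted above is simply the fraction of reports whose entire extracted record equals the gold-standard record; a hedged sketch with invented TN descriptors:

```python
# Illustrative exact-match-ratio computation for structured information
# extraction. Records below are invented examples, not study data.

def exact_match_ratio(gold_records, extracted_records):
    """Fraction of reports where the whole extracted record equals the gold one."""
    matches = sum(g == e for g, e in zip(gold_records, extracted_records))
    return matches / len(gold_records)

gold = [{"T": "T2a", "N": "N0"}, {"T": "T1b", "N": "N1"}, {"T": "T3", "N": "N0"}]
pred = [{"T": "T2a", "N": "N0"}, {"T": "T1b", "N": "N2"}, {"T": "T3", "N": "N0"}]
ratio = exact_match_ratio(gold, pred)
```

One mismatched N descriptor out of three reports yields a ratio of 2/3; the metric is all-or-nothing per report, which makes the 92.24% figure above a strict standard.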


Assuntos
Neoplasias Pulmonares , Processamento de Linguagem Natural , Estadiamento de Neoplasias , Neoplasias Pulmonares/patologia , Neoplasias Pulmonares/diagnóstico , Humanos , Estadiamento de Neoplasias/métodos , Registros Eletrônicos de Saúde , Mineração de Dados/métodos , Algoritmos , Bases de Dados Factuais
14.
Clin Imaging ; 114: 110271, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39236553

ABSTRACT

The advent of large language models (LLMs) marks a transformative leap in natural language processing, offering unprecedented potential in radiology, particularly in enhancing the accuracy and efficiency of coronary artery disease (CAD) diagnosis. While previous studies have explored the capabilities of specific LLMs like ChatGPT in cardiac imaging, a comprehensive evaluation comparing multiple LLMs in the context of CAD-RADS 2.0 has been lacking. This study addresses this gap by assessing the performance of various LLMs, including ChatGPT 4, ChatGPT 4o, Claude 3 Opus, Gemini 1.5 Pro, Mistral Large, Meta Llama 3 70B, and Perplexity Pro, in answering 30 multiple-choice questions derived from the CAD-RADS 2.0 guidelines. Our findings reveal that ChatGPT 4o achieved the highest accuracy at 100%, with ChatGPT 4 and Claude 3 Opus closely following at 96.6%. Other models, including Mistral Large, Perplexity Pro, Meta Llama 3 70B, and Gemini 1.5 Pro, also demonstrated commendable performance, though with slightly lower accuracy ranging from 90% to 93.3%. This study underscores the proficiency of current LLMs in understanding and applying CAD-RADS 2.0, suggesting their potential to significantly enhance radiological reporting and patient care in coronary artery disease. The variations in model performance highlight the need for further research, particularly in evaluating the visual diagnostic capabilities of LLMs, a critical component of radiology practice. This study provides a foundational comparison of LLMs in CAD-RADS 2.0 and sets the stage for future investigations into their broader applications in radiology, emphasizing the importance of integrating both text-based and visual knowledge for optimal clinical outcomes.


Assuntos
Angiografia por Tomografia Computadorizada , Angiografia Coronária , Doença da Artéria Coronariana , Processamento de Linguagem Natural , Humanos , Angiografia por Tomografia Computadorizada/métodos , Doença da Artéria Coronariana/diagnóstico por imagem , Angiografia Coronária/métodos , Reprodutibilidade dos Testes
15.
Database (Oxford) ; 20242024 Sep 10.
Article in English | MEDLINE | ID: mdl-39259689

ABSTRACT

This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with Bidirectional Long Short-Term Memory and conditional random field layers on an augmented train set, achieving an F1 score of 0.73. Entity linking is performed via a hybrid approach with dictionaries, generating candidates from a knowledge base containing Unified Medical Language System aliases using the cross-lingual SapBERT and reranking the top candidates using GPT-3.5. The entity linking approach shows consistent results across multiple languages, with 0.73 accuracy on the SympTEMIST multilingual dataset, and also achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask. Database URL: https://github.com/svassileva/symptemist-multilingual-linking.
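
The generate-then-rerank entity-linking pattern described above can be sketched in two stages: cheap candidate generation over an alias dictionary, then a stronger reranker over the top candidates. The aliases, CUIs, and the stub reranker below are invented stand-ins (a real system would use SapBERT embeddings and an LLM reranker):

```python
# Hedged sketch of two-stage entity linking: dictionary-based candidate
# generation by string similarity, then reranking of the shortlist.
from difflib import SequenceMatcher

ALIASES = {"C0018681": ["headache", "cephalalgia"],
           "C0015967": ["fever", "pyrexia"]}

def candidates(mention, n=2):
    """Top-n concepts by best alias similarity (the cheap first stage)."""
    scored = []
    for cui, names in ALIASES.items():
        best = max(SequenceMatcher(None, mention.lower(), a).ratio() for a in names)
        scored.append((best, cui))
    return [cui for _, cui in sorted(scored, reverse=True)[:n]]

def rerank(mention, cands):
    """Stub reranker: prefer concepts with an alias sharing the mention's prefix."""
    return sorted(cands, key=lambda c: not any(
        a.startswith(mention[:4].lower()) for a in ALIASES[c]))

top = rerank("Fever", candidates("Fever"))[0]
```

The design point is the split itself: the first stage must be fast enough to scan the whole alias knowledge base, while the expensive scorer only ever sees a short candidate list.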


Assuntos
Multilinguismo , Humanos , Processamento de Linguagem Natural , Unified Medical Language System
16.
PLoS One ; 19(9): e0310432, 2024.
Article in English | MEDLINE | ID: mdl-39259727

ABSTRACT

Real-world data (RWD) in the medical field, such as electronic health records (EHRs) and medication orders, are receiving increasing attention from researchers and practitioners. While structured data have played a vital role thus far, unstructured data represented by text (e.g., discharge summaries) are not effectively utilized because of the difficulty in extracting medical information. We evaluated the information gained by supplementing structured data with clinical concepts extracted from unstructured text by leveraging natural language processing techniques. Using a machine learning-based pretrained named entity recognition tool, we extracted disease and medication names from real discharge summaries in a Japanese hospital and linked them to medical concepts using medical term dictionaries. By comparing the diseases and medications mentioned in the text with medical codes in tabular diagnosis records, we found that: (1) the text data contained richer information on patient symptoms than tabular diagnosis records, whereas the medication-order table stored more injection data than text. In addition, (2) extractable information regarding specific diseases showed surprisingly small intersections among text, diagnosis records, and medication orders. Text data can thus be a useful supplement for RWD mining, which is further demonstrated by (3) our practical application system for drug safety evaluation, which exhaustively visualizes suspicious adverse drug effects caused by the simultaneous use of anticancer drug pairs. We conclude that proper use of textual information extraction can lead to better outcomes in medical RWD mining.


Assuntos
Registros Eletrônicos de Saúde , Processamento de Linguagem Natural , Neoplasias , Humanos , Japão , Neoplasias/tratamento farmacológico , Mineração de Dados/métodos , Aprendizado de Máquina , População do Leste Asiático
17.
Sci Data ; 11(1): 992, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39261552

ABSTRACT

Currently, the field of neurobiology of language is based on data from only a few Indo-European languages. The majority of this data comes from younger adults neglecting other age groups. Here we present a multimodal database which consists of task-based and resting state fMRI, structural MRI, and EEG data while participants over 65 years old listened to sections of the story The Little Prince in Cantonese. We also provide data on participants' language history, lifetime experiences, linguistic and cognitive skills. Audio and text annotations, including time-aligned speech segmentation and prosodic information, as well as word-by-word predictors such as frequency and part-of-speech tagging derived from natural language processing (NLP) tools are included in this database. Both MRI and EEG data diagnostics revealed that the data has good quality. This multimodal database could advance our understanding of spatiotemporal dynamics of language comprehension in the older population and help us study the effects of healthy aging on the relationship between brain and behaviour.
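
Word-by-word predictors of the kind bundled with the database can be derived per token; a hedged sketch where the part-of-speech table is a toy stand-in for a real NLP tagger and the frequency measure is relative log frequency within the text:

```python
# Illustrative derivation of per-word predictors (log frequency and POS tag)
# from a tokenized text. TOY_POS is a hypothetical lookup, not a real tagger.
import math
from collections import Counter

TOY_POS = {"the": "DET", "prince": "NOUN", "laughed": "VERB"}

def word_predictors(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return [{"word": t,
             "log_freq": math.log(counts[t] / total),
             "pos": TOY_POS.get(t, "X")}
            for t in tokens]

preds = word_predictors(["the", "prince", "laughed"])
```

In practice such predictors come from corpus frequency norms and trained taggers, time-aligned to the audio, but the per-word table structure is the same.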


Assuntos
Eletroencefalografia , Idioma , Imageamento por Ressonância Magnética , Humanos , Idoso , Encéfalo/diagnóstico por imagem , Encéfalo/fisiologia , Hong Kong , Processamento de Linguagem Natural , Bases de Dados Factuais
18.
Sci Rep ; 14(1): 21183, 2024 09 11.
Artigo em Inglês | MEDLINE | ID: mdl-39261578

RESUMO

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes and relies heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Leveraging MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a comprehensive collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also unveiling 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the brain tissue marker gene list compiled by MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with the original studies. CONCLUSIONS: Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, providing a systematic demonstration of the transformative potential of this approach. The 27,323 manually reviewed sentences used to train MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.
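One plausible way to compute completeness and accuracy figures like those reported above is as set overlap between extracted and curated (cell type, marker gene) pairs. This is a hedged sketch of that definition, not the paper's exact evaluation code, and the pairs below are invented for illustration.

```python
# Sketch: completeness = fraction of curated pairs recovered by the extractor;
# accuracy = fraction of extracted pairs confirmed by the curated database.
# Both arguments are sets of (cell_type, marker_gene) tuples.

def completeness_accuracy(extracted, curated):
    hits = extracted & curated
    completeness = len(hits) / len(curated) if curated else 0.0
    accuracy = len(hits) / len(extracted) if extracted else 0.0
    return completeness, accuracy
```

Under this definition, extracted pairs absent from the curated set lower accuracy but may also flag genuinely novel cell types and markers, as the study reports.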


Assuntos
Biomarcadores , Processamento de Linguagem Natural , Análise de Célula Única , Humanos , Animais , Análise de Célula Única/métodos , Camundongos , Análise de Sequência de RNA/métodos , Bases de Dados Genéticas , Biologia Computacional/métodos
19.
J Med Internet Res ; 26: e55591, 2024 Sep 11.
Artigo em Inglês | MEDLINE | ID: mdl-39259963

RESUMO

BACKGROUND: Social media posts that portray vaping in positive social contexts shape people's perceptions and serve to normalize vaping. Despite restrictions on depicting or promoting controlled substances, vape-related content is easily accessible on TikTok. There is a need to understand strategies used in promoting vaping on TikTok, especially among susceptible youth audiences. OBJECTIVE: This study seeks to comprehensively describe direct (ie, explicit promotional efforts) and indirect (ie, subtler strategies) themes promoting vaping on TikTok using a mixture of computational and qualitative thematic analyses of social media posts. In addition, we aim to describe how these themes might play a role in normalizing vaping behavior on TikTok for youth audiences, thereby informing public health communication and regulatory policies regarding vaping endorsements on TikTok. METHODS: We collected 14,002 unique TikTok posts using 50 vape-related hashtags (eg, #vapetok and #boxmod). Using the k-means unsupervised machine learning algorithm, we identified clusters and then categorized posts qualitatively based on themes. Next, we organized all videos from the posts thematically and extracted the visual features of each theme using 3 machine learning-based model architectures: residual network (ResNet) with 50 layers (ResNet50), Visual Geometry Group model with 16 layers, and vision transformer. We chose the best-performing model, ResNet50, to thoroughly analyze the image clustering output. To assess clustering accuracy, we examined 4.01% (441/10,990) of the samples from each video cluster. Finally, we randomly selected 50 videos (5% of the total videos) from each theme, which were qualitatively coded and compared with the machine-derived classification for validation. RESULTS: We identified 5 major themes from the TikTok posts. Vape product marketing (1160/14,002, 8.28%) reflected direct marketing, while the other 4 themes reflected indirect marketing: TikTok influencer (3775/14,002, 26.96%), general vape (2741/14,002, 19.58%), vape brands (2042/14,002, 14.58%), and vaping cessation (1272/14,002, 9.08%). The ResNet50 model successfully classified clusters based on image features, achieving an average F1-score of 0.97, the highest among the 3 models. Qualitative content analyses indicated that vaping was depicted as a normal, routine part of daily life, with TikTok influencers subtly incorporating vaping into popular culture (eg, gaming, skateboarding, and tattooing) and social practices (eg, shopping sprees, driving, and grocery shopping). CONCLUSIONS: The results from both computational and qualitative analyses of text and visual data reveal that vaping is normalized on TikTok. Our identified themes underscore how everyday conversations, promotional content, and the influence of popular figures collectively contribute to depicting vaping as a normal and accepted aspect of daily life on TikTok. Our study provides valuable insights for regulatory policies and public health initiatives aimed at tackling the normalization of vaping on social media platforms.
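The k-means step described in the METHODS can be illustrated with a minimal pure-Python implementation. Real pipelines vectorize captions and hashtags (e.g., with TF-IDF) and use library implementations, so the 2-D toy points and fixed k below are purely illustrative.

```python
# Minimal k-means sketch: repeatedly assign points to the nearest center,
# then recompute each center as the mean of its assigned points.
import math
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
    return centers, clusters
```

After clustering, each cluster can be inspected and labeled qualitatively, mirroring the theme-coding step in the study.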


Assuntos
Processamento de Linguagem Natural , Mídias Sociais , Vaping , Vaping/psicologia , Humanos , Adolescente , Pesquisa Qualitativa
20.
Psychiatry Res ; 341: 116119, 2024 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-39226873

RESUMO

Natural Language Processing (NLP) methods have shown promise for the assessment of formal thought disorder, a hallmark feature of schizophrenia in which disturbances to the structure, organization, or coherence of thought can manifest as disordered or incoherent speech. We investigated the suitability of modern Large Language Models (LLMs; e.g., GPT-3.5, GPT-4, and Llama 3) to predict expert-generated ratings for three dimensions of thought disorder (coherence, content, and tangentiality) assigned to speech samples collected from both patients with a diagnosis of schizophrenia (n = 26) and healthy control participants (n = 25). In addition to (1) evaluating the accuracy of LLM-generated ratings relative to human experts, we (2) investigated the degree to which the LLMs produced consistent ratings across multiple trials, and (3) sought to understand the factors that affected the consistency of LLM-generated output. We found that machine-generated ratings of the level of thought disorder in speech compared favorably with those of expert humans, and we identified a tradeoff between accuracy and consistency in LLM ratings. Unlike traditional NLP methods, LLMs were not always consistent in their predictions, but these inconsistencies could be mitigated with careful parameter selection and ensemble methods. We discuss the implications for NLP-based assessment of thought disorder and provide recommendations for best practices for integrating these methods into the field of psychiatry.
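The ensemble idea mentioned above (querying a model several times and aggregating) can be sketched as follows. Here `rate_sample` is a stand-in for an LLM call, not a real API, and median aggregation with a spread-based consistency check is one plausible choice rather than the paper's exact method.

```python
# Sketch: repeat a (stubbed) rating call n_trials times; the median damps
# trial-to-trial variance, and the population standard deviation serves as a
# simple consistency diagnostic across trials.
import statistics

def ensemble_rating(rate_sample, text, n_trials=5):
    """Aggregate repeated ratings of `text`; return (rating, spread)."""
    ratings = [rate_sample(text) for _ in range(n_trials)]
    return statistics.median(ratings), statistics.pstdev(ratings)
```

A large spread flags samples where single-shot ratings would be unreliable, which is where ensembling buys the most.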


Assuntos
Processamento de Linguagem Natural , Esquizofrenia , Pensamento , Humanos , Feminino , Esquizofrenia/diagnóstico , Esquizofrenia/fisiopatologia , Masculino , Adulto , Pensamento/fisiologia , Pessoa de Meia-Idade , Psicologia do Esquizofrênico