Results 1 - 20 of 7,150
1.
JMIR Infodemiology ; 4: e51156, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39269743

ABSTRACT

BACKGROUND: The growing availability of big data spontaneously generated by social media platforms allows us to leverage natural language processing (NLP) methods as valuable tools to understand the opioid crisis. OBJECTIVE: We aimed to understand how NLP has been applied to Reddit (Reddit Inc) data to study opioid use. METHODS: We systematically searched for peer-reviewed studies and conference abstracts in PubMed, Scopus, PsycINFO, ACL Anthology, IEEE Xplore, and Association for Computing Machinery data repositories up to July 19, 2022. Inclusion criteria were studies investigating opioid use, using NLP techniques to analyze the textual corpora, and using Reddit as the social media data source. We were specifically interested in mapping studies' overarching goals and findings, methodologies and software used, and main limitations. RESULTS: In total, 30 studies were included, which were classified into 4 nonmutually exclusive overarching goal categories: methodological (n=6, 20% studies), infodemiology (n=22, 73% studies), infoveillance (n=7, 23% studies), and pharmacovigilance (n=3, 10% studies). NLP methods were used to identify content relevant to opioid use among vast quantities of textual data, to establish potential relationships between opioid use patterns or profiles and contextual factors or comorbidities, and to anticipate individuals' transitions between different opioid-related subreddits, likely revealing progression through opioid use stages. Most studies used an embedding technique (12/30, 40%), prediction or classification approach (12/30, 40%), topic modeling (9/30, 30%), and sentiment analysis (6/30, 20%). The most frequently used programming languages were Python (20/30, 67%) and R (2/30, 7%). Among the studies that reported limitations (20/30, 67%), the most cited was the uncertainty regarding whether redditors participating in these forums were representative of people who use opioids (8/20, 40%). 
The papers were very recent (28/30, 93%), from 2019 to 2022, with authors from a range of disciplines. CONCLUSIONS: This scoping review identified a wide variety of NLP techniques and applications used to support surveillance and social media interventions addressing the opioid crisis. Despite the clear potential of these methods to enable the identification of opioid-relevant content in Reddit and its analysis, there are limits to the degree of interpretive meaning that they can provide. Moreover, we identified the need for standardized ethical guidelines to govern the use of Reddit data to safeguard the anonymity and privacy of people using these forums.


Subjects
Natural Language Processing , Social Media , Humans , Opioid-Related Disorders/epidemiology , Analgesics, Opioid/adverse effects , Analgesics, Opioid/therapeutic use
2.
BMC Bioinformatics ; 25(1): 301, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39272021

ABSTRACT

Transformer-based large language models (LLMs) are well suited to biological sequence data because of analogies to natural language. Complex relationships can be learned because tokenization provides a concept of "words". Trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology; it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful for evaluating the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context. Instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
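
The overlap-leakage argument above can be illustrated with a toy tokenizer (a sketch, not DNABERT's actual implementation; the function name is ours): with overlapping k-mers, adjacent tokens share k-1 bases, so the next token is almost fully constrained by the current token's identity, whereas non-overlapping tokens remove that shortcut.

```python
def kmer_tokenize(seq, k=6, overlap=True):
    """Split a DNA sequence into k-mer tokens.

    overlap=True mimics DNABERT-style overlapping k-mers (stride 1);
    overlap=False yields the non-overlapping tokens used in the
    next-token benchmarking task described above.
    """
    stride = 1 if overlap else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ACGTACGTTGCA"
toks = kmer_tokenize(seq, k=6)
# Adjacent overlapping tokens share k-1 = 5 bases: the next token is
# largely determined by the current one, i.e. information "leaks" from
# token identity into the prediction target.
leaks = toks[0][1:] == toks[1][:-1]
```

With `overlap=False` the same 12-base sequence yields only two disjoint tokens, so no such shortcut exists.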


Subjects
DNA , Humans , DNA/chemistry , Genome, Human , Sequence Analysis, DNA/methods , Natural Language Processing , Computational Biology/methods
3.
Semin Vasc Surg ; 37(3): 314-320, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39277347

ABSTRACT

Natural language processing is a subfield of artificial intelligence that aims to analyze human oral or written language. The development of large language models has brought innovative perspectives in medicine, including the potential use of chatbots and virtual assistants. Nevertheless, the benefits and pitfalls of such technology need to be carefully evaluated before their use in health care. The aim of this narrative review was to provide an overview of potential applications of large language models and artificial intelligence chatbots in the field of vascular surgery, including clinical practice, research, and education. In light of the results, we discuss current limits and future directions.


Subjects
Artificial Intelligence , Natural Language Processing , Vascular Surgical Procedures , Humans
4.
J Biomed Semantics ; 15(1): 17, 2024 Sep 15.
Article in English | MEDLINE | ID: mdl-39277770

ABSTRACT

BACKGROUND: Natural language processing (NLP) is increasingly being used to extract structured information from unstructured text to assist clinical decision-making and aid healthcare research. The availability of expert-annotated documents for the development and validation of NLP applications is limited. We created synthetic clinical documents to address this, and to validate the Extraction of Epilepsy Clinical Text version 2 (ExECTv2) NLP pipeline. METHODS: We created 200 synthetic clinic letters based on hospital outpatient consultations with epilepsy specialists. The letters were double-annotated by trained clinicians and researchers according to agreed guidelines. We used the annotation tool Markup with an epilepsy concept list based on the Unified Medical Language System ontology. All annotations were reviewed, and a gold standard set of annotations was agreed and used to validate the performance of ExECTv2. RESULTS: The overall inter-annotator agreement (IAA) between the two sets of annotations produced a per-item F1 score of 0.73. Validating ExECTv2 against the gold standard gave an overall F1 score of 0.87 per item and 0.90 per letter. CONCLUSION: The synthetic letters, annotations, and annotation guidelines have been made freely available. To our knowledge, this is the first publicly available set of annotated epilepsy clinic letters and guidelines that can be used by NLP researchers with minimal epilepsy knowledge. The IAA results show that clinical text annotation tasks are difficult and require a gold standard arranged by researcher consensus. ExECTv2, our automated epilepsy NLP pipeline, extracted detailed epilepsy information from unstructured epilepsy letters with greater accuracy than the human annotators, further confirming the utility of NLP for clinical and research applications.
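
The per-item agreement and validation scores above are pairwise F1 computations over annotation sets; a minimal sketch (our own helper, not part of the ExECTv2 code, and the pipeline's exact matching criteria may differ) treats one annotator's spans as gold against the other's:

```python
def f1_agreement(gold, predicted):
    """Pairwise F1 over annotation items, each item e.g. a
    (start, end, concept) tuple. For span annotations, chance-corrected
    agreement measures are awkward to define, so pairwise F1 is commonly
    reported as inter-annotator agreement."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy annotations: two of three items agree exactly.
annotator_a = {(0, 9, "seizure"), (15, 27, "lamotrigine"), (30, 38, "EEG")}
annotator_b = {(0, 9, "seizure"), (15, 27, "lamotrigine"), (40, 45, "MRI")}
iaa = f1_agreement(annotator_a, annotator_b)
```

The same function scores a pipeline's output against the consensus gold standard, which is how the 0.87 per-item figure would be computed.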


Subjects
Epilepsy , Natural Language Processing , Humans , Data Curation/methods
5.
Perm J ; 28(3): 23-36, 2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39219312

ABSTRACT

INTRODUCTION: Rapid identification of individuals developing a psychotic spectrum disorder (PSD) is crucial because untreated psychosis is associated with poor outcomes and decreased treatment response. Lack of recognition of early psychotic symptoms often delays diagnosis, further worsening these outcomes. METHODS: The proposed study is a cross-sectional, retrospective analysis of electronic health record data including clinician documentation and patient-clinician secure messages for patients aged 15-29 years with ≥ 1 primary care encounter between 2017 and 2019 within 2 Kaiser Permanente regions. Patients with new-onset PSD will be distinguished from those without a diagnosis if they have ≥ 1 PSD diagnosis within 12 months following the primary care encounter. The prediction model will be trained using a trisourced natural language processing feature extraction design and validated both within each region separately and in a modified combined sample. DISCUSSION: This proposed model leverages the strengths of the large volume of patient-specific data from an integrated electronic health record with natural language processing to identify patients at elevated risk of developing a PSD. This project carries the potential to reduce the duration of untreated psychosis and thereby improve long-term patient outcomes.


Subjects
Electronic Health Records , Machine Learning , Natural Language Processing , Psychotic Disorders , Humans , Psychotic Disorders/diagnosis , Adolescent , Retrospective Studies , Adult , Young Adult , Cross-Sectional Studies , Male , Female , Primary Health Care
6.
Bioinformatics ; 40(9)2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39222004

ABSTRACT

MOTIVATION: Natural language is poised to become a key medium for human-machine interactions in the era of large language models. In the field of biochemistry, tasks such as property prediction and molecule mining are critically important yet technically challenging. Bridging molecular expressions in natural language and chemical language can significantly enhance the interpretability and ease of these tasks. Moreover, it can integrate chemical knowledge from various sources, leading to a deeper understanding of molecules. RESULTS: Recognizing these advantages, we introduce the concept of conversational molecular design, a novel task that utilizes natural language to describe and edit target molecules. To better accomplish this task, we develop ChatMol, a knowledgeable and versatile generative pretrained model. This model is enhanced by incorporating experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages. Several typical solutions, including large language models (e.g., ChatGPT), are evaluated, demonstrating both the difficulty of conversational molecular design and the effectiveness of our knowledge-enhancement approach. Case observations and analysis offer insights and directions for further exploration of natural-language interaction in molecular discovery. AVAILABILITY AND IMPLEMENTATION: Code and data are provided at https://github.com/Ellenzzn/ChatMol/tree/main.


Subjects
Natural Language Processing , Humans , Software , Computational Biology/methods
7.
Sci Data ; 11(1): 982, 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39251610

ABSTRACT

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.


Subjects
Enzymes , Natural Language Processing , Enzymes/chemistry , PubMed , Databases, Protein , Knowledge Bases
8.
JCO Clin Cancer Inform ; 8: e2400134, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39265122

ABSTRACT

PURPOSE: Data on end-of-life care (EOLC) quality, assessed through evidence-based quality measures (QMs), are difficult to obtain. Natural language processing (NLP) enables efficient quality measurement and is not yet used for children with serious illness. We sought to validate a pediatric-specific EOLC-QM keyword library and evaluate EOLC-QM attainment among childhood cancer decedents. METHODS: In a single-center cohort of children with cancer who died between 2014 and 2022, we piloted a rule-based NLP approach to examine the content of clinical notes in the last 6 months of life. We identified documented discussions of five EOLC-QMs: goals of care, limitations to life-sustaining treatments (LLST), hospice, palliative care consultation, and preferred location of death. We assessed performance of NLP methods, compared with gold standard manual chart review. We then used NLP to characterize proportions of decedents with documented EOLC-QM discussions and timing of first documentation relative to death. RESULTS: Among 101 decedents, nearly half were minorities (Hispanic/Latinx [24%], non-Hispanic Black/African American [20%]), female (48%), or diagnosed with solid tumors (43%). Through iterative refinement, our keyword library achieved robust performance statistics (for all EOLC-QMs, F1 score = 1.0). Most decedents had documented discussions regarding goals of care (83%), LLST (83%), and hospice (74%). Fewer decedents had documented discussions regarding palliative care consultation (49%) or preferred location of death (36%). For all five EOLC-QMs, first documentation occurred, on average, >30 days before death. CONCLUSION: A high proportion of decedents attained specified EOLC-QMs more than 30 days before death. Our findings indicate that NLP is a feasible approach to measuring quality of care for children with cancer at the end of life and is ripe for multi-center research and quality improvement.
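
A rule-based keyword approach like the one validated here can be sketched as regular-expression matching over note text. The mini keyword library below is hypothetical; the study's validated pediatric library is larger and was refined iteratively against gold-standard chart review:

```python
import re

# Hypothetical mini keyword library (illustrative terms only).
EOLC_KEYWORDS = {
    "goals_of_care": [r"goals?\s+of\s+care", r"\bGOC\b"],
    "hospice": [r"\bhospice\b"],
    "palliative_care": [r"palliative\s+care\s+(consult|team|service)"],
    "location_of_death": [r"prefer(red)?\s+location\s+of\s+death"],
}

def detect_qms(note_text):
    """Return the set of EOLC quality measures whose keywords appear in a
    clinical note; in the study, first documentation dates per decedent
    would then be compared against the date of death."""
    found = set()
    for qm, patterns in EOLC_KEYWORDS.items():
        if any(re.search(p, note_text, re.IGNORECASE) for p in patterns):
            found.add(qm)
    return found

note = "Discussed goals of care with family; hospice referral placed."
```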


Subjects
Natural Language Processing , Neoplasms , Terminal Care , Humans , Terminal Care/standards , Terminal Care/methods , Neoplasms/therapy , Neoplasms/mortality , Child , Female , Male , Child, Preschool , Adolescent , Quality of Health Care , Infant , Palliative Care/methods , Palliative Care/standards , Infant, Newborn , Electronic Health Records
9.
Bioinformatics ; 40(Suppl 2): ii45-ii52, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230709

ABSTRACT

MOTIVATION: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. RESULTS: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and by over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%). AVAILABILITY AND IMPLEMENTATION: All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
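
The agreement-based dataset construction can be sketched as a set intersection over the spans produced by several NER methods (the format and names below are ours; the actual pipeline compares boundaries and entity types across three specific biomedical taggers):

```python
def consensus_spans(method_outputs):
    """Keep only (start, end, entity_type) spans on which every NER method
    agrees -- the silver-standard examples used to train the context
    classifier described above."""
    return set.intersection(*(set(output) for output in method_outputs))

tagger_1 = {(0, 4, "gene"), (10, 17, "disease")}
tagger_2 = {(0, 4, "gene"), (10, 17, "disease"), (20, 24, "chemical")}
tagger_3 = {(0, 4, "gene"), (10, 17, "species")}  # disagrees on the type
agreed = consensus_spans([tagger_1, tagger_2, tagger_3])
```

Spans where any method disagrees on boundaries or type are dropped, so the surviving examples are high-precision contexts for each entity type.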


Subjects
Deep Learning , Databases, Factual , Dictionaries as Topic , Computational Biology/methods , Natural Language Processing , Data Mining/methods
10.
Sci Rep ; 14(1): 20994, 2024 09 09.
Article in English | MEDLINE | ID: mdl-39251659

ABSTRACT

Sound recognition is effortless for humans but poses a significant challenge for artificial hearing systems. Deep neural networks (DNNs), especially convolutional neural networks (CNNs), have recently surpassed traditional machine learning in sound classification. However, current DNNs map sounds to labels using binary categorical variables, neglecting the semantic relations between labels. Cognitive neuroscience research suggests that human listeners exploit such semantic information besides acoustic cues. Hence, our hypothesis is that incorporating semantic information improves DNNs' sound recognition performance, emulating human behaviour. In our approach, sound recognition is framed as a regression problem, with CNNs trained to map spectrograms to continuous semantic representations from NLP models (Word2Vec, BERT, and the CLAP text encoder). Two DNN types were trained: semDNN with continuous embeddings and catDNN with categorical labels, both on a dataset extracted from a collection of 388,211 sounds enriched with semantic descriptions. Evaluations across four external datasets confirmed the superiority of semantic labeling from semDNN compared to catDNN, preserving higher-level relations. Importantly, an analysis of human similarity ratings for natural sounds showed that semDNN approximated human listener behaviour better than catDNN, other DNNs, and NLP models. Our work contributes to understanding the role of semantics in sound recognition, bridging the gap between artificial systems and human auditory perception.
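
Because semDNN regresses onto continuous embeddings, its output must be mapped back to a discrete label for categorical scoring; a standard way to do this (sketched here with stdlib code and made-up 2-D embeddings) is nearest-neighbour search under cosine similarity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_label(predicted_embedding, label_embeddings):
    """Map a regressed semantic embedding to the label whose embedding is
    closest in cosine similarity."""
    return max(label_embeddings,
               key=lambda lab: cosine(predicted_embedding, label_embeddings[lab]))

labels = {"dog bark": [1.0, 0.0], "siren": [0.0, 1.0]}  # toy embeddings
best = nearest_label([0.9, 0.1], labels)
```

Unlike an argmax over class logits, this mapping keeps near-misses semantically close (a misprediction lands on a related label rather than an arbitrary one).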


Subjects
Auditory Perception , Natural Language Processing , Neural Networks, Computer , Semantics , Humans , Auditory Perception/physiology , Deep Learning , Sound
11.
Psychiatry Res ; 341: 116119, 2024 Nov.
Article in English | MEDLINE | ID: mdl-39226873

ABSTRACT

Natural Language Processing (NLP) methods have shown promise for the assessment of formal thought disorder, a hallmark feature of schizophrenia in which disturbances to the structure, organization, or coherence of thought can manifest as disordered or incoherent speech. We investigated the suitability of modern Large Language Models (LLMs - e.g., GPT-3.5, GPT-4, and Llama 3) to predict expert-generated ratings for three dimensions of thought disorder (coherence, content, and tangentiality) assigned to speech samples collected from both patients with a diagnosis of schizophrenia (n = 26) and healthy control participants (n = 25). In addition to (1) evaluating the accuracy of LLM-generated ratings relative to human experts, we also (2) investigated the degree to which the LLMs produced consistent ratings across multiple trials, and we (3) sought to understand the factors that impacted the consistency of LLM-generated output. We found that machine-generated ratings of the level of thought disorder in speech compared favorably with those of human experts, and we identified a tradeoff between accuracy and consistency in LLM ratings. Unlike traditional NLP methods, LLMs were not always consistent in their predictions, but these inconsistencies could be mitigated with careful parameter selection and ensemble methods. We discuss implications for NLP-based assessment of thought disorder and provide recommendations for best practices for integrating these methods in the field of psychiatry.
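
One mitigation strategy alluded to above, ensembling repeated LLM ratings, can be sketched in a few lines (our illustration, not the authors' exact procedure): the median across trials damps occasional outlier generations, and the spread across trials gives a simple consistency measure.

```python
from statistics import median, pstdev

def ensemble_rating(trial_ratings):
    """Aggregate repeated LLM ratings of one speech sample; the median is
    robust to an occasional aberrant generation."""
    return median(trial_ratings)

def rating_spread(trial_ratings):
    """Population standard deviation across trials: 0 means the model
    produced perfectly consistent ratings."""
    return pstdev(trial_ratings)

trials = [3, 3, 4, 3, 9]  # five runs on one sample; one outlier generation
```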


Subjects
Natural Language Processing , Schizophrenia , Thinking , Humans , Female , Schizophrenia/diagnosis , Schizophrenia/physiopathology , Male , Adult , Thinking/physiology , Middle Aged , Schizophrenic Psychology
12.
JMIR Ment Health ; 11: e58259, 2024 Sep 04.
Article in English | MEDLINE | ID: mdl-39233477

ABSTRACT

Background: Depression represents a pressing global public health concern, impacting the physical and mental well-being of hundreds of millions worldwide. Notwithstanding advances in clinical practice, an alarming number of individuals at risk for depression continue to face significant barriers to timely diagnosis and effective treatment, thereby exacerbating a burgeoning social health crisis. Objective: This study seeks to develop a novel online depression risk detection method using natural language processing technology to identify individuals at risk of depression on the Chinese social media platform Sina Weibo. Methods: First, we collected approximately 527,333 posts publicly shared over 1 year by 1600 individuals with depression and 1600 individuals without depression on the Sina Weibo platform. We then developed a hierarchical transformer network for learning user-level semantic representations, which consists of 3 primary components: a word-level encoder, a post-level encoder, and a semantic aggregation encoder. The word-level encoder learns semantic embeddings from individual posts, while the post-level encoder explores features in user post sequences. The semantic aggregation encoder aggregates post sequence semantics to generate a user-level semantic representation that can be classified as depressed or nondepressed. Next, a classifier is employed to predict the risk of depression. Finally, we conducted statistical and linguistic analyses of the post content from individuals with and without depression using the Chinese Linguistic Inquiry and Word Count. Results: We divided the original data set into training, validation, and test sets. The training set consisted of 1000 individuals with depression and 1000 individuals without depression. The validation and test sets each comprised 600 users, with 300 individuals from each cohort (depression and nondepression).
Our method achieved an accuracy of 84.62%, precision of 84.43%, recall of 84.50%, and F1-score of 84.32% on the test set without employing sampling techniques. However, by applying our proposed retrieval-based sampling strategy, we observed significant improvements in performance: an accuracy of 95.46%, precision of 95.30%, recall of 95.70%, and F1-score of 95.43%. These outstanding results clearly demonstrate the effectiveness and superiority of our proposed depression risk detection model and retrieval-based sampling technique. This breakthrough provides new insights for large-scale depression detection through social media. Through language behavior analysis, we discovered that individuals with depression are more likely to use negation words (the value of "swear" is 0.001253). This may indicate the presence of negative emotions, rejection, doubt, disagreement, or aversion in individuals with depression. Additionally, our analysis revealed that individuals with depression tend to use negative emotional vocabulary in their expressions ("NegEmo": 0.022306; "Anx": 0.003829; "Anger": 0.004327; "Sad": 0.005740), which may reflect their internal negative emotions and psychological state. This frequent use of negative vocabulary could be a way for individuals with depression to express negative feelings toward life, themselves, or their surrounding environment. Conclusions: The research results indicate the feasibility and effectiveness of using deep learning methods to detect the risk of depression. These findings provide insights into the potential for large-scale, automated, and noninvasive prediction of depression among online social media users.
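
The three-level aggregation described in the Methods can be sketched with mean pooling standing in for the transformer encoders at each level (a simplification; the actual model learns these aggregations end to end):

```python
def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def user_representation(posts):
    """Hierarchical aggregation sketch: pool word vectors into a post
    vector, then pool post vectors into one user-level vector that a
    downstream classifier would label depressed or nondepressed."""
    post_vectors = [mean_pool(word_vectors) for word_vectors in posts]
    return mean_pool(post_vectors)

# Two posts: the first has two word vectors, the second has one.
user_vec = user_representation([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]])
```

The hierarchy matters because a user's signal is spread across many posts of varying length; pooling per post before pooling per user keeps long posts from dominating the user-level representation.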


Subjects
Depression , Natural Language Processing , Social Media , Humans , Depression/diagnosis , Depression/psychology , Depression/epidemiology , Social Media/statistics & numerical data , China/epidemiology , Semantics , Risk Assessment/methods
13.
J Med Internet Res ; 26: e55591, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39259963

ABSTRACT

BACKGROUND: Social media posts that portray vaping in positive social contexts shape people's perceptions and serve to normalize vaping. Despite restrictions on depicting or promoting controlled substances, vape-related content is easily accessible on TikTok. There is a need to understand strategies used in promoting vaping on TikTok, especially among susceptible youth audiences. OBJECTIVE: This study seeks to comprehensively describe direct (ie, explicit promotional efforts) and indirect (ie, subtler strategies) themes promoting vaping on TikTok using a mixture of computational and qualitative thematic analyses of social media posts. In addition, we aim to describe how these themes might play a role in normalizing vaping behavior on TikTok for youth audiences, thereby informing public health communication and regulatory policies regarding vaping endorsements on TikTok. METHODS: We collected 14,002 unique TikTok posts using 50 vape-related hashtags (eg, #vapetok and #boxmod). Using the k-means unsupervised machine learning algorithm, we identified clusters and then categorized posts qualitatively based on themes. Next, we organized all videos from the posts thematically and extracted the visual features of each theme using 3 machine learning-based model architectures: residual network (ResNet) with 50 layers (ResNet50), Visual Geometry Group model with 16 layers, and vision transformer. We chose the best-performing model, ResNet50, to thoroughly analyze the image clustering output. To assess clustering accuracy, we examined 4.01% (441/10,990) of the samples from each video cluster. Finally, we randomly selected 50 videos (5% of the total videos) from each theme, which were qualitatively coded and compared with the machine-derived classification for validation. RESULTS: We successfully identified 5 major themes from the TikTok posts. 
Vape product marketing (1160/14,002, 8.28%) reflected direct marketing, while the other 4 themes reflected indirect marketing: TikTok influencer (3775/14,002, 26.96%), general vape (2741/14,002, 19.58%), vape brands (2042/14,002, 14.58%), and vaping cessation (1272/14,002, 9.08%). The ResNet50 model successfully classified clusters based on image features, achieving an average F1-score of 0.97, the highest among the 3 models. Qualitative content analyses indicated that vaping was depicted as a normal, routine part of daily life, with TikTok influencers subtly incorporating vaping into popular culture (eg, gaming, skateboarding, and tattooing) and social practices (eg, shopping sprees, driving, and grocery shopping). CONCLUSIONS: The results from both computational and qualitative analyses of text and visual data reveal that vaping is normalized on TikTok. Our identified themes underscore how everyday conversations, promotional content, and the influence of popular figures collectively contribute to depicting vaping as a normal and accepted aspect of daily life on TikTok. Our study provides valuable insights for regulatory policies and public health initiatives aimed at tackling the normalization of vaping on social media platforms.
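
The unsupervised clustering step can be illustrated with a minimal 1-D k-means (a toy; the study ran k-means over high-dimensional text features of 14,002 posts before qualitative theme labeling):

```python
def kmeans_1d(points, k, iters=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then recompute centroids as cluster means, for a fixed number of
    iterations. Initialization takes evenly spaced sorted points."""
    centroids = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

cents = kmeans_1d([1.0, 2.0, 10.0, 11.0], k=2)
```

In the study, each resulting cluster was then inspected qualitatively and merged into the five themes reported above.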


Subjects
Natural Language Processing , Social Media , Vaping , Vaping/psychology , Humans , Adolescent , Qualitative Research
14.
Database (Oxford) ; 2024, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39259689

ABSTRACT

This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with Bidirectional Long Short-Term Memory and conditional random field layers on an augmented train set, achieving an F1 score of 0.73. Entity linking is performed via a hybrid approach with dictionaries, generating candidates from a knowledge base containing Unified Medical Language System aliases using the cross-lingual SapBERT and reranking the top candidates using GPT-3.5. The entity linking approach shows consistent results of 0.73 accuracy across multiple languages of the SympTEMIST multilingual dataset and also achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask. Database URL: https://github.com/svassileva/symptemist-multilingual-linking.
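
The dictionary stage of the hybrid linking pipeline can be sketched as normalized alias lookup (the alias table and identifiers below are toys; the real knowledge base holds UMLS aliases, with SapBERT retrieval and GPT-3.5 reranking layered on top):

```python
# Toy alias table: surface forms -> concept identifiers (illustrative only).
ALIASES = {
    "headache": "CUI_0001",
    "cephalalgia": "CUI_0001",  # synonym mapped to the same concept
    "fever": "CUI_0002",
}

def link_entity(mention, aliases=ALIASES):
    """Dictionary-based candidate generation: normalize the mention and
    look it up. Returns None when no alias matches, which is where
    embedding-based retrieval (e.g. SapBERT) would take over."""
    return aliases.get(mention.strip().lower())
```

Exact-match dictionaries are cheap and precise but brittle to spelling variation, which is why the paper backs them with cross-lingual embedding retrieval and an LLM reranker.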


Subjects
Multilingualism , Humans , Natural Language Processing , Unified Medical Language System
15.
PLoS One ; 19(9): e0310432, 2024.
Article in English | MEDLINE | ID: mdl-39259727

ABSTRACT

Real-world data (RWD) in the medical field, such as electronic health records (EHRs) and medication orders, are receiving increasing attention from researchers and practitioners. While structured data have played a vital role thus far, unstructured data represented by text (e.g., discharge summaries) are not effectively utilized because of the difficulty in extracting medical information. We evaluated the information gained by supplementing structured data with clinical concepts extracted from unstructured text by leveraging natural language processing techniques. Using a machine learning-based pretrained named entity recognition tool, we extracted disease and medication names from real discharge summaries in a Japanese hospital and linked them to medical concepts using medical term dictionaries. By comparing the diseases and medications mentioned in the text with medical codes in tabular diagnosis records, we found that: (1) the text data contained richer information on patient symptoms than tabular diagnosis records, whereas the medication-order table stored more injection data than text. In addition, (2) extractable information regarding specific diseases showed surprisingly small intersections among text, diagnosis records, and medication orders. Text data can thus be a useful supplement for RWD mining, which is further demonstrated by (3) our practical application system for drug safety evaluation, which exhaustively visualizes suspicious adverse drug effects caused by the simultaneous use of anticancer drug pairs. We conclude that proper use of textual information extraction can lead to better outcomes in medical RWD mining.


Subjects
Electronic Health Records , Natural Language Processing , Neoplasms , Humans , Japan , Neoplasms/drug therapy , Data Mining/methods , Machine Learning , East Asian People
16.
Sci Data ; 11(1): 992, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39261552

ABSTRACT

Currently, the field of neurobiology of language is based on data from only a few Indo-European languages, and the majority of this data comes from younger adults, neglecting other age groups. Here we present a multimodal database which consists of task-based and resting-state fMRI, structural MRI, and EEG data collected while participants over 65 years old listened to sections of the story The Little Prince in Cantonese. We also provide data on participants' language history, lifetime experiences, and linguistic and cognitive skills. Audio and text annotations, including time-aligned speech segmentation and prosodic information, as well as word-by-word predictors such as frequency and part-of-speech tags derived from natural language processing (NLP) tools, are included in this database. Diagnostics of both the MRI and EEG data indicated good quality. This multimodal database could advance our understanding of the spatiotemporal dynamics of language comprehension in the older population and help us study the effects of healthy aging on the relationship between brain and behaviour.


Subjects
Electroencephalography , Language , Magnetic Resonance Imaging , Humans , Aged , Brain/diagnostic imaging , Brain/physiology , Hong Kong , Natural Language Processing , Databases, Factual
17.
Sci Rep ; 14(1): 21183, 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39261578

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes and relies heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Leveraging MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a comprehensive collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also unveiling 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the compiled brain tissue marker gene list from MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with the original studies. Conclusions: Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, providing a systematic demonstration of the transformative potential of this approach. The 27,323 manually reviewed sentences used for training MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT .
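
The completeness/accuracy comparison against a curated database reduces to set overlap: completeness is the share of curated markers the system recovered, and accuracy is the share of extracted markers confirmed by the curated set. The gene symbols below are illustrative, not the study's data.

```python
# Hypothetical extracted vs. curated marker sets for one cell type.
extracted = {"CD3E", "CD19", "MS4A1", "NKG7"}
curated = {"CD3E", "CD19", "MS4A1", "PTPRC"}

overlap = extracted & curated
completeness = len(overlap) / len(curated)    # curated markers recovered (recall-like)
accuracy = len(overlap) / len(extracted)      # extracted markers confirmed (precision-like)

print(completeness, accuracy)  # 0.75 0.75
```

Markers in `extracted - curated` are candidates for the novel entries the study reports as absent from existing databases (subject to manual review).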


Subjects
Biomarkers , Natural Language Processing , Single-Cell Analysis , Humans , Animals , Single-Cell Analysis/methods , Mice , Sequence Analysis, RNA/methods , Databases, Genetic , Computational Biology/methods
19.
J Med Internet Res ; 26: e60501, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39255030

ABSTRACT

BACKGROUND: Prompt engineering, which focuses on crafting effective prompts for large language models (LLMs), has garnered attention for its ability to harness the potential of LLMs. This is even more crucial in the medical domain because of its specialized terminology and technical language. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts that guide models in extracting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE: The aim of this study is to review research efforts and technical approaches in prompt engineering for medical applications and to provide an overview of opportunities and challenges for clinical practice. METHODS: Databases indexing the fields of medicine, computer science, and medical informatics were queried to identify relevant published papers. Because prompt engineering is an emerging field, preprint databases were also considered. Multiple data points were extracted, such as the prompt paradigm, the LLMs involved, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We included studies that apply prompt engineering-based methods to the medical domain, were published between 2022 and 2024, and cover multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). RESULTS: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we observed that PD is the most prevalent (78 papers). In 12 papers, the terms PD, PL, and PT were used interchangeably. Although ChatGPT is the most commonly used LLM, we identified 7 studies using it on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. Although PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examined each key piece of prompt engineering-specific information reported across papers and found that many studies neglect to mention it explicitly, posing a challenge for advancing prompt engineering research. CONCLUSIONS: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also provide tables and figures summarizing the available medical prompt engineering papers and hope that future contributions will leverage these existing works to better advance the field.
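
Chain-of-thought, the most frequent prompt design technique in the reviewed studies, amounts to appending an instruction that elicits intermediate reasoning before the final answer. The template below is a generic illustration of the pattern, not a prompt from any of the reviewed papers.

```python
def build_cot_prompt(question: str) -> str:
    """Assemble a simple chain-of-thought prompt for a clinical question.

    The role line and reasoning cue are illustrative; real studies tune
    the wording, add few-shot examples, and constrain the output format.
    """
    return (
        "You are a clinical assistant.\n"
        f"Question: {question}\n"
        "Let's think step by step before giving the final answer."
    )

prompt = build_cot_prompt("Does the note mention an adverse drug event?")
print(prompt)
```

Unlike prompt learning or prompt tuning, nothing here is trained: prompt design only manipulates the input string, which is why the review can compare PD studies without model-training baselines.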


Subjects
Natural Language Processing , Humans , Medical Informatics/methods
20.
PLoS One ; 19(9): e0305496, 2024.
Article in English | MEDLINE | ID: mdl-39241041

ABSTRACT

Narratives posted on the internet by patients contain a vast amount of information about various concerns. This study aimed to extract multiple concerns from interviews with breast cancer patients using the natural language processing (NLP) model bidirectional encoder representations from transformers (BERT). A total of 508 interview transcriptions of breast cancer patients, written in Japanese, were labeled with five types of concern labels: "treatment," "physical," "psychological," "work/financial," and "family/friends." The labeled texts were used to create a multi-label classifier by fine-tuning a pre-trained BERT model. Prior to fine-tuning, we also created several classifiers with domain adaptation using (1) breast cancer patients' blog articles and (2) breast cancer patients' interview transcriptions. The performance of the classifiers was evaluated in terms of precision through 5-fold cross-validation. The multi-label classifiers with only fine-tuning had precision values of over 0.80 for "physical" and "work/financial" among the five concerns. On the other hand, precision for "treatment" was low, at approximately 0.25. However, for the classifiers using domain adaptation, the precision for this label ranged from 0.40 to 0.51, in some cases improving by more than 0.2. This study showed that combining domain adaptation with a multi-label classifier on target data made it possible to efficiently extract multiple concerns from interviews.
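
The multi-label decision step that distinguishes this setup from ordinary single-label classification can be sketched as follows: instead of a softmax picking one class, each concern label gets an independent sigmoid and threshold, so a transcript can carry several concerns at once. The logits and the 0.5 threshold below are illustrative, not values from the study.

```python
import math

# The five concern labels used in the study.
LABELS = ["treatment", "physical", "psychological", "work/financial", "family/friends"]

def predict_labels(logits, threshold=0.5):
    """Return every label whose sigmoid probability exceeds the threshold.

    `logits` is one raw score per label, as a fine-tuned BERT model's
    multi-label classification head would output for a single transcript.
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [lab for lab, z in zip(LABELS, logits) if sigmoid(z) > threshold]

# Hypothetical logits for one interview transcript.
print(predict_labels([2.1, -0.3, 0.8, -1.5, 0.1]))
```

Training such a head uses a per-label binary cross-entropy loss rather than categorical cross-entropy, which is what allows precision to be computed separately per concern, as the study reports.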


Subjects
Breast Neoplasms , Natural Language Processing , Humans , Breast Neoplasms/psychology , Female , Narration