Results 1 - 20 of 418
1.
Heliyon ; 10(12): e32093, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38948047

ABSTRACT

Chinese agricultural named entity recognition (NER) has been studied with supervised learning for many years. However, considering the scarcity of public datasets in the agricultural domain, exploring this task in the few-shot scenario is more practical for real-world demands. In this paper, we propose a novel model named GlyReShot, integrating knowledge of Chinese character glyphs into few-shot NER models. Although glyph information has proven successful in supervised models, two challenges still persist in the few-shot setting, i.e., how to obtain glyph representations and when to integrate them into the few-shot model. GlyReShot handles the two challenges by introducing a lightweight module for obtaining glyph representations and a training-free label refinement strategy. Specifically, the glyph representations are generated from descriptive sentences by filling a predefined template. As most steps come before training, this module aligns well with the few-shot setting. Furthermore, by computing confidence values for draft predictions, the refinement strategy utilizes the glyph information selectively, only when the confidence values are relatively low, thus mitigating the influence of noise. Finally, we annotate a new agricultural NER dataset, and the experimental results demonstrate the effectiveness of GlyReShot for few-shot Chinese agricultural NER.
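The confidence-gated refinement strategy described in this abstract can be illustrated with a minimal sketch. All labels, scores, and the threshold below are hypothetical assumptions; in the paper the confidences come from a trained few-shot model, not hand-set values.

```python
# Illustrative sketch only: glyph-informed labels are consulted solely for
# tokens where the draft model's confidence is low, mirroring GlyReShot's
# training-free label refinement idea. Threshold and labels are assumptions.

def refine_labels(draft, glyph, threshold=0.5):
    """draft, glyph: per-token (label, confidence) pairs."""
    refined = []
    for (d_label, d_conf), (g_label, _) in zip(draft, glyph):
        # Trust the draft prediction unless its confidence is relatively low.
        refined.append(d_label if d_conf >= threshold else g_label)
    return refined

draft = [("B-CROP", 0.92), ("I-CROP", 0.31), ("O", 0.88)]
glyph = [("B-CROP", 0.60), ("B-DISEASE", 0.70), ("O", 0.40)]
labels = refine_labels(draft, glyph)
```

Only the second token, whose draft confidence (0.31) falls below the threshold, takes the glyph-informed label; the confident draft predictions pass through untouched.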

2.
PeerJ ; 12: e17470, 2024.
Article in English | MEDLINE | ID: mdl-38948230

ABSTRACT

TIN-X (Target Importance and Novelty eXplorer) is an interactive visualization tool for illuminating associations between diseases and potential drug targets and is publicly available at newdrugtargets.org. TIN-X uses natural language processing to identify disease and protein mentions within PubMed content using previously published tools for named entity recognition (NER) of gene/protein and disease names. Target data are obtained from the Target Central Resource Database (TCRD). Two important metrics, novelty and importance, are computed from these data; when plotted as log(importance) vs. log(novelty), they aid the user in visually exploring the novelty of drug targets and their associated importance to diseases. TIN-X Version 3.0 has been significantly improved with an expanded dataset, a modernized architecture including a REST API, and an improved user interface (UI). The dataset has been expanded to include not only PubMed publication titles and abstracts, but also full-text articles when available. This results in approximately 9-fold more target/disease associations compared to previous versions of TIN-X. Additionally, the TIN-X database containing this expanded dataset is now hosted in the cloud via Amazon RDS. Recent enhancements to the UI focus on making it more intuitive for users to find diseases or drug targets of interest while providing a new, sortable table-view mode to accompany the existing plot-view mode. UI improvements also help the user browse the associated PubMed publications to explore and understand the basis of TIN-X's predicted association between a specific disease and a target of interest. In implementing these upgrades, computational load is balanced between the web server and the user's web browser to achieve adequate performance while accommodating the expanded dataset.
Together, these advances aim to extend the duration that users can benefit from TIN-X while providing both an expanded dataset and new features that researchers can use to better illuminate understudied proteins.
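As a rough illustration of the novelty/importance plane the abstract describes, the sketch below computes coordinates for one target. The formulas are simplified assumptions for illustration, not TIN-X's published definitions.

```python
import math

# Toy sketch: novelty as the reciprocal of a target's total mention count,
# importance as its joint mention count with a disease. TIN-X plots
# log(importance) vs. log(novelty); both formulas here are assumptions.

def tinx_coordinates(target_mentions, joint_mentions):
    novelty = 1.0 / target_mentions      # rarely studied => high novelty
    importance = float(joint_mentions)   # strongly linked => high importance
    return math.log10(novelty), math.log10(importance)

# A heavily studied target with 50 disease co-mentions sits at low novelty:
log_novelty, log_importance = tinx_coordinates(target_mentions=1000,
                                               joint_mentions=50)
```

On such a plot, understudied-but-relevant targets appear toward the high-novelty, high-importance corner, which is exactly the region the tool is designed to surface.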


Subjects
User-Computer Interface, Humans, Natural Language Processing, PubMed, Software
3.
J Cheminform ; 16(1): 76, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38956728

ABSTRACT

Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. A large amount of scientific literature contains rich knowledge in the field of materials science, but manually analyzing these papers to find material-related data is a daunting task. In information processing, named entity recognition (NER) plays a crucial role as it can automatically extract entities in the field of materials science, which have significant value in tasks such as building knowledge graphs. The sequence labeling methods typically used for materials science NER (MatNER) tasks often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we propose converting the sequence labeling task into a machine reading comprehension (MRC) task. The MRC formulation can effectively extract multiple overlapping entities by recasting the task as answering multiple independent questions. Moreover, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature by integrating prior knowledge from queries. State-of-the-art (SOTA) performance was achieved on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively, using the MRC approach.
By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in the field of materials science, thus accelerating its development.

Scientific contribution: We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in the field of materials science by transforming the sequence labeling task into an MRC task; this approach provides robust support for constructing knowledge graphs and other data analysis tasks.
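The reformulation of sequence labeling as question answering can be sketched as follows. The queries are illustrative, and the gazetteer lookup is merely a stand-in for a trained MRC span extractor; the key point is that each entity type is answered independently, which is what lets overlapping spans coexist.

```python
# Hedged sketch of the MRC reformulation for NER: one query per entity type,
# answered independently, so nested/overlapping entities can coexist.
# Queries and the gazetteer-based "extractor" are illustrative assumptions.

QUERIES = {
    "MATERIAL": "Which materials are mentioned in the text?",
    "PROPERTY": "Which material properties are mentioned in the text?",
}

GAZETTEER = {
    "MATERIAL": ["Graphene oxide"],
    "PROPERTY": ["thermal conductivity"],
}

def extract_spans(text, entity_type):
    """Return (start, end, surface) character spans for one entity type."""
    spans = []
    for term in GAZETTEER[entity_type]:
        start = text.find(term)
        if start != -1:
            spans.append((start, start + len(term), term))
    return spans

text = "Graphene oxide shows high thermal conductivity."
# Each entity type is extracted by answering its own independent question:
answers = {etype: extract_spans(text, etype) for etype in QUERIES}
```

Because the two questions are answered separately, a span of one type could sit entirely inside a span of another without any conflict in the label scheme, which is the failure mode of flat BIO sequence labeling.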

4.
BMC Med Inform Decis Mak ; 24(1): 192, 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-38982465

ABSTRACT

BACKGROUND: As global aging intensifies, the prevalence of ocular fundus diseases continues to rise. In China, the strained doctor-to-patient ratio poses numerous challenges for the early diagnosis and treatment of ocular fundus diseases. To reduce the high risk of missed or misdiagnosed cases, avoid irreversible visual impairment for patients, and ensure good visual prognosis for patients with ocular fundus diseases, it is particularly important to enhance the growth and diagnostic capabilities of junior doctors. This study aims to leverage the value of electronic medical record data to develop a diagnostic intelligent decision support platform that assists junior doctors in diagnosing ocular fundus diseases quickly and accurately, expedites their professional growth, and prevents delays in patient treatment. An empirical evaluation will assess the platform's effectiveness in enhancing doctors' diagnostic efficiency and accuracy. METHODS: In this study, eight Chinese Named Entity Recognition (NER) models were compared, and the SoftLexicon-Glove-Word2vec model, achieving a high F1 score of 93.02%, was selected as the optimal recognition tool. This model was then used to extract key information from electronic medical records (EMRs) and generate feature variables based on diagnostic rule templates. Subsequently, an XGBoost algorithm was employed to construct an intelligent decision support platform for diagnosing ocular fundus diseases. The effectiveness of the platform in improving diagnostic efficiency and accuracy was evaluated through a controlled experiment comparing experienced and junior doctors. RESULTS: The use of the diagnostic intelligent decision support platform resulted in significant improvements in both diagnostic efficiency and accuracy for both experienced and junior doctors (P < 0.05). Notably, the gap in diagnostic speed and precision between junior doctors and experienced doctors narrowed considerably when the platform was used.
Although the platform also provided some benefits to experienced doctors, the improvement was less pronounced compared to junior doctors. CONCLUSION: The diagnostic intelligent decision support platform established in this study, based on the XGBoost algorithm and NER, effectively enhances the diagnostic efficiency and accuracy of junior doctors in ocular fundus diseases. This has significant implications for optimizing clinical diagnosis and treatment.


Subjects
Ophthalmologists, Humans, Clinical Decision-Making, Electronic Health Records/standards, Artificial Intelligence, China, Decision Support Systems, Clinical
5.
Sci Rep ; 14(1): 16106, 2024 Jul 12.
Article in English | MEDLINE | ID: mdl-38997330

ABSTRACT

Span-based models can effectively capture the complex entity structures in text and have thus become the mainstream approach for nested named entity recognition (Nested NER) tasks. However, traditional Span-based models decode each entity span independently. They do not consider the semantic connections between spans or the entities' positional information, which limits their performance. To address these issues, we propose a Bi-Directional Context-Aware Network (Bi-DCAN) for Nested NER. Specifically, we first design a new span-level semantic relation model. Then, the Bi-DCAN is implemented to capture this semantic relationship. Furthermore, we incorporate Rotary Position Embedding into the bi-affine mechanism to capture the relative positional information between the head and tail tokens, enabling the model to more accurately determine the position of each entity. Experimental results show that, compared to the latest model Diffusion-NER, our model uses 20M fewer parameters and increases F1 scores by 0.24 and 0.09 on the ACE2005 and GENIA datasets respectively, demonstrating an excellent ability to recognise nested entities.
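Rotary Position Embedding (RoPE), which this abstract folds into the bi-affine scorer, can be sketched in pure Python. The dimensionality and base constant below follow common convention and are assumptions, not the paper's configuration; the property being demonstrated, that rotated dot products depend only on the relative offset between positions, is the standard RoPE result.

```python
import math

# Hedged sketch of Rotary Position Embedding: consecutive (even, odd)
# feature pairs are rotated by a position-dependent angle, so the dot
# product of two rotated vectors depends only on their relative offset.
# Vector size and the 10000.0 base are conventional assumptions.

def rope(vec, position, base=10000.0):
    out = []
    for i in range(0, len(vec), 2):
        theta = position / (base ** (i / len(vec)))
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]
# Both score pairs below have relative offset 2, so they must match:
s1 = dot(rope(q, 5), rope(k, 3))
s2 = dot(rope(q, 9), rope(k, 7))
```

This relative-offset property is exactly what lets a bi-affine head/tail scorer become sensitive to how far apart the head and tail tokens are, rather than to their absolute positions.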

6.
JMIR Med Inform ; 12: e59680, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38954456

ABSTRACT

BACKGROUND: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreements between annotators due to questions such as whether modifiers or peripheral words should be annotated. If unresolved, these can induce inconsistency in the produced corpora, yet, on the other hand, strict guidelines or adjudication sessions can further prolong an already slow and convoluted process. OBJECTIVE: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, aiming to mitigate the difficulty of precisely determining entity boundaries. METHODS: We evaluate their effects through an annotation case study on a Japanese medical case report data set. We compare annotation time, annotator agreement, and the quality of the produced labeling and assess the impact on the performance of an NER system trained on the annotated corpus. RESULTS: We saw significant improvements in the labeling process efficiency, with up to a 25% reduction in overall annotation time and even a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best-performing NER model showed some drop in performance compared to the traditional annotation methodology. CONCLUSIONS: Our findings demonstrate a balance between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in the annotator's workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers.

7.
Front Plant Sci ; 15: 1368847, 2024.
Article in English | MEDLINE | ID: mdl-38984153

ABSTRACT

Introduction: The diversity of edible fungus species and the extent of mycological knowledge pose significant challenges to the research, cultivation, and popularization of edible fungi. To tackle this challenge, there is an urgent need for a rapid and accurate method of acquiring relevant information. The emergence of question and answer (Q&A) systems has the potential to solve this problem. Named entity recognition (NER) provides the basis for building an intelligent Q&A system for edible fungus. In the field of edible fungus, there is a lack of a publicly available Chinese corpus suitable for use in NER, and conventional methods struggle to capture long-distance dependencies in the NER process. Methods: This paper describes the establishment of a Chinese corpus in the field of edible fungus and introduces an NER method for edible fungus information that combines XLNet, an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF). First, leveraging the XLNet model as the foundation, an IDCNN layer is introduced. This layer addresses the limited capacity to capture features across sentences by extending the receptive field of the convolutional kernels. The output of the IDCNN layer is input to the CRF layer, which mitigates labeling logic errors, yielding the globally optimal labels for the edible fungus NER task. Results: Experimental results show that the precision achieved by the proposed model reaches 0.971, with a recall of 0.986 and an F1-score of 0.979. Discussion: The proposed model outperforms existing approaches in terms of these evaluation metrics, effectively recognizing entities related to edible fungus information and offering methodological support for the construction of knowledge graphs.
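The receptive-field argument behind the IDCNN layer can be sketched with a toy 1-D dilated convolution. The kernel weights are fixed to 1.0 purely for illustration (real layers learn them), and the sequence is a synthetic impulse, not text features.

```python
# Toy sketch of the iterated dilated CNN (IDCNN) idea: stacking 1-D
# convolutions with growing dilation widens the receptive field
# exponentially while reusing the same small kernel. Weights of 1.0 are
# an illustrative assumption; trained layers learn these weights.

def dilated_conv1d(seq, kernel, dilation):
    """Same-length 1-D convolution with zero padding at the borders."""
    out = []
    for i in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - len(kernel) // 2) * dilation
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out

seq = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]   # single activation at the center
kernel = [1.0, 1.0, 1.0]
h = seq
for dilation in (1, 2, 4):   # receptive field grows to 1 + 2*(1+2+4) = 15
    h = dilated_conv1d(h, kernel, dilation)
```

After dilations 1, 2, and 4, the single central activation has propagated to every position in the sequence, which is how the layer captures longer-distance dependencies than a plain convolution of the same depth.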

8.
JMIR Form Res ; 8: e54044, 2024 Jul 10.
Article in English | MEDLINE | ID: mdl-38986131

ABSTRACT

BACKGROUND: Machine learning has advanced medical event prediction, mostly using private data. The public MIMIC-3 (Medical Information Mart for Intensive Care III) data set, which contains detailed data on over 40,000 intensive care unit patients, stands out because it can help develop better models that include structured and textual data. OBJECTIVE: This study aimed to build and test a machine learning model using the MIMIC-3 data set to determine the effectiveness of information extracted from electronic medical record text using a named entity recognition tool, specifically QuickUMLS, for predicting important medical events. Using the prediction of extended-spectrum ß-lactamase (ESBL)-producing bacterial infections as an example, this study shows how open data sources and simple technology can be useful for making clinically meaningful predictions. METHODS: The MIMIC-3 data set, including demographics, vital signs, laboratory results, and textual data, such as discharge summaries, was used. This study specifically targeted patients diagnosed with Klebsiella pneumoniae or Escherichia coli infection. Predictions were based on ESBL-producing bacterial standards and the minimum inhibitory concentration criteria. Both the structured data and extracted patient histories were used as predictors. In total, 2 models, an L1-regularized logistic regression model and a LightGBM model, were evaluated using the receiver operating characteristic area under the curve (ROC-AUC) and the precision-recall curve area under the curve (PR-AUC). RESULTS: Of 46,520 MIMIC-3 patients, 4046 were identified with bacterial cultures, indicating the presence of K pneumoniae or E coli. After excluding patients who lacked discharge summary text, 3614 patients remained. The L1-penalized model, with variables from only the structured data, displayed a ROC-AUC of 0.646 and a PR-AUC of 0.307. The LightGBM model, combining structured and textual data, achieved a ROC-AUC of 0.707 and a PR-AUC of 0.369.
Key contributors to the LightGBM model included patient age, duration since hospital admission, and specific medical history such as diabetes. The structured data-based model showed improved performance compared to the reference models. Performance was further improved when textual medical history was included. Compared to other models predicting drug-resistant bacteria, the results of this study ranked in the middle. Some misidentifications, potentially due to the limitations of QuickUMLS, may have affected the accuracy of the model. CONCLUSIONS: This study successfully developed a predictive model for ESBL-producing bacterial infections using the MIMIC-3 data set, yielding results consistent with existing literature. This model stands out for its transparency and reliance on open data and open-named entity recognition technology. The performance of the model was enhanced using textual information. With advancements in natural language processing tools such as BERT and GPT, the extraction of medical data from text holds substantial potential for future model optimization.
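The ROC-AUC figures reported above can be read through the rank-statistic definition of AUC: the probability that a randomly chosen positive case scores above a randomly chosen negative one. The sketch below computes it on toy scores, not the study's data.

```python
# Sketch of ROC-AUC as the Mann-Whitney rank statistic: the fraction of
# (positive, negative) pairs where the positive case receives the higher
# score, counting ties as half. Labels and scores below are toy values.

def roc_auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)   # 7 of 9 positive/negative pairs are ranked correctly
```

By this reading, the LightGBM model's ROC-AUC of 0.707 means a randomly drawn ESBL-positive patient outscores a randomly drawn negative one about 71% of the time.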

9.
JMIR Form Res ; 8: e55798, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38833694

ABSTRACT

BACKGROUND: Large language models have propelled recent advances in artificial intelligence technology, facilitating the extraction of medical information from unstructured data such as medical records. Although named entity recognition (NER) is used to extract data from physicians' records, it has yet to be widely applied to pharmaceutical care records. OBJECTIVE: In this study, we aimed to investigate the feasibility of automatic extraction of the information regarding patients' diseases and symptoms from pharmaceutical care records. The verification was performed using Medical Named Entity Recognition-Japanese (MedNER-J), a Japanese disease-extraction system designed for physicians' records. METHODS: MedNER-J was applied to subjective, objective, assessment, and plan data from the care records of 49 patients who received cefazolin sodium injection at Keio University Hospital between April 2018 and March 2019. The performance of MedNER-J was evaluated in terms of precision, recall, and F1-score. RESULTS: The F1-scores of NER for subjective, objective, assessment, and plan data were 0.46, 0.70, 0.76, and 0.35, respectively. In NER and positive-negative classification, the F1-scores were 0.28, 0.39, 0.64, and 0.077, respectively. The F1-scores of NER for objective (0.70) and assessment data (0.76) were higher than those for subjective and plan data. This might be because objective and assessment data contained many technical terms, similar to the training data for MedNER-J. Meanwhile, the F1-score of NER and positive-negative classification was high for assessment data alone (F1-score=0.64), which was attributed to the similarity of its description format and contents to those of the training data. CONCLUSIONS: MedNER-J successfully read pharmaceutical care records and showed the best performance for assessment data.
However, challenges remain in analyzing records other than assessment data. Therefore, it will be necessary to reinforce the training data for subjective data in order to apply the system to pharmaceutical care records.

10.
J Biomed Inform ; 156: 104674, 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38871012

ABSTRACT

OBJECTIVE: Biomedical Named Entity Recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by considering additional external contexts. Different from prior methods that mainly use original input sequences for sequence labeling, the model takes into account additional contexts to enhance the representation of entities in the original sequences, since additional contexts can provide enhanced information for the concept explanation of biomedical entities. METHODS: To exploit an additional context, given an original input sequence, the model first retrieves the relevant sentences from PubMed and then ranks the retrieved sentences to form the contexts. It next combines the context with the original input sequence to form a new enhanced sequence. The original and new enhanced sequences are fed into PubMedBERT for learning feature representation. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. The final named entity label prediction is done by using a CRF layer. The model is jointly trained in an end-to-end manner to take advantage of the additional context for NER of the original sequence. RESULTS: Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirms the contribution of additional contexts for bio NER. CONCLUSION: The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of the recognition of biomedical entities. Second, PubMed is more appropriate than the Google search engine for providing relevant information for bio NER. Finally, more relevant sentences from the context are more beneficial than irrelevant ones in providing enhanced information for the original input sequences. The model can flexibly integrate any additional context type for the NER task.
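The retrieve-then-rank context step can be sketched as follows. The lexical-overlap scorer is a simple assumption standing in for the paper's ranker, and the `[SEP]` joining convention is borrowed from BERT-style models rather than taken from the paper.

```python
# Hedged sketch of the retrieve-then-rank context step: candidate PubMed
# sentences are scored against the input and the top-ranked ones form the
# enhanced sequence. Token-overlap scoring is an illustrative assumption.

def rank_contexts(query, candidates, k=2):
    q_tokens = set(query.lower().split())
    scored = sorted(
        candidates,
        key=lambda s: len(q_tokens & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

query = "BRCA1 mutations increase breast cancer risk"
candidates = [
    "BRCA1 is a tumor suppressor gene linked to breast cancer.",
    "The weather was mild throughout the trial period.",
    "Mutations in BRCA1 impair DNA repair.",
]
context = rank_contexts(query, candidates)
# The enhanced sequence feeds both the sentence and its context to the encoder:
enhanced = query + " [SEP] " + " ".join(context)
```

The irrelevant candidate scores zero overlap and is dropped, illustrating the abstract's finding that relevant context sentences help while irrelevant ones do not.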

11.
Artif Intell Med ; 154: 102915, 2024 Jun 19.
Article in English | MEDLINE | ID: mdl-38936309

ABSTRACT

Chinese medicine is a unique and complex medical system with complete and rich scientific theories. The textual data of Traditional Chinese Medicine (TCM) contains a large amount of relevant knowledge in the field of TCM, which can serve as guidance for accurate disease diagnosis as well as efficient disease prevention and treatment. Existing TCM texts are disorganized and lack a uniform standard. For this reason, this paper proposes a joint extraction framework that uses graph convolutional networks to perform joint entity-relation extraction on document-level TCM texts for TCM entity-relation mining. More specifically, we first finetune the pre-trained language model by using the TCM domain knowledge to obtain the task-specific model. Taking the integrity of TCM into account, we extract the complete entities as well as the relations corresponding to diagnosis and treatment from the document-level medical cases by using multiple features such as word fusion coding, TCM lexicon information, and multi-relational graph convolutional networks. The experimental results show that the proposed method outperforms the state-of-the-art methods. It achieves an F1-score of 90.7% for Named Entity Recognition and 76.14% for Relation Extraction on the TCM dataset, which significantly improves the ability to extract entity relations from TCM texts. Code is available at https://github.com/xxxxwx/TCMERE.

12.
Article in English | MEDLINE | ID: mdl-38934643

ABSTRACT

OBJECTIVE: To explore the feasibility of validating Dutch concept extraction tools using annotated corpora translated from English, focusing on preserving annotations during translation and addressing the scarcity of non-English annotated clinical corpora. MATERIALS AND METHODS: Three annotated corpora were standardized and translated from English to Dutch using 2 machine translation services, Google Translate and OpenAI GPT-4, with annotations preserved through a proposed method of embedding annotations in the text before translation. The performance of 2 concept extraction tools, MedSpaCy and MedCAT, was assessed across the corpora in both Dutch and English. RESULTS: The translation process effectively generated Dutch annotated corpora and the concept extraction tools performed similarly in both English and Dutch. Although there were some differences in how annotations were preserved across translations, these did not affect extraction accuracy. Supervised MedCAT models consistently outperformed unsupervised models, whereas MedSpaCy demonstrated high recall but lower precision. DISCUSSION: Our validation of Dutch concept extraction tools on corpora translated from English was successful, highlighting the efficacy of our annotation preservation method and the potential for efficiently creating multilingual corpora. Further improvements and comparisons of annotation preservation techniques and strategies for corpus synthesis could lead to more efficient development of multilingual corpora and accurate non-English concept extraction tools. CONCLUSION: This study has demonstrated that translated English corpora can be used to validate non-English concept extraction tools. The annotation preservation method used during translation proved effective, and future research can apply this corpus translation method to additional languages and clinical settings.
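The annotation-preservation idea, embedding annotations in the text before translation so the entity spans survive the round trip, can be sketched like this. The `[[LABEL|...]]` marker syntax is an invented illustration, not the paper's actual format.

```python
import re

# Minimal sketch of annotation preservation for machine translation:
# entity spans are wrapped in inline markers before translation and
# recovered (with character offsets) afterwards. Marker syntax is an
# assumption for illustration.

def embed(text, spans):
    """spans: (start, end, label) triples; wrap each span in [[label|...]]."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[[{label}|{text[start:end]}]]" + text[end:]
    return text

def recover(marked):
    """Strip markers, returning the plain text and the recovered spans."""
    spans, parts, cursor = [], [], 0
    for m in re.finditer(r"\[\[(\w+)\|(.*?)\]\]", marked):
        parts.append(marked[cursor:m.start()])
        start = sum(len(p) for p in parts)
        parts.append(m.group(2))
        spans.append((start, start + len(m.group(2)), m.group(1)))
        cursor = m.end()
    parts.append(marked[cursor:])
    return "".join(parts), spans

marked = embed("Patient reports chest pain.", [(16, 26, "SYMPTOM")])
text, spans = recover(marked)
```

In the real pipeline the translation step sits between `embed` and `recover`; the method relies on the translation service carrying the markers through intact, which is exactly the behavior the study evaluated across Google Translate and GPT-4.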

13.
JMIR AI ; 3: e52095, 2024 May 16.
Article in English | MEDLINE | ID: mdl-38875593

ABSTRACT

BACKGROUND: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. OBJECTIVE: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. METHODS: A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. RESULTS: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184).
Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38. CONCLUSIONS: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as, or more, important as training data volume and model parameter size.

14.
Heliyon ; 10(9): e30053, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38707358

ABSTRACT

Identifying valuable information within the extensive texts documented in natural language presents a significant challenge in various disciplines. Named Entity Recognition (NER), as one of the critical technologies in text data processing and mining, has become a current research hotspot. To accurately and objectively review the progress in NER, this paper employs bibliometric methods. It analyzes 1300 documents related to NER obtained from the Web of Science database using CiteSpace software. First, statistical analysis of the retrieved literature and journals explores the distribution characteristics of the literature. Second, the core authors in the field of NER, the development of the technology in different countries, and the leading institutions are explored by analyzing the number of publications and the cooperation network graph. Finally, the research frontiers, development tracks, and research hotspots of the field are examined, with five research frontiers and seven research hotspots discussed in depth. This paper explores the progress of NER research from both macro and micro perspectives. It aims to assist researchers in quickly grasping relevant information and offers constructive ideas and suggestions to promote the development of NER.

15.
Article in English | MEDLINE | ID: mdl-38708849

ABSTRACT

OBJECTIVES: This article aims to enhance the performance of large language models (LLMs) on the few-shot biomedical named entity recognition (NER) task by developing a simple and effective method called the Retrieving and Chain-of-Thought (RT) framework and to evaluate the improvement after applying the RT framework. MATERIALS AND METHODS: Given the remarkable advancements in retrieval-based language models and Chain-of-Thought across various natural language processing tasks, we propose a pioneering RT framework designed to amalgamate both approaches. The RT approach encompasses dedicated modules for information retrieval and Chain-of-Thought processes. In the retrieval module, RT discerns pertinent examples from demonstrations during instructional tuning for each input sentence. Subsequently, the Chain-of-Thought module employs a systematic reasoning process to identify entities. We conducted a comprehensive comparative analysis of our RT framework against 16 other models for few-shot NER tasks on the BC5CDR and NCBI corpora. Additionally, we explored the impacts of negative samples, output formats, and missing data on performance. RESULTS: Our proposed RT framework outperforms the other models on few-shot NER tasks, with micro-F1 scores of 93.50 and 91.76 on the BC5CDR and NCBI corpora, respectively. We found that using both positive and negative samples and Chain-of-Thought (rather than Tree-of-Thought) reasoning performed better. Additionally, utilization of a partially annotated dataset has a marginal effect on model performance. DISCUSSION: This is the first investigation to combine a retrieval-based LLM and Chain-of-Thought methodology to enhance performance in biomedical few-shot NER. The retrieval-based LLM aids in retrieving the most relevant examples for the input sentence, offering crucial knowledge for predicting the entities in the sentence. We also conducted a meticulous examination of our methodology, incorporating an ablation study.
CONCLUSION: The RT framework with LLM has demonstrated state-of-the-art performance on few-shot NER tasks.

16.
J Am Med Inform Assoc ; 31(7): 1569-1577, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38718216

ABSTRACT

OBJECTIVE: Social media-based public health research is crucial for epidemic surveillance, but most studies identify relevant corpora with keyword-matching. This study develops a system to streamline the process of curating colloquial medical dictionaries. We demonstrate the pipeline by curating a Unified Medical Language System (UMLS)-colloquial symptom dictionary from COVID-19-related tweets as proof of concept. METHODS: COVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The pipeline includes three modules: a named entity recognition module to detect symptoms in tweets; an entity normalization module to aggregate detected entities; and a mapping module that iteratively maps entities to Unified Medical Language System concepts. A random 500 entity samples were drawn from the final dictionary for accuracy validation. Additionally, we conducted a symptom frequency distribution analysis to compare our dictionary to a pre-defined lexicon from previous research. RESULTS: We identified 498 480 unique symptom entity expressions from the tweets. Pre-processing reduces the number to 18 226. The final dictionary contains 38 175 unique expressions of symptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom distribution analysis found that our dictionary detects more symptoms and is effective at identifying psychiatric disorders like anxiety and depression, often missed by pre-defined lexicons. CONCLUSIONS: This study advances public health research by implementing a novel, systematic pipeline for curating symptom lexicons from social media data. The final lexicon's high accuracy, validated by medical professionals, underscores the potential of this methodology to reliably interpret, and categorize vast amounts of unstructured social media data into actionable medical insights across diverse linguistic and regional landscapes.


Subjects
COVID-19, Deep Learning, Social Media, Unified Medical Language System, Humans, Public Health, Information Storage and Retrieval/methods
17.
J Med Internet Res ; 26: e52655, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38814687

ABSTRACT

BACKGROUND: Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list. This type of search, however, does not provide visual support to the task and is not suited to expressing complex queries or compensating for missing specifications. OBJECTIVE: This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19-related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. METHODS: We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications' abstracts using terms selected from the Unified Medical Language System and the Ontology of Coronavirus Infectious Disease. Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. 
RESULTS: We built a large co-occurrence network consisting of 128,249 entities and 47,198,965 relationships. The GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a globally ranked bibliography of publications, and each publication is further associated with the specific parts of the query that it explains, allowing the user to understand each aspect of the matching. CONCLUSIONS: Our approach supports query formulation and evidence search over a large text corpus; it can be reapplied to any scientific domain where document corpora and curated ontologies are available.
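A co-occurrence network of this kind can be built by keeping an edge only when the two concepts co-occur more often than chance. The sketch below uses pointwise mutual information (PMI) as that criterion; the toy corpus and the threshold are assumptions, and the paper applies its own relevance measure over its far larger annotated corpus:

```python
# Hedged sketch: build a concept co-occurrence network from per-document
# annotation sets, keeping edges whose PMI exceeds a threshold.
import math
from itertools import combinations
from collections import Counter

def build_network(annotated_docs, pmi_threshold=0.0):
    """annotated_docs: iterable of sets of concept labels, one per document.
    Returns {(concept_a, concept_b): pmi} for retained edges."""
    n = len(annotated_docs)
    concept_freq = Counter()
    pair_freq = Counter()
    for concepts in annotated_docs:
        cs = sorted(set(concepts))          # deduplicate within a document
        concept_freq.update(cs)
        pair_freq.update(combinations(cs, 2))
    edges = {}
    for (a, b), f in pair_freq.items():
        # PMI = log( P(a,b) / (P(a) P(b)) ) over the n documents
        pmi = math.log((f * n) / (concept_freq[a] * concept_freq[b]))
        if pmi > pmi_threshold:
            edges[(a, b)] = pmi
    return edges
```

Graph queries over the resulting edge set could then be matched with standard subgraph or shortest-path routines, in the spirit of the query engine the abstract describes.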


Subjects
Algorithms, COVID-19, SARS-CoV-2, COVID-19/epidemiology, Humans, Pandemics, Information Storage and Retrieval/methods, Biomedical Research/methods, Unified Medical Language System, Search Engine
18.
J Proteome Res ; 23(6): 1915-1925, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38733346

ABSTRACT

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by the fine-tuned transformers in F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86), while the BiLSTM models had higher precision (0.94) than the transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but is outperformed by the fine-tuned transformers in F1-score and recall, demonstrating their generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture choices. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Code for automated annotation and model generation is available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
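The hybrid dictionary-plus-rule annotation idea can be illustrated as follows. The tiny dictionary and the "-ase" suffix rule are simplified assumptions that will both over- and under-match relative to the authors' actual lexicon and rules:

```python
# Hedged sketch of hybrid enzyme annotation: exact dictionary matching plus
# a rule-based fallback flagging "-ase"-suffixed tokens as candidates.
import re

ENZYME_DICT = {"trypsin", "pepsin", "lysozyme"}          # illustrative lexicon
ASE_RULE = re.compile(r"\b[A-Za-z][\w-]*ase\b")          # crude suffix rule

def annotate_enzymes(text: str):
    """Return sorted (start, end, surface) spans for candidate enzyme mentions."""
    spans = set()
    lowered = text.lower()
    for name in ENZYME_DICT:                              # dictionary matching
        for m in re.finditer(re.escape(name), lowered):
            spans.add((m.start(), m.end(), text[m.start():m.end()]))
    for m in ASE_RULE.finditer(text):                     # rule-based keyword search
        spans.add((m.start(), m.end(), m.group(0)))
    return sorted(spans)
```

Spans produced this way could serve as silver-standard training labels for a fine-tuned transformer NER model, mirroring the pipeline-then-model setup the abstract reports.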


Subjects
Algorithms, Deep Learning, Enzymes, Natural Language Processing, Molecular Sequence Annotation/methods, Humans, Data Mining/methods
19.
Med Ref Serv Q ; 43(2): 196-202, 2024.
Article in English | MEDLINE | ID: mdl-38722609

ABSTRACT

Named entity recognition (NER) is a natural language processing technique, in use since the early 1990s, that applies various computing strategies to extract information from raw text. With rapid advances in AI and computing, NER models have gained significant attention and now serve as foundational tools across numerous professional domains for organizing unstructured data for research and practical applications. This is particularly evident in the medical and healthcare fields, where NER models are essential for efficiently extracting critical information from complex documents that are impractical to review manually. Despite these successes, NER still has limitations in fully comprehending the nuances of natural language. However, the development of more advanced and user-friendly models promises to significantly improve the work experience of professional users.
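As a toy illustration of what an NER system produces, the gazetteer-based tagger below turns raw clinical text into typed entity records. The vocabulary is invented, and real medical NER relies on statistical or transformer-based models rather than lookup tables:

```python
# Hedged sketch: a minimal gazetteer-based NER tagger that maps tokens to
# entity types, showing the unstructured-text -> structured-record step.
GAZETTEER = {
    "aspirin": "DRUG",
    "metformin": "DRUG",
    "hypertension": "DISEASE",
}

def tag(text: str):
    """Return (token, entity_type) pairs in order of appearance."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]
```

Even this lookup version shows why NER struggles with nuance: it cannot resolve ambiguity, negation, or multi-word mentions, which is exactly where the more advanced models discussed above earn their keep.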


Subjects
Information Storage and Retrieval, Natural Language Processing, Information Storage and Retrieval/methods, Humans, Artificial Intelligence
20.
Front Big Data ; 7: 1346958, 2024.
Article in English | MEDLINE | ID: mdl-38650693

ABSTRACT

Introduction: Acupuncture and tuina, acknowledged as ancient and highly efficacious therapeutic modalities within Traditional Chinese Medicine (TCM), have provided pragmatic treatment pathways for numerous patients. To address the ambiguity of TCM acupuncture and tuina treatment protocol concepts, the lack of accurate quantitative assessment of treatment protocols, and the diversity of TCM systems, we established a knowledge graph construction technique for the modern literature to enable personalized medical recommendations. Methods: (1) Extensive acupuncture and tuina data were collected, analyzed, and processed to establish a concise TCM domain knowledge base. (2) A template-free Chinese text NER joint training method (TemplateFC) was proposed, which enhances the EntLM model with BiLSTM and CRF layers; appropriate rules were set for entity and relation extraction (ERE). (3) A comprehensive knowledge graph comprising 10,346 entities and 40,919 relationships was constructed from the modern literature. Results: A robust TCM knowledge graph (KG) with a wide range of entities and relationships was created. The template-free joint training approach significantly improved NER accuracy, especially in Chinese text, addressing issues related to entity identification and tokenization differences. The KG provided valuable insights into acupuncture and tuina, facilitating efficient information retrieval and personalized treatment recommendations. Discussion: The integration of KGs in TCM research is essential for advancing diagnostics and interventions. Challenges in NER and ERE were effectively tackled using hybrid approaches and innovative techniques. The comprehensive TCM KG we built helps bridge the gap in TCM knowledge and serves as a valuable resource for specialists and non-specialists alike.
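The kind of triple store underlying such a knowledge graph can be sketched minimally as below. The entities and relations are invented examples for illustration, not drawn from the authors' 10,346-entity graph:

```python
# Hedged sketch: a tiny (head, relation, tail) triple store of the sort a
# TCM knowledge graph could be queried through for treatment recommendations.
from collections import defaultdict

class TinyKG:
    def __init__(self):
        self.out = defaultdict(list)            # head -> [(relation, tail)]

    def add(self, head, relation, tail):
        self.out[head].append((relation, tail))

    def query(self, head, relation):
        """Return all tails linked to `head` via `relation`."""
        return [t for r, t in self.out[head] if r == relation]

kg = TinyKG()
kg.add("insomnia", "treated_by", "acupuncture at HT7")   # invented triples
kg.add("insomnia", "treated_by", "tuina on head")
kg.add("HT7", "located_on", "heart meridian")
```

In practice the entities and relations would come from NER and ERE over the literature, and recommendation queries would combine patient attributes with multi-hop traversals rather than single lookups.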
