Results 1-20 of 428
1.
JMIR Med Inform; 12: e49997, 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39250782

ABSTRACT

BACKGROUND: A wealth of clinically relevant information is only obtainable within unstructured clinical narratives, leading to great interest in clinical natural language processing (NLP). While a multitude of approaches to NLP exist, current algorithm development approaches have limitations that can slow the development process. These limitations are exacerbated when the task is emergent, as is the case currently for NLP extraction of signs and symptoms of COVID-19 and postacute sequelae of SARS-CoV-2 infection (PASC). OBJECTIVE: This study aims to highlight the current limitations of existing NLP algorithm development approaches that are exacerbated by NLP tasks surrounding emergent clinical concepts and to illustrate our approach to addressing these issues through the use case of developing an NLP system for the signs and symptoms of COVID-19 and PASC. METHODS: We used 2 preexisting studies on PASC as a baseline to determine a set of concepts that should be extracted by NLP. This concept list was then used in conjunction with the Unified Medical Language System to autonomously generate an expanded lexicon to weakly annotate a training set, which was then reviewed by a human expert to generate a fine-tuned NLP algorithm. The annotations from a fully human-annotated test set were then compared with NLP results from the fine-tuned algorithm. The NLP algorithm was then deployed to 10 additional sites that were also running our NLP infrastructure. Of these 10 sites, 5 were used to conduct a federated evaluation of the NLP algorithm. RESULTS: An NLP algorithm consisting of 12,234 unique normalized text strings corresponding to 2366 unique concepts was developed to extract COVID-19 or PASC signs and symptoms. An unweighted mean dictionary coverage of 77.8% was found for the 5 sites. CONCLUSIONS: The evolutionary and time-critical nature of the PASC NLP task significantly complicates existing approaches to NLP algorithm development. In this work, we present a hybrid approach using the Open Health Natural Language Processing Toolkit aimed at addressing these needs with a dictionary-based weak labeling step that minimizes the need for additional expert annotation while still preserving the fine-tuning capabilities of expert involvement.
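
As a rough illustration of the dictionary-based weak labeling step, the sketch below matches a tiny hypothetical lexicon of normalized strings (mapped to UMLS-style concept IDs) against note text; the study's actual lexicon spans 12,234 strings and 2366 concepts, and the entries and note shown here are invented.

```python
# A minimal sketch of dictionary-based weak labeling, assuming a
# string-to-CUI lexicon; entries and note text are illustrative only.
import re

# Hypothetical lexicon: normalized surface strings mapped to UMLS-style CUIs.
lexicon = {
    "shortness of breath": "C0013404",
    "loss of taste": "C0235290",
    "brain fog": "C5400073",
    "fatigue": "C0015672",
}

def weak_annotate(note: str, lexicon: dict) -> list:
    """Return (start, end, matched_text, cui) spans for every lexicon hit."""
    spans = []
    for surface, cui in lexicon.items():
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", note.lower()):
            spans.append((m.start(), m.end(), surface, cui))
    return sorted(spans)

note = "Patient reports persistent fatigue and brain fog since COVID-19."
for start, end, text, cui in weak_annotate(note, lexicon):
    print(f"{start:>3}-{end:<3} {text!r} -> {cui}")
```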

2.
Metab Eng; 86: 1-11, 2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39233197

ABSTRACT

There have been significant advances in literature mining, allowing for the extraction of target information from the literature. However, biological literature often includes biological pathway images that are difficult to extract in an easily editable format. To address this challenge, this study aims to develop a machine learning framework called "Extraction of Biological Pathway Information" (EBPI). The framework automates the search for relevant publications, extracts biological pathway information from images within the literature, including genes, enzymes, and metabolites, and generates the output in a tabular format. To do this, the framework determines the direction of biochemical reactions and detects and classifies text within biological pathway images. The performance of EBPI was evaluated by comparing the extracted pathway information with manually curated pathway maps. EBPI will be useful for extracting biological pathway information from the literature in a high-throughput manner, and can be used for pathway studies, including metabolic engineering.

3.
J Biomed Inform; 157: 104720, 2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39233209

ABSTRACT

BACKGROUND: In oncology, electronic health records contain key textual information for the diagnosis, staging, and treatment planning of patients with cancer. However, processing text data requires a lot of time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research. In particular, extracting the information required for the pathological stage from surgical pathology reports can be used to update cancer staging according to the latest cancer staging guidelines. OBJECTIVES: This study has two main objectives. The first is to evaluate the performance of extracting information from text-based surgical pathology reports and determining pathological stages based on the extracted information using fine-tuned generative language models (GLMs) for patients with lung cancer. The second is to determine the feasibility of utilizing relatively small GLMs for information extraction in a resource-constrained computing environment. METHODS: Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard validated by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs, which were then used to extract the information required for staging from pathology reports. RESULTS: We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. The Deductive Mistral-7B model, pre-trained with the deductive dataset, showed the best performance overall, with an exact match ratio of 92.24% on the information extraction problem and an accuracy of 0.9876 (predicting T and N classification concurrently) in classification. CONCLUSION: This study demonstrated that training GLMs with deductive datasets can improve information extraction performance, and that GLMs with a relatively small number of parameters, approximately seven billion, can achieve high performance on this problem. The proposed GLM-based information extraction method is expected to be useful for clinical decision support, lung cancer staging, and research.
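
A minimal sketch of how such prompt-response pairs might be assembled from a report and its gold-standard descriptors; the descriptor names and report text below are invented for illustration, not SNUBH data.

```python
# Assembling instruction-tuning pairs from a report and gold annotations.
# Descriptor names, values, and the report are hypothetical examples.
import json

report = ("Invasive adenocarcinoma, 2.8 cm, visceral pleura invasion "
          "present, 0/12 nodes involved.")
gold = {"tumor_size_cm": "2.8", "pleural_invasion": "present",
        "positive_nodes": "0"}

pairs = []
for descriptor, value in gold.items():
    pairs.append({
        "prompt": f"Report: {report}\nExtract the value of '{descriptor}'.",
        "response": value,
    })

# One JSON object per line, the common format for fine-tuning datasets.
with open("train_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```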

4.
Article in English | MEDLINE | ID: mdl-39225779

ABSTRACT

OBJECTIVE: Clinical notes contain unstructured representations of patient histories, including the relationships between medical problems and prescription drugs. To investigate the relationship between cancer drugs and their associated symptom burden, we extract structured, semantic representations of medical problem and drug information from the clinical narratives of oncology notes. MATERIALS AND METHODS: We present Clinical concept Annotations for Cancer Events and Relations (CACER), a novel corpus with fine-grained annotations for over 48,000 medical problem and drug events and 10,000 drug-problem and problem-problem relations. Leveraging CACER, we develop and evaluate transformer-based information extraction models such as Bidirectional Encoder Representations from Transformers (BERT), Fine-tuned Language Net Text-To-Text Transfer Transformer (Flan-T5), Large Language Model Meta AI (Llama3), and Generative Pre-trained Transformers-4 (GPT-4) using fine-tuning and in-context learning (ICL). RESULTS: In event extraction, the fine-tuned BERT and Llama3 models achieved the highest performance at 88.0-88.2 F1, comparable to the inter-annotator agreement (IAA) of 88.4 F1. In relation extraction, the fine-tuned BERT, Flan-T5, and Llama3 achieved the highest performance at 61.8-65.3 F1. GPT-4 with ICL achieved the worst performance across both tasks. DISCUSSION: The fine-tuned models significantly outperformed GPT-4 with ICL, highlighting the importance of annotated training data and model optimization. Furthermore, the BERT models performed similarly to Llama3; for our task, large language models offer no performance advantage over the smaller BERT models. CONCLUSIONS: We introduce CACER, a novel corpus with fine-grained annotations for medical problems, drugs, and their relationships in the clinical narratives of oncology notes. State-of-the-art transformer models achieved performance comparable to IAA for several extraction tasks.
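
For orientation, the sketch below shows the general shape of a fine-tuned BERT event extractor framed as BIO token classification; the label set and example sentence are illustrative assumptions, far simpler than CACER's actual schema, and the model here is untuned.

```python
# Event extraction as token classification: each token gets a BIO label
# over problem/drug types. Labels and text are invented examples.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PROBLEM", "I-PROBLEM", "B-DRUG", "I-DRUG"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

text = "Started cisplatin for stage III NSCLC; reports severe nausea."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
pred = [labels[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), pred)))
```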

5.
Heliyon; 10(12): e32479, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-39183851

ABSTRACT

Numerous methods and pipelines have recently emerged for the automatic extraction of knowledge graphs from documents such as scientific publications and patents. However, adapting these methods to alternative text sources such as micro-blogging posts and news has proven challenging, as they struggle to model the open-domain entities and relations typically found in those sources. In this paper, we propose an enhanced information extraction pipeline tailored to building a knowledge graph of open-domain entities from micro-blogging posts on social media platforms. Our pipeline leverages dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over word embeddings. We provide a use case on extracting semantic triples from a corpus of 100,000 tweets about digital transformation and publicly release the generated knowledge graph. On the same dataset, we conduct two experimental evaluations, showing that the system produces triples with precision over 95% and outperforms similar pipelines by around 5 percentage points in precision, while generating a comparatively higher number of triples.
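
A minimal sketch of the two stages described: dependency-based triple extraction with spaCy, followed by unsupervised grouping of relation verbs via hierarchical clustering over word embeddings. The tweets and the clustering threshold are illustrative assumptions, not the paper's configuration.

```python
# Stage 1: extract (subject, verb, object) triples via dependency parsing.
# Stage 2: cluster relation verbs by embedding similarity (unsupervised).
import numpy as np
import spacy
from scipy.cluster.hierarchy import fcluster, linkage

nlp = spacy.load("en_core_web_md")  # medium model ships word vectors
tweets = [
    "Our company adopts cloud computing.",
    "The bank embraces digital transformation.",
    "Retailers deploy AI chatbots.",
]

triples = []
for doc in nlp.pipe(tweets):
    for tok in doc:
        if tok.dep_ == "ROOT" and tok.pos_ == "VERB":
            subj = [c for c in tok.children if c.dep_ == "nsubj"]
            obj = [c for c in tok.children if c.dep_ == "dobj"]
            if subj and obj:
                triples.append((subj[0].text, tok, obj[0].text))

# Average-linkage hierarchical clustering; the cut threshold is arbitrary.
vecs = np.array([verb.vector for _, verb, _ in triples])
labels = fcluster(linkage(vecs, method="average"), t=0.8, criterion="distance")
for (s, v, o), c in zip(triples, labels):
    print(f"cluster {c}: ({s}, {v.text}, {o})")
```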

6.
Stud Health Technol Inform; 316: 1669-1673, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176531

ABSTRACT

BACKGROUND: Rapid technical progress in clinical Natural Language Processing and information extraction (IE) has created challenges for the comparability and replicability of studies. AIM: This paper proposes a reporting guideline to standardize the description of methodologies and outcomes in studies involving IE from clinical texts. METHODS: The guideline was developed from the experience gained during data extraction for a previously conducted scoping review of IE from free-text radiology reports comprising 34 studies. RESULTS: The guideline comprises five top-level categories: information model, architecture, data, annotation, and outcomes. In total, we define 28 aspects related to these categories to be reported on in IE studies. CONCLUSIONS: The proposed guideline is expected to set a standard for reporting in studies describing IE from clinical text and to promote uniformity across the research field. Future technological advancements may make regular updates of the guideline necessary. In future research, we plan to develop a taxonomy that clearly defines corresponding value sets and to integrate both the guideline and the taxonomy following a consensus-based methodology.


Subjects
Natural Language Processing, Humans, Guidelines as Topic, Information Storage and Retrieval/standards
7.
Stud Health Technol Inform; 316: 171-175, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176700

ABSTRACT

Integrating the free-text reports written by physicians into an interoperable standard is important for improving patient-centric care and research in the medical domain. For unstructured clinical data, NLP-based information extraction serves to find information in unstructured text. To the best of our knowledge, there is no efficient solution for inserting the named entities extracted by an NLP pipeline into openEHR compositions ad hoc. We therefore developed a software solution that solves this data integration problem by mapping the named entities of an NLP pipeline to the fields of an openEHR template. The mapping can be performed by any user without programming and enables the ad-hoc creation of a composition based on the mappings.
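
A minimal sketch of the mapping idea under stated assumptions: a user-configured table relates NLP entity types to openEHR template paths and drives ad-hoc composition building. The paths, entity types, and values below are illustrative inventions, not the tool's actual configuration.

```python
# User-configurable mapping from entity types to openEHR template paths;
# everything shown (paths, entities) is a hypothetical example.
import json

mapping = {
    "DIAGNOSIS": "content[openEHR-EHR-EVALUATION.problem_diagnosis.v1]"
                 "/data/items[at0002]/value",
    "MEDICATION": "content[openEHR-EHR-INSTRUCTION.medication_order.v3]"
                  "/activities[at0001]/description/items[at0070]/value",
}

# Named entities as an upstream NLP pipeline might emit them.
entities = [{"type": "DIAGNOSIS", "text": "type 2 diabetes"},
            {"type": "MEDICATION", "text": "metformin"}]

# Build a flat composition ad hoc from the configured mapping.
composition = {mapping[e["type"]]: e["text"]
               for e in entities if e["type"] in mapping}
print(json.dumps(composition, indent=2))
```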


Subjects
Natural Language Processing, Semantics, Electronic Health Records, Software, Humans, Information Storage and Retrieval/methods
8.
Stud Health Technol Inform; 316: 214-215, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176711

ABSTRACT

Automatic extraction of body text from clinical PDF documents is necessary to enhance downstream NLP tasks but remains a challenge. This study presents an unsupervised algorithm designed to extract body text by leveraging a large volume of data. Using DBSCAN clustering over aggregated pages, our method extracts and organizes text blocks based on their content and coordinates. Evaluation results demonstrate precision scores ranging from 0.82 to 0.98, recall scores from 0.62 to 0.94, and F1-scores from 0.71 to 0.96 across various medical specialty sources. Future work includes dynamic parameter adjustment for improved accuracy and the use of larger datasets.
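
One plausible reading of the clustering step, sketched below under stated assumptions: DBSCAN over block coordinates pooled across pages, so recurring header and footer blocks form dense clusters while body blocks remain as noise. The coordinates and parameters are invented, and the abstract's content features are omitted.

```python
# DBSCAN over pooled text-block coordinates: dense clusters suggest
# repeated boilerplate regions; sparse "noise" points suggest body text.
import numpy as np
from sklearn.cluster import DBSCAN

# (x0, y0) top-left corners of text blocks pooled from many pages.
blocks = np.array([
    [50, 40], [52, 41], [51, 39],      # recurring header region
    [50, 780], [49, 782], [51, 779],   # recurring footer region
    [60, 200], [58, 350], [62, 500],   # body text, spatially dispersed
])

labels = DBSCAN(eps=5.0, min_samples=3).fit_predict(blocks)
body = blocks[labels == -1]  # noise points = candidate body text
print(labels)
print(body)
```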


Subjects
Natural Language Processing, Algorithms, Data Mining/methods, Humans, Electronic Health Records, Unsupervised Machine Learning
9.
Stud Health Technol Inform; 316: 685-689, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176835

ABSTRACT

With cancer being a leading cause of death globally, epidemiological and clinical cancer registration is paramount for enhancing oncological care and facilitating scientific research. However, the heterogeneous landscape of medical data presents significant challenges to the current manual process of tumor documentation. This paper explores the potential of Large Language Models (LLMs) for transforming unstructured medical reports into the structured format mandated by the German Basic Oncology Dataset. Our findings indicate that integrating LLMs into existing hospital data management systems or cancer registries can significantly enhance the quality and completeness of cancer data collection - a vital component for diagnosing and treating cancer and improving the effectiveness and benefits of therapies. This work contributes to the broader discussion on the potential of artificial intelligence or LLMs to revolutionize medical data processing and reporting in general and cancer care in particular.


Subjects
Electronic Health Records, Natural Language Processing, Neoplasms, Germany, Humans, Neoplasms/therapy, Registries, Artificial Intelligence, Medical Oncology, Data Accuracy
10.
Stud Health Technol Inform; 316: 899-903, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176938

ABSTRACT

Open-source, lightweight, and offline generative large language models (LLMs) hold promise for clinical information extraction because they can operate in secured environments on commodity hardware without token costs. By creating a simple lupus nephritis (LN) renal histopathology annotation schema and generating gold standard data, this study investigates prompt-based strategies using three state-of-the-art lightweight LLMs, namely BioMistral-DARE-7B (BioMistral), Llama-2-13B (Llama 2), and Mistral-7B-instruct-v0.2 (Mistral). We examine the performance of these LLMs in a zero-shot setting for renal histopathology report information extraction. Across four prompting strategies, combining batch prompt (BP), single task prompt (SP), chain of thought (CoT), and standard simple prompt (SSP), our findings indicate that both Mistral and BioMistral consistently outperformed Llama 2. Mistral recorded the highest performance, achieving an F1-score of 0.996 [95% CI: 0.993, 0.999] for extracting the numbers of various subtypes of glomeruli across all BP settings and 0.898 [95% CI: 0.871, 0.921] for extracting relational values of immune markers under the BP+SSP setting. This study underscores the capability of offline LLMs to provide accurate and secure clinical information extraction, serving as a promising alternative to their heavyweight online counterparts.
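
To make the strategy names concrete, the sketch below contrasts a single task prompt (SP), one question per model call, with a batch prompt (BP), several schema fields packed into one call. The schema fields and report text are invented, and any local LLM could consume the resulting strings.

```python
# Constructing SP vs. BP prompts; field names and report are hypothetical.
fields = ["number_of_glomeruli", "global_sclerosis_count", "C3_intensity"]
report = ("Twelve glomeruli sampled; two globally sclerosed. "
          "IF: C3 2+ granular.")

def single_task_prompts(report, fields):
    """SP: one prompt (and one model call) per schema field."""
    return [f"From the renal biopsy report below, extract '{f}'.\n\n{report}"
            for f in fields]

def batch_prompt(report, fields):
    """BP: a single prompt covering every schema field at once."""
    wanted = "\n".join(f"- {f}" for f in fields)
    return (f"From the renal biopsy report below, extract every field "
            f"listed, one per line as field: value.\n{wanted}\n\n{report}")

print(batch_prompt(report, fields))         # one call for all fields (BP)
print(single_task_prompts(report, fields))  # one call per field (SP)
```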


Subjects
Lupus Nephritis, Natural Language Processing, Lupus Nephritis/pathology, Humans, Electronic Health Records, Data Mining/methods, Information Storage and Retrieval/methods
11.
Stud Health Technol Inform; 316: 909-913, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176940

ABSTRACT

Electronic Health Records (EHRs) contain a wealth of unstructured patient data, making it challenging for physicians to make informed decisions. In this paper, we introduce a Natural Language Processing (NLP) approach for extracting therapies, diagnoses, and symptoms from the ambulatory EHRs of patients with chronic lupus. We demonstrate a comprehensive pipeline in which a rule-based system is combined with text segmentation, transformer-based topic analysis, and a clinical ontology to enhance text preprocessing and automate rule identification. Our approach is applied to a subcohort of 56 patients, with a total of 750 EHRs written in Italian, achieving accuracy above 97% and an F-score above 90% in the three extracted domains. This work has the potential to be integrated with EHR systems to automate information extraction, minimizing human intervention and providing personalized digital solutions in the chronic lupus disease domain.


Subjects
Electronic Health Records, Lupus Erythematosus, Systemic, Natural Language Processing, Humans, Chronic Disease, Data Mining/methods
12.
Stud Health Technol Inform; 316: 1775-1779, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176561

ABSTRACT

Hand-labelling clinical corpora can be costly and inflexible, requiring re-annotation every time new classes need to be extracted. PICO (Participant, Intervention, Comparator, Outcome) information extraction can expedite conducting systematic reviews to answer clinical questions. However, PICO frequently extends to other entities such as Study type and design, trial context, and timeframe, requiring manual re-annotation of existing corpora. In this paper, we adapt Snorkel's weak supervision methodology to extend clinical corpora to new entities without extensive hand labelling. Specifically, we enrich the EBM-PICO corpus with new entities through an example of "Study type and design" extraction. Using weak supervision, we obtain programmatic labels on 4,081 EBM-PICO documents, achieving an F1-score of 85.02% on the test set.
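
A minimal sketch of the Snorkel-style weak supervision step under stated assumptions: heuristic labeling functions vote on whether a sentence states the study design, and a label model combines the votes into programmatic labels. The patterns and sentences are illustrative, not the paper's actual labeling functions.

```python
# Weak supervision with Snorkel: labeling functions vote, LabelModel
# aggregates. Keyword patterns here are invented examples.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, STUDY_TYPE = -1, 0, 1

@labeling_function()
def lf_rct_phrase(x):
    return (STUDY_TYPE if "randomized controlled trial" in x.text.lower()
            else ABSTAIN)

@labeling_function()
def lf_design_keywords(x):
    kws = ("double-blind", "placebo-controlled", "crossover", "cohort study")
    return STUDY_TYPE if any(k in x.text.lower() for k in kws) else ABSTAIN

df = pd.DataFrame({"text": [
    "We conducted a double-blind randomized controlled trial.",
    "Participants received 10 mg daily for six weeks.",
]})
L = PandasLFApplier([lf_rct_phrase, lf_design_keywords]).apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=100)
print(label_model.predict(L))  # programmatic labels, no hand annotation
```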


Subjects
Information Storage and Retrieval, Systematic Reviews as Topic, Humans, Data Mining/methods, Information Storage and Retrieval/methods, Natural Language Processing
13.
Proc Biol Sci; 291(2027): 20240423, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39082244

ABSTRACT

In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.
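
As a concrete entry point to the first paradigm mentioned above (frequency-based approaches), the sketch below computes TF-IDF term weights over a few invented abstracts; such vectors can then be screened, clustered, or fed to downstream models.

```python
# TF-IDF as a simple frequency-based text-mining representation.
# The abstracts are invented one-liners for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "Warming alters phenology of alpine pollinators.",
    "Predator-prey dynamics shift under drought stress.",
    "Pollinator visitation declines with habitat loss.",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)   # (documents x terms) sparse matrix
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```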


Subjects
Biological Evolution, Data Mining, Ecology, Natural Language Processing, Ecology/methods, Machine Learning
14.
Article in English | MEDLINE | ID: mdl-39001795

ABSTRACT

OBJECTIVES: Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from the clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression. MATERIALS AND METHODS: A gold standard dataset was created by manually annotating 570 randomly sampled clinical note documents from adSLEEP, a corpus of 192,000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problems, poor sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset. RESULTS: The annotated dataset of 482 patients comprised a predominantly White (89.2%), older adult population with an average age of 84.7 years; females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). The rule-based NLP algorithm achieved the best F1 performance across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV for daytime sleepiness (1.00) and sleep duration (1.00), the machine learning models had the highest PPV for napping (0.95) and poor sleep quality (0.86), and Llama 2 with fine-tuning had the highest PPV for night wakings (0.93) and sleep problems (0.89). DISCUSSION: Although sleep information is infrequently documented in clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not perform well, owing to the scarcity of sleep information in the training data. CONCLUSION: The results show that the rule-based NLP algorithm consistently achieved the best performance for all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.
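
A minimal sketch in the spirit of the rule-based extractor: one regular expression per sleep concept, applied to each note. The patterns and note are illustrative assumptions, not the study's actual rules.

```python
# One regex per sleep concept; patterns are hypothetical examples.
import re

rules = {
    "snoring": re.compile(r"\bsnor(?:es?|ing)\b", re.I),
    "napping": re.compile(r"\bnap(?:s|ping)?\b", re.I),
    "daytime_sleepiness": re.compile(
        r"\b(?:daytime sleepiness|somnolence)\b", re.I),
    "sleep_duration": re.compile(
        r"\bsleeps?\s+(\d{1,2})\s*(?:hours|hrs?)\b", re.I),
}

note = ("Wife reports loud snoring; patient naps twice daily "
        "and sleeps 5 hours.")
for concept, pattern in rules.items():
    m = pattern.search(note)
    if m:
        # Capture a value when the rule defines a group (e.g., duration).
        value = m.group(1) if pattern.groups else m.group(0)
        print(f"{concept}: {value}")
```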

15.
Artif Intell Med; 154: 102924, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38964194

ABSTRACT

BACKGROUND: Radiology reports are typically written in a free-text format, making clinical information difficult to extract and use. Recently, the adoption of structured reporting (SR) has been recommended by various medical societies thanks to the advantages it offers, e.g., standardization, completeness, and information retrieval. We propose a pipeline to extract information from Italian free-text radiology reports that fits the items of the reference SR registry proposed by a national society of interventional and medical radiology, focusing on CT staging of patients with lymphoma. METHODS: Our work leverages Natural Language Processing and Transformer-based models for automatic SR registry filling. With 174 Italian radiology reports available, we investigate a rule-free generative question answering approach based on IT5, the Italian-specific version of T5. To address discrepancies in information content, we focus on the six most frequently filled items in the annotations made on the reports: three categorical (multichoice), one free-text (free-text), and two continuous numerical (factual). In the preprocessing phase, we also encode information that is not supposed to be entered. Two strategies (batch-truncation and ex-post combination) are implemented to comply with IT5's context length limitations. Performance is evaluated in terms of strict accuracy, F1, and format accuracy, and compared with the widely used GPT-3.5 large language model. Unlike multichoice and factual answers, free-text answers do not have a one-to-one correspondence with their reference annotations. For this reason, we collect human-expert feedback on the similarity between medical annotations and generated free-text answers, using a 5-point Likert scale questionnaire (evaluating correctness and completeness). RESULTS: The combination of fine-tuning and batch splitting allows IT5 with ex-post combination to achieve notable results in extracting different types of structured data, performing on par with GPT-3.5. Human assessment scores of free-text answers show a high correlation with the AI performance metric F1 (Spearman's correlation coefficients > 0.5, p-values < 0.001) for both IT5 with ex-post combination and GPT-3.5. The latter is better at generating plausible, human-like statements, even though it systematically provides answers when none should be given. CONCLUSIONS: In our experimental setting, a fine-tuned Transformer-based model with a modest number of parameters (IT5, 220M) performs well as a clinical information extraction system for the automatic SR registry filling task. It can extract information from more than one place in the report, elaborating it in a manner that complies with the response specifications provided by the SR registry (for multichoice and factual items) or that closely approximates the work of a human expert (free-text items), with the ability to discern when an answer should or should not be given to a user query.
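
A minimal sketch of the generative question answering interface described: a registry-item question is paired with the report text and a seq2seq model generates the answer. Here the untuned public t5-small checkpoint stands in for the fine-tuned IT5 model, purely to show the interface; the report and question are invented.

```python
# Generative QA interface: "question: ... context: ..." in, answer out.
# t5-small is a stand-in; the study fine-tuned an Italian IT5 checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

report = ("CT: multiple enlarged mediastinal nodes; spleen 14 cm, "
          "no focal lesions.")
question = "What is the spleen size in cm?"
inputs = tok(f"question: {question} context: {report}", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```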


Subjects
Natural Language Processing, Humans, Radiology Information Systems/organization & administration, Radiology Information Systems/standards, Italy, Electronic Health Records/standards
16.
Sci Total Environ; 949: 174948, 2024 Nov 01.
Article in English | MEDLINE | ID: mdl-39059647

ABSTRACT

Flood disasters cause significant casualties and economic losses worldwide each year. During disasters, accurate and timely information is crucial for disaster management. However, remote sensing cannot balance temporal and spatial resolution, and the coverage of specialized equipment is limited, making continuous monitoring challenging. Real-time disaster-related information shared by social media users offers new possibilities for monitoring. We propose a framework for extracting and analyzing flood information from social media, validated on the 2018 Shouguang flood in China. The framework combines deep learning techniques with regular expression matching to automatically extract key flood-related information from Weibo texts, such as problems, flooding, needs, rescues, and measures, achieving an accuracy of 83% and surpassing traditional models such as the Biterm Topic Model (BTM). In the spatiotemporal analysis, our research identifies critical time points during the disaster through quantitative analysis of the information and explores the spatial distribution of calls for help using Kernel Density Estimation (KDE), followed by identifying the core affected areas with the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm. For semantic analysis, we adopt the Latent Dirichlet Allocation (LDA) algorithm to perform topic modeling on Weibo texts from different regions, identifying the types of disaster impact affecting each township. Additionally, through correlation analysis, we investigate the relationship between disaster rescue requests and response measures to evaluate the adequacy of flood response in each township. The results demonstrate that this analytical framework can accurately extract disaster information, precisely identify critical time points in flood disasters, locate core affected areas, uncover primary regional issues, and validate the sufficiency of response measures, thereby enhancing the efficiency of disaster information collection and analysis.
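
A minimal sketch of the framework's semantic-analysis step: LDA topic modeling over short post texts, as would be done per region; the posts below are invented English stand-ins for Weibo data, and the topic count is arbitrary.

```python
# LDA topic modeling over short disaster-related posts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "water entering houses near the river need sandbags",
    "road flooded cars stranded need rescue boats",
    "power outage after flooding pumps not working",
    "rescue team arrived boats evacuating residents",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")  # top terms characterize each topic
```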

17.
JMIR Med Inform; 12: e59680, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38954456

ABSTRACT

BACKGROUND: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreement between annotators, owing to questions such as whether modifiers or peripheral words should be annotated. Left unresolved, these disagreements induce inconsistency in the produced corpora; on the other hand, strict guidelines or adjudication sessions further prolong an already slow and convoluted process. OBJECTIVE: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, which aim to mitigate the difficulty of precisely determining entity boundaries. METHODS: We evaluate their effects through an annotation case study on a Japanese medical case report dataset. We compare annotation time, annotator agreement, and the quality of the produced labeling, and assess the impact on the performance of an NER system trained on the annotated corpus. RESULTS: We observed significant improvements in the efficiency of the labeling process, with up to a 25% reduction in overall annotation time and a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best NER model showed some drop in performance compared to the traditional annotation methodology. CONCLUSIONS: Our findings reveal a trade-off between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in the annotator's workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers.

18.
Sci Rep; 14(1): 14994, 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38951207

ABSTRACT

Manually extracted agricultural phenotype information exhibits high subjectivity and low accuracy, while information extracted from images is susceptible to interference from haze. Moreover, existing agricultural image dehazing methods are limited by unclear texture details and poor color representation in the dehazed images. To address these limitations, we propose AgriGAN (unpaired image dehazing via a cycle-consistent generative adversarial network) for enhancing dehazing performance in agricultural plant phenotyping. The algorithm incorporates an atmospheric scattering model to improve the discriminator and employs a whole-detail consistent discrimination approach to enhance discriminator efficiency, thereby accelerating convergence towards the Nash equilibrium of the adversarial network. Finally, training with the adversarial loss plus a cycle-consistency loss yields clear images after dehazing. Experimental evaluations and comparative analysis demonstrate improved accuracy in dehazing agricultural images while preserving detailed texture information and mitigating color deviation.
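
For reference, the atmospheric scattering model that such discriminators build on expresses a hazy image I as I = J*t + A*(1 - t), with clear scene J, transmission t, and global airlight A. The sketch below synthesizes a hazy observation from a clear image; the depth map and parameters are invented values.

```python
# Atmospheric scattering model: I = J*t + A*(1 - t), t = exp(-beta*depth).
import numpy as np

rng = np.random.default_rng(0)
J = rng.random((4, 4, 3))            # clear image, values in [0, 1]
depth = np.linspace(0.5, 3.0, 16).reshape(4, 4, 1)
beta, A = 0.8, 0.9                   # scattering coefficient, airlight
t = np.exp(-beta * depth)            # transmission falls with depth
I = J * t + A * (1.0 - t)            # hazy observation
print(I.min(), I.max())
```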

19.
J Biomed Inform; 156: 104674, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38871012

ABSTRACT

OBJECTIVE: Biomedical Named Entity Recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by considering additional external contexts. Unlike prior methods that mainly use the original input sequence for sequence labeling, the model takes additional contexts into account to enhance the representation of entities in the original sequence, since additional contexts can provide complementary information about biomedical concepts. METHODS: To exploit an additional context, given an original input sequence, the model first retrieves relevant sentences from PubMed and then ranks the retrieved sentences to form the context. It next combines the context with the original input sequence to form a new, enhanced sequence. The original and enhanced sequences are fed into PubMedBERT to learn feature representations. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. Final named entity label prediction is done using a CRF layer. The model is jointly trained in an end-to-end manner so that the additional context benefits NER on the original sequence. RESULTS: Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirm the contribution of additional contexts to bio NER. CONCLUSION: The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of biomedical entity recognition. Second, PubMed is more appropriate than the Google search engine for providing relevant information for bio NER. Finally, more relevant context sentences are more beneficial than irrelevant ones for enhancing the original input sequences. The model can flexibly integrate any additional context type for the NER task.
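
A minimal sketch of the tagging architecture described (PubMedBERT encoder, BiLSTM feature layer, CRF decoder), omitting the PubMed retrieval and ranking stage; the checkpoint id, hidden size, and label count are assumptions for illustration.

```python
# PubMedBERT -> BiLSTM -> CRF tagging skeleton (retrieval stage omitted).
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf
from transformers import AutoModel, AutoTokenizer

# Checkpoint id may differ; this is the commonly used abstracts model.
NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(NAME)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)           # fine-grained sequential features
        emissions = self.proj(h)      # per-token tag scores
        return self.crf.decode(emissions, mask=attention_mask.bool())

tok = AutoTokenizer.from_pretrained(NAME)
enc = tok("BRCA1 mutations raise breast cancer risk.", return_tensors="pt")
model = BertBiLstmCrf(num_tags=5)  # e.g., BIO over two entity types
print(model(enc["input_ids"], enc["attention_mask"]))
```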


Subjects
Natural Language Processing, PubMed, Humans, Algorithms, Data Mining/methods, Semantics, Medical Informatics/methods
20.
PeerJ Comput Sci; 10: e2004, 2024.
Article in English | MEDLINE | ID: mdl-38855202

ABSTRACT

This article presents a semantic web-based solution for automatically extracting relevant information from the annual financial reports of banks and financial institutions and presenting it in a queryable form through a knowledge graph. The information in these reports is in high demand among stakeholders making key investment decisions. However, it is available only in an unstructured format, making it complex and challenging to understand and query, whether manually or through digital systems. A further challenge is the variation in terminology across the financial reports of different banks and financial institutions. The solution presented in this article takes an ontological approach to standardizing the terminology of this domain. It further addresses semantic differences so as to extract relevant data sharing common semantics, which are then represented as a knowledge graph to make the information understandable and queryable. Our results highlight the use of the knowledge graph in search engines, recommender systems, and question-answering (Q-A) systems. This financial knowledge graph can also serve the task of financial storytelling. The proposed solution is implemented and tested on datasets from various banks, and the results are presented through answers to competency questions evaluated on precision and recall.
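
A minimal sketch of the final artifact under stated assumptions: report facts represented as RDF triples under an invented namespace and queried with SPARQL, the mechanism behind competency-question evaluation; the properties and figures are illustrative, not the paper's ontology.

```python
# Building and querying a tiny financial knowledge graph with rdflib.
from rdflib import Graph, Literal, Namespace, RDF

FIN = Namespace("http://example.org/fin#")  # hypothetical namespace
g = Graph()
g.bind("fin", FIN)

g.add((FIN.AlphaBank, RDF.type, FIN.Bank))
g.add((FIN.AlphaBank, FIN.netInterestIncome, Literal(1200)))
g.add((FIN.AlphaBank, FIN.fiscalYear, Literal(2023)))

# A competency question expressed as SPARQL.
q = """
SELECT ?bank ?nii WHERE {
  ?bank a fin:Bank ; fin:netInterestIncome ?nii .
}
"""
for bank, nii in g.query(q, initNs={"fin": FIN}):
    print(bank, nii)
```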
