Results 1 - 20 of 426
1.
Proc Biol Sci ; 291(2027): 20240423, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39082244

ABSTRACT

In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.
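As a concrete entry point to the first paradigm the review describes, the snippet below sketches a frequency-based (TF-IDF) representation of ecological texts with scikit-learn; the example abstracts are invented placeholders, not data from the review.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for ecological abstracts (placeholders, not real data).
abstracts = [
    "Warming altered species richness and community composition in alpine plots.",
    "Seed dispersal by birds increased gene flow between fragmented forest patches.",
    "Drought reduced pollinator visitation rates and plant reproductive success.",
]

# Frequency-based paradigm: represent each document by weighted word counts (TF-IDF).
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(abstracts)          # documents x terms sparse matrix

# The highest-weighted terms per document give a quick, transparent content summary.
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(X.toarray()):
    top = row.argsort()[::-1][:3]
    print(f"doc {i}:", [terms[j] for j in top])
```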


Subjects
Biological Evolution, Data Mining, Ecology, Natural Language Processing, Ecology/methods, Machine Learning
2.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35849019

ABSTRACT

Medical Dialogue Information Extraction (MDIE) is a promising task for modern medical care systems, which greatly facilitates the development of many real-world applications such as electronic medical record generation, automatic disease diagnosis, etc. Recent methods have achieved considerable performance on Chinese MDIE but still suffer from inherent limitations, such as poor exploitation of the inter-dependencies among multiple utterances and weak discrimination of hard samples. In this paper, we propose a contrastive multi-utterance inference (CMUI) method to address these issues. Specifically, we first use a type-aware encoder to provide an efficient encoding mechanism for different categories. Subsequently, we introduce a selective attention mechanism to explicitly capture the dependencies among utterances, thereby constructing a multi-utterance inference. Finally, a supervised contrastive learning approach is integrated into our framework to improve the recognition of hard samples. Extensive experiments show that our model achieves state-of-the-art performance on a public Chinese benchmark dataset and delivers significant performance gains on MDIE compared with the baselines. Specifically, we outperform the state-of-the-art results by 2.27% in F1-score, 0.55% in recall and 3.61% in precision. (The code that supports the findings of this study is openly available at https://github.com/jc4357/CMUI.)
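To make the supervised contrastive component concrete, here is a minimal PyTorch sketch of a generic supervised contrastive loss of the kind the abstract describes; it is not the authors' implementation (their code is at the linked repository).

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of embeddings and integer labels.

    Samples sharing a label are pulled together; all other pairs are pushed apart.
    """
    z = F.normalize(embeddings, dim=1)                      # (B, D), unit length
    sim = z @ z.T / temperature                             # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)           # anchors with no positive add 0
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss.mean()

emb = torch.randn(8, 128)                                   # dummy utterance embeddings
y = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])                  # dummy category labels
print(supervised_contrastive_loss(emb, y))
```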


Subjects
Deep Learning, Information Storage and Retrieval, Benchmarking, China, Electronic Health Records
3.
Metab Eng ; 86: 1-11, 2024 Sep 02.
Article in English | MEDLINE | ID: mdl-39233197

ABSTRACT

There have been significant advances in literature mining, allowing for the extraction of target information from the literature. However, biological literature often includes biological pathway images that are difficult to extract in an easily editable format. To address this challenge, this study aims to develop a machine learning framework called the "Extraction of Biological Pathway Information" (EBPI). The framework automates the search for relevant publications, extracts biological pathway information from images within the literature, including genes, enzymes, and metabolites, and generates the output in a tabular format. To do so, the framework determines the direction of biochemical reactions and detects and classifies text within biological pathway images. The performance of EBPI was evaluated by comparing the extracted pathway information with manually curated pathway maps. EBPI will be useful for extracting biological pathway information from the literature in a high-throughput manner and can be used for pathway studies, including metabolic engineering.
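A very rough sketch of the OCR-based text-extraction step is shown below; the file name, keyword heuristics and output columns are illustrative assumptions only, and EBPI's detection, classification and reaction-direction models are far more involved.

```python
import pandas as pd
import pytesseract
from PIL import Image

# Hypothetical pathway figure and naive keyword tagging; placeholders, not EBPI's models.
ENZYME_HINTS = ("ase",)                       # e.g. kinase, dehydrogenase
img = Image.open("pathway_figure.png")        # hypothetical file name
ocr = pytesseract.image_to_data(img, output_type=pytesseract.Output.DATAFRAME)
ocr = ocr.dropna(subset=["text"])             # keep only word-level OCR rows

rows = []
for _, r in ocr.iterrows():
    token = str(r["text"]).strip()
    if not token:
        continue
    kind = "enzyme" if token.lower().endswith(ENZYME_HINTS) else "metabolite_or_gene"
    rows.append({"token": token, "type": kind, "x": r["left"], "y": r["top"]})

# Tabular output of detected text elements with their positions in the figure.
print(pd.DataFrame(rows).head())
```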

4.
Eur Radiol ; 34(1): 330-337, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37505252

ABSTRACT

OBJECTIVES: To provide physicians and researchers with an efficient way to extract information from weakly structured radiology reports using natural language processing (NLP) machine learning models. METHODS: We evaluate seven different German bidirectional encoder representations from transformers (BERT) models on a dataset of 857,783 unlabeled radiology reports and an annotated reading comprehension dataset in the format of SQuAD 2.0 based on 1223 additional reports. RESULTS: Continued pre-training of a BERT model on the radiology dataset and a medical online encyclopedia resulted in the most accurate model, with an F1-score of 83.97% and an exact match score of 71.63% for answerable questions and 96.01% accuracy in detecting unanswerable questions. Fine-tuning a non-medical model without further pre-training led to the lowest-performing model. The final model proved stable against variation in the formulation of questions and in dealing with questions on topics excluded from the training set. CONCLUSIONS: General-domain BERT models further pre-trained on radiological data achieve high accuracy in answering questions on radiology reports. We propose to integrate our approach into the workflow of medical practitioners and researchers to extract information from radiology reports. CLINICAL RELEVANCE STATEMENT: By reducing the need for manual searches of radiology reports, radiologists' resources are freed up, which indirectly benefits patients. KEY POINTS: • BERT models pre-trained on general-domain datasets and radiology reports achieve high accuracy (83.97% F1-score) on question answering for radiology reports. • The best-performing model achieves an F1-score of 83.97% for answerable questions and 96.01% accuracy for questions without an answer. • Additional radiology-specific pre-training of all investigated BERT models improves their performance.
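For readers who want to reproduce the general setup, a SQuAD 2.0-style extractive question-answering call with Hugging Face Transformers looks like the sketch below; the public German QA model named here is a stand-in, not the radiology-adapted model from the study.

```python
from transformers import pipeline

# Public German extractive-QA checkpoint used as a stand-in for the paper's
# radiology-adapted BERT (not assumed to be publicly released).
qa = pipeline("question-answering", model="deepset/gelectra-base-germanquad")

report = "CT Thorax: Kein Nachweis pulmonaler Rundherde. Geringer Pleuraerguss rechts."
answer = qa(
    question="Gibt es einen Pleuraerguss?",
    context=report,
    handle_impossible_answer=True,   # SQuAD 2.0 behaviour: may return an empty answer
)
print(answer)                        # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```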


Subjects
Information Storage and Retrieval, Radiology, Humans, Language, Machine Learning, Natural Language Processing
5.
BMC Med Res Methodol ; 24(1): 63, 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38468224

ABSTRACT

BACKGROUND: Laboratory data can provide great value to support research aimed at reducing the incidence, prolonging survival and enhancing outcomes of cancer. Data is characterized by the information it carries and the format it holds. Data captured in Alberta's biomarker laboratory repository is free-text, cluttered and rogue. Such a data format limits its utility and prohibits broader adoption and research development. Text analysis for information extraction of unstructured data can change this and lead to more complete analyses. Previous work on extracting relevant information from free-text, unstructured data has employed Natural Language Processing (NLP), Machine Learning (ML), rule-based Information Extraction (IE) methods, or a hybrid combination of them. METHODS: In our study, text analysis was performed on Alberta Precision Laboratories data, which consisted of 95,854 entries from the Southern Alberta Dataset (SAD) and 6944 entries from the Northern Alberta Dataset (NAD). The data covers all of Alberta and is completely population-based. Our proposed framework is built around rule-based IE methods. It incorporates syntactic and lexical analyses to achieve deterministic extraction of data from biomarker laboratory data (i.e., Epidermal Growth Factor Receptor (EGFR) test results). Lexical analysis comprises data cleaning and pre-processing, conversion of Rich Text Format text into readable plain text, and normalization and tokenization of text. The framework then passes the text into the syntax analysis stage, which includes the rule-based method of extracting relevant data. Rule-based patterns of the test result are identified, and a context-free grammar then generates the rules of information extraction. Finally, the results are linked with the Alberta Cancer Registry to support real-world cancer research studies. RESULTS: Of the original 5512 entries in the SAD dataset and 5017 entries in the NAD dataset which were filtered for EGFR, the framework yielded 5129 and 3388 extracted EGFR test results from the SAD and NAD datasets, respectively. An accuracy of 97.5% was achieved on a random sample of 362 tests. CONCLUSIONS: We presented a text analysis framework to extract specific information from unstructured clinical data. Our proposed framework has shown that it can successfully extract relevant information from EGFR test results.
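The rule-based extraction step can be pictured with a much-simplified regular-expression pass like the one below; the patterns and example sentence are illustrative and far less complete than the study's context-free grammar.

```python
import re

# Illustrative patterns only; the study's grammar covers many more report formats.
EGFR_RESULT = re.compile(
    r"EGFR\s+mutation\s+(?P<status>detected|not detected)", re.IGNORECASE)
EGFR_VARIANT = re.compile(
    r"exon\s*(?P<exon>\d+)\s*(?P<variant>[A-Z]\d+[A-Z]|deletion)", re.IGNORECASE)

def extract_egfr(text):
    """Return a small structured record from a free-text EGFR result sentence."""
    result = {"status": None, "exon": None, "variant": None}
    m = EGFR_RESULT.search(text)
    if m:
        result["status"] = m.group("status").lower()
    v = EGFR_VARIANT.search(text)
    if v:
        result["exon"], result["variant"] = v.group("exon"), v.group("variant")
    return result

print(extract_egfr("EGFR mutation detected: exon 21 L858R substitution."))
# {'status': 'detected', 'exon': '21', 'variant': 'L858R'}
```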


Assuntos
Carcinoma Pulmonar de Células não Pequenas , Neoplasias Pulmonares , Humanos , Carcinoma Pulmonar de Células não Pequenas/diagnóstico , Carcinoma Pulmonar de Células não Pequenas/genética , Laboratórios , NAD , Neoplasias Pulmonares/diagnóstico , Neoplasias Pulmonares/genética , Mutação , Processamento de Linguagem Natural , Receptores ErbB , Biomarcadores , Registros Eletrônicos de Saúde
6.
J Biomed Inform ; 156: 104674, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38871012

ABSTRACT

OBJECTIVE: Biomedical Named Entity Recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by considering additional external contexts. Unlike prior methods that mainly use the original input sequences for sequence labeling, the model takes into account additional contexts to enhance the representation of entities in the original sequences, since such contexts can provide extra information that helps explain biomedical concepts. METHODS: To exploit an additional context, given an original input sequence, the model first retrieves relevant sentences from PubMed and then ranks the retrieved sentences to form the context. It next combines the context with the original input sequence to form a new, enhanced sequence. The original and enhanced sequences are fed into PubMedBERT to learn feature representations. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. The final named entity label prediction is done by a CRF layer. The model is jointly trained in an end-to-end manner to take advantage of the additional context for NER of the original sequence. RESULTS: Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirm the contribution of additional contexts to bio NER. CONCLUSION: The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of recognition of biomedical entities. Second, PubMed is more appropriate than the Google search engine for providing relevant information for bio NER. Finally, relevant sentences in the retrieved context are more beneficial than irrelevant ones for enriching the original input sequences. The model is flexible enough to integrate other types of additional context for the NER task.
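The encoder-BiLSTM-CRF stack described above can be sketched roughly as follows; the checkpoint name and the pytorch-crf dependency are assumptions for illustration, and the PubMed retrieval/ranking step is omitted.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf (third-party CRF layer)

class BertBiLSTMCRF(nn.Module):
    """PubMedBERT encoder + BiLSTM + CRF tagger: a sketch, not the paper's model."""

    def __init__(self, num_tags,
                 encoder_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
                 lstm_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # assumed public checkpoint
        hidden = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(hidden, lstm_hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.lstm(x)                          # fine-grained sequential features
        emissions = self.emit(x)                     # per-token tag scores
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)   # negative log-likelihood loss
        return self.crf.decode(emissions, mask=mask)        # best tag sequence per sentence
```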


Assuntos
Processamento de Linguagem Natural , PubMed , Humanos , Algoritmos , Mineração de Dados/métodos , Semântica , Informática Médica/métodos
7.
J Biomed Inform ; 151: 104618, 2024 03.
Article in English | MEDLINE | ID: mdl-38431151

ABSTRACT

OBJECTIVE: Goals of care (GOC) discussions are an increasingly used quality metric in serious illness care and research. Wide variation in documentation practices within the Electronic Health Record (EHR) presents challenges for reliable measurement of GOC discussions. Novel natural language processing approaches are needed to capture GOC discussions documented in real-world samples of seriously ill hospitalized patients' EHR notes, a corpus with a very low event prevalence. METHODS: To automatically detect sentences documenting GOC discussions outside of dedicated GOC note types, we proposed an ensemble of classifiers aggregating the predictions of rule-based, feature-based, and three transformer-based classifiers. We trained our classifier on 600 manually annotated EHR notes from patients with serious illnesses. Our corpus exhibited an extremely imbalanced ratio of sentences discussing GOC to sentences that do not, which challenges standard supervised methods for training a classifier. Therefore, we trained our classifier with active learning. RESULTS: Using active learning, we reduced the annotation cost of fine-tuning our ensemble by 70% while improving its performance on our test set of 176 EHR notes, with an F1-score of 0.557 for sentence classification and 0.629 for note classification. CONCLUSION: When classifying notes, with a true positive rate of 72% (13/18) and a false positive rate of 8% (13/158), our performance may be sufficient for deploying our classifier in the EHR to facilitate bedside clinicians' access to GOC conversations documented outside of dedicated note types, without overburdening clinicians with false positives. Improvements are needed before using it to enrich trial populations or as an outcome measure.
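The active-learning loop can be pictured with a generic uncertainty-sampling round like the sketch below; it uses a simple logistic-regression stand-in rather than the paper's rule-based, feature-based and transformer ensemble, and the variable names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_labeled, y_labeled, X_pool, batch_size=50):
    """One uncertainty-sampling round: fit on labeled data, pick the least certain pool items."""
    clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_labeled, y_labeled)
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = 1.0 - np.abs(probs - 0.5) * 2        # highest near p = 0.5
    query_idx = np.argsort(-uncertainty)[:batch_size]  # send these sentences for annotation
    return clf, query_idx

# X_labeled / y_labeled / X_pool would be sentence feature matrices and labels (placeholders).
```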


Assuntos
Comunicação , Documentação , Humanos , Registros Eletrônicos de Saúde , Processamento de Linguagem Natural , Planejamento de Assistência ao Paciente
8.
J Biomed Inform ; 157: 104720, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39233209

ABSTRACT

BACKGROUND: In oncology, electronic health records contain key textual information for the diagnosis, staging, and treatment planning of patients with cancer. However, processing text data requires much time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research. In particular, extracting the information required for the pathological stage from surgical pathology reports can be used to update cancer staging according to the latest cancer staging guidelines. OBJECTIVES: This study has two main objectives. The first is to evaluate the performance of extracting information from text-based surgical pathology reports and determining pathological stages based on the extracted information using fine-tuned generative language models (GLMs) for patients with lung cancer. The second is to determine the feasibility of utilizing relatively small GLMs for information extraction in a resource-constrained computing environment. METHODS: Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard validated by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs, which were then used to extract the information required for staging from pathology reports. RESULTS: We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. The Deductive Mistral-7B model, which was pre-trained with the deductive dataset, showed the best overall performance, with an exact match ratio of 92.24% on the information extraction problem and an accuracy of 0.9876 (predicting T and N classification concurrently) in classification. CONCLUSION: This study demonstrated that training GLMs with deductive datasets can improve information extraction performance and that GLMs with a relatively small number of parameters, at approximately seven billion, can achieve high performance on this problem. The proposed GLM-based information extraction method is expected to be useful in clinical decision support, lung cancer staging and research.
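To illustrate the prompt-response training format, here is a hypothetical pair; the descriptor names and JSON layout are invented for illustration and do not reproduce the study's actual 42-descriptor schema.

```python
# Hypothetical prompt-response pair for fine-tuning a generative language model.
example = {
    "prompt": (
        "Extract the following descriptors from the surgical pathology report below "
        "and return JSON.\n"
        "Descriptors: tumor_size_cm, visceral_pleural_invasion, positive_lymph_nodes\n\n"
        "Report:\n<pathology report text>"
    ),
    "response": (
        '{"tumor_size_cm": 2.8, "visceral_pleural_invasion": false, '
        '"positive_lymph_nodes": 0}'
    ),
}
print(example["prompt"])
```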


Assuntos
Neoplasias Pulmonares , Processamento de Linguagem Natural , Estadiamento de Neoplasias , Neoplasias Pulmonares/patologia , Neoplasias Pulmonares/diagnóstico , Humanos , Estadiamento de Neoplasias/métodos , Registros Eletrônicos de Saúde , Mineração de Dados/métodos , Algoritmos , Bases de Dados Factuais
9.
BMC Med Imaging ; 24(1): 86, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38600525

ABSTRACT

Medical imaging AI systems and big data analytics have attracted much attention from researchers in industry and academia. The application of medical imaging AI systems and big data analytics plays an important role in the development of content-based remote sensing (CBRS) technology. Environmental data, information, and analyses have been produced promptly using remote sensing (RS). The method for creating a useful digital map from an image data set is called image information extraction. Image information extraction depends on target recognition (shape and color). For low-level image attributes like texture, Classifier-based Retrieval (CR) techniques are ineffective, since they categorize the input images and only return images from the determined classes of RS. The issues mentioned earlier cannot be handled by the existing expertise based on a keyword/metadata remote sensing data service model. To overcome these restrictions, Fuzzy Class Membership-based Image Extraction (FCMIE), a technology developed for content-based remote sensing (CBRS), is proposed. The compensation fuzzy neural network (CFNN) is used to calculate the category label and fuzzy category membership of the query image, using a basic, balanced weighted distance metric. Feature information extraction (FIE) enhances remote sensing image processing and autonomous information retrieval of visual content based on time-frequency meaning, such as the color, texture and shape attributes of images. A hierarchical nested structure and cyclic similarity measure produce faster queries when searching. The experiment's findings indicate that applying the proposed model can yield favorable outcomes for assessment measures, including ratio of coverage, average mean precision, recall, and retrieval efficiency, which are attained more effectively than with the existing CR model. In the areas of feature tracking, climate forecasting, background noise reduction, and simulating nonlinear functional behaviors, CFNN has a wide range of RS applications. The proposed CFNN-FCMIE method achieves a minimum range of 4-5% for all three feature vectors, sample mean and comparison precision-recall ratio, which gives better results than the existing classifier-based retrieval model. This work provides an important reference for medical imaging artificial intelligence systems and big data analysis.


Assuntos
Inteligência Artificial , Tecnologia de Sensoriamento Remoto , Humanos , Ciência de Dados , Armazenamento e Recuperação da Informação , Redes Neurais de Computação
10.
J Med Internet Res ; 26: e55315, 2024 Sep 30.
Article in English | MEDLINE | ID: mdl-39348889

ABSTRACT

BACKGROUND: Ensuring access to accurate and verified information is essential for effective patient treatment and diagnosis. Although health workers rely on the internet for clinical data, there is a need for a more streamlined approach. OBJECTIVE: This systematic review aims to assess the current state of artificial intelligence (AI) and natural language processing (NLP) techniques in health care to identify their potential use in electronic health records and automated information searches. METHODS: A search was conducted in the PubMed, Embase, ScienceDirect, Scopus, and Web of Science online databases for articles published between January 2000 and April 2023. The only inclusion criteria were (1) original research articles and studies on the application of AI-based medical clinical decision support using NLP techniques and (2) publications in English. A Critical Appraisal Skills Programme tool was used to assess the quality of the studies. RESULTS: The search yielded 707 articles, from which 26 studies were included (24 original articles and 2 systematic reviews). Of the evaluated articles, 21 (81%) explained the use of NLP as a source of data collection, 18 (69%) used electronic health records as a data source, and a further 8 (31%) were based on clinical data. Only 5 (19%) of the articles showed the use of combined strategies for NLP to obtain clinical data. In total, 16 (62%) articles presented stand-alone data review algorithms. Other studies (n=9, 35%) showed that the clinical decision support system alternative was also a way of displaying the information obtained for immediate clinical use. CONCLUSIONS: The use of NLP engines can effectively improve clinical decision systems' accuracy, while biphasic tools combining AI algorithms and human criteria may optimize clinical diagnosis and treatment flows. TRIAL REGISTRATION: PROSPERO CRD42022373386; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=373386.


Assuntos
Sistemas de Apoio a Decisões Clínicas , Registros Eletrônicos de Saúde , Processamento de Linguagem Natural , Humanos , Inteligência Artificial
11.
BMC Med Inform Decis Mak ; 24(1): 283, 2024 Oct 03.
Article in English | MEDLINE | ID: mdl-39363322

ABSTRACT

AIMS: The primary goal of this study is to evaluate the capabilities of Large Language Models (LLMs) in understanding and processing complex medical documentation. We chose to focus on the identification of pathologic complete response (pCR) in narrative pathology reports. This approach aims to contribute to the advancement of comprehensive reporting, health research, and public health surveillance, thereby enhancing patient care and breast cancer management strategies. METHODS: The study utilized two analytical pipelines, developed with open-source LLMs within the healthcare system's computing environment. First, we extracted embeddings from pathology reports using 15 different transformer-based models and then employed logistic regression on these embeddings to classify the presence or absence of pCR. Second, we fine-tuned the Generative Pre-trained Transformer-2 (GPT-2) model by attaching a simple feed-forward neural network (FFNN) layer to improve the detection of pCR from pathology reports. RESULTS: In a cohort of 351 female breast cancer patients who underwent neoadjuvant chemotherapy (NAC) and subsequent surgery between 2010 and 2017 in Calgary, the optimized method displayed a sensitivity of 95.3% (95% CI: 84.0-100.0%), a positive predictive value of 90.9% (95% CI: 76.5-100.0%), and an F1 score of 93.0% (95% CI: 83.7-100.0%). The results, achieved through diverse LLM integration, surpassed traditional machine learning models, underscoring the potential of LLMs in clinical pathology information extraction. CONCLUSIONS: The study successfully demonstrates the efficacy of LLMs in interpreting and processing digital pathology data, particularly for determining pCR in breast cancer patients post-NAC. The superior performance of LLM-based pipelines over traditional models highlights their significant potential for extracting and analyzing key clinical data from narrative reports. While promising, these findings highlight the need for future external validation to confirm the reliability and broader applicability of these methods.
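The first pipeline, frozen transformer embeddings fed to logistic regression, can be sketched as below; the GPT-2 checkpoint, mean pooling, and placeholder reports are assumptions rather than the study's exact configuration or data.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Placeholder reports and labels; the study used 351 real pathology reports.
train_reports = ["No residual invasive carcinoma identified.",
                 "Residual invasive ductal carcinoma, 1.2 cm."]
train_labels = [1, 0]                 # 1 = pathologic complete response (pCR)

name = "gpt2"                         # stand-in checkpoint, not one of the study's 15 models
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name).eval()

def embed(texts):
    """Mean-pooled hidden states as a fixed-length embedding per report."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            h = enc(**ids).last_hidden_state          # (1, tokens, hidden)
        vecs.append(h.mean(dim=1).squeeze(0).numpy())
    return np.stack(vecs)

clf = LogisticRegression(max_iter=1000).fit(embed(train_reports), train_labels)
```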


Assuntos
Neoplasias da Mama , Humanos , Neoplasias da Mama/patologia , Feminino , Pessoa de Meia-Idade , Redes Neurais de Computação , Processamento de Linguagem Natural , Adulto , Idoso , Terapia Neoadjuvante , Resposta Patológica Completa
12.
BMC Med Inform Decis Mak ; 24(Suppl 5): 262, 2024 Sep 17.
Article in English | MEDLINE | ID: mdl-39289714

ABSTRACT

BACKGROUND: Applying graph convolutional networks (GCNs) to the classification of free-form natural language texts using graph-of-words features (TextGCN) has been studied and confirmed to be an effective means of describing complex natural language texts. However, text classification models based on TextGCN have weaknesses in terms of memory consumption and model dissemination and distribution. In this paper, we present a fast message passing network (FastMPN), implementing a GCN with a message passing architecture that provides versatility and flexibility by allowing trainable node embeddings and edge weights, helping the GCN model find a better solution. We applied the FastMPN model to the task of clinical information extraction from cancer pathology reports, extracting the following six properties: main site, subsite, laterality, histology, behavior, and grade. RESULTS: We evaluated the clinical task performance of the FastMPN models in terms of micro- and macro-averaged F1 scores. A comparison was performed with the multi-task convolutional neural network (MT-CNN) model. Results show that the FastMPN model is equivalent to or better than the MT-CNN. CONCLUSIONS: Our implementation revealed that our FastMPN model, which is based on the PyTorch platform, can train on a large corpus (667,290 training samples) with 202,373 unique words in less than 3 minutes per epoch using one NVIDIA V100 hardware accelerator. Our experiments demonstrated that, using this implementation, the clinical task performance scores for extracting tumor-related information from cancer pathology reports were highly competitive.
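As a rough picture of message passing over a graph-of-words with trainable edge weights, here is a generic single layer in plain PyTorch; it illustrates the mechanism only and is not the FastMPN implementation.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing with per-edge weights (a sketch, not FastMPN)."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)      # transform neighbor states into messages
        self.upd = nn.GRUCell(dim, dim)     # update node states from aggregated messages

    def forward(self, h, edge_index, edge_weight):
        src, dst = edge_index                                   # (E,), (E,)
        messages = self.msg(h[src]) * edge_weight.unsqueeze(-1) # weight each edge's message
        agg = torch.zeros_like(h).index_add_(0, dst, messages)  # sum messages per target node
        return self.upd(agg, h)

h = torch.randn(5, 32)                              # 5 word nodes, 32-dim embeddings
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])   # toy co-occurrence edges
edge_weight = torch.rand(3)                         # trainable in the real model
print(MessagePassingLayer(32)(h, edge_index, edge_weight).shape)   # torch.Size([5, 32])
```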


Assuntos
Processamento de Linguagem Natural , Neoplasias , Redes Neurais de Computação , Humanos , Neoplasias/classificação , Mineração de Dados
13.
Sensors (Basel) ; 24(6)2024 Mar 12.
Article in English | MEDLINE | ID: mdl-38544085

ABSTRACT

Functional near-infrared spectroscopy (fNIRS) can dynamically respond to the relevant state of brain activity based on the hemodynamic information of brain tissue. The cerebral cortex and gray matter are the main regions reflecting brain activity. As they are far from the scalp surface, the accuracy of brain activity detection will be significantly affected by a series of physiological activities. In this paper, an effective algorithm for extracting brain activity information is designed based on the measurement method of dual detectors so as to obtain real brain activity information. The principle of this algorithm is to take the measurement results of short-distance channels as reference signals to eliminate the physiological interference information in the measurement results of long-distance channels. In this paper, the performance of the proposed method is tested using both simulated and measured signals and compared with the extraction results of EEMD-RLS, RLS and fast-ICA, and their extraction effects are quantified by correlation coefficient (R), root-mean-square error (RMSE), and mean absolute error (MAE). The test results show that even under low SNR conditions, the proposed method can still effectively suppress physiological interference and improve the detection accuracy of brain activity signals.
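The core idea of using the short-distance channel as a reference to regress out superficial physiology from the long-distance channel can be expressed with ordinary least squares as below; the paper's own algorithm is adaptive and more elaborate, so this is only a baseline sketch on synthetic data.

```python
import numpy as np

def remove_superficial(long_ch, short_ch):
    """Regress the short-distance (superficial) signal out of the long-distance channel."""
    X = np.column_stack([short_ch, np.ones_like(short_ch)])   # reference + offset
    beta, *_ = np.linalg.lstsq(X, long_ch, rcond=None)        # least-squares fit
    return long_ch - X @ beta                                 # residual ~ cortical signal

# Synthetic demonstration: slow "brain" signal buried under shared systemic interference.
t = np.arange(0, 60, 0.1)
physio = 0.5 * np.sin(2 * np.pi * 0.25 * t)                   # simulated interference
short = physio + 0.01 * np.random.randn(t.size)
long_ = 0.2 * np.sin(2 * np.pi * 0.02 * t) + physio + 0.01 * np.random.randn(t.size)
clean = remove_superficial(long_, short)
```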


Assuntos
Encéfalo , Oxigênio , Encéfalo/fisiologia , Espectroscopia de Luz Próxima ao Infravermelho/métodos , Couro Cabeludo , Algoritmos
14.
Sensors (Basel) ; 24(9)2024 May 06.
Article in English | MEDLINE | ID: mdl-38733057

ABSTRACT

Multi-layer complex structures are widely used in large-scale engineering structures because of their diverse combinations of properties and excellent overall performance. However, multi-layer complex structures are prone to interlaminar debonding damage during use. Therefore, it is necessary to monitor debonding damage in engineering applications to determine structural integrity. In this paper, a damage information extraction method with ladder feature mining for Lamb waves is proposed. The method is able to optimize and screen effective damage information through ladder-type damage extraction. It is suitable for evaluating the severity of debonding damage in aluminum-foamed silicone rubber, a novel multi-layer complex structure. The proposed method contains ladder feature mining stages of damage information selection and damage feature fusion, realizing a multi-level damage information extraction process from coarse to fine. The results show that the accuracy of damage severity assessment by the damage information extraction method with ladder feature mining is improved by more than 5% compared to other methods. The effectiveness and accuracy of the method in assessing the damage severity of multi-layer complex structures are demonstrated, providing a new perspective and solution for damage monitoring of multi-layer complex structures.

15.
Sensors (Basel) ; 24(11)2024 May 29.
Article in English | MEDLINE | ID: mdl-38894293

ABSTRACT

Effective lane detection technology plays an important role in the current autonomous driving system. Although deep learning models, with their intricate network designs, have proven highly capable of detecting lanes, there persist key areas requiring attention. Firstly, the symmetry inherent in visuals captured by forward-facing automotive cameras is an underexploited resource. Secondly, the vast potential of position information remains untapped, which can undermine detection precision. In response to these challenges, we propose FF-HPINet, a novel approach for lane detection. We introduce the Flipped Feature Extraction module, which models pixel pairwise relationships between the flipped feature and the original feature. This module allows us to capture symmetrical features and obtain high-level semantic feature maps from different receptive fields. Additionally, we design the Hierarchical Position Information Extraction module to meticulously mine the position information of the lanes, vastly improving target identification accuracy. Furthermore, the Deformable Context Extraction module is proposed to distill vital foreground elements and contextual nuances from the surrounding environment, yielding focused and contextually apt feature representations. Our approach achieves excellent performance with the F1 score of 97.00% on the TuSimple dataset and 76.84% on the CULane dataset.
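The symmetry idea behind the Flipped Feature Extraction module can be pictured minimally as mirroring a feature map along the width axis and combining it with the original; the snippet below shows only that mechanism, not FF-HPINet's pixel pairwise-relation modelling.

```python
import torch

def with_flipped_view(feat):
    """Concatenate a CNN feature map (N, C, H, W) with its horizontal mirror.

    A minimal sketch of exploiting left-right symmetry; the paper models pairwise
    relations between the two views rather than simply concatenating them.
    """
    flipped = torch.flip(feat, dims=[-1])      # mirror along the width axis
    return torch.cat([feat, flipped], dim=1)   # (N, 2C, H, W)

x = torch.randn(1, 64, 36, 100)                # dummy backbone feature map
print(with_flipped_view(x).shape)              # torch.Size([1, 128, 36, 100])
```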

16.
Med Ref Serv Q ; 43(2): 196-202, 2024.
Article in English | MEDLINE | ID: mdl-38722609

ABSTRACT

Named entity recognition (NER) systems have used a variety of computing strategies to extract information from raw text since the early 1990s. With rapid advances in AI and computing, NER models have gained significant attention and serve as foundational tools across numerous professional domains to organize unstructured data for research and practical applications. This is particularly evident in the medical and healthcare fields, where NER models are essential for efficiently extracting critical information from complex documents that are challenging to review manually. Despite its successes, NER still has limitations in fully comprehending natural language nuances. However, the development of more advanced and user-friendly models promises to significantly improve the work experience of professional users.
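For readers unfamiliar with NER in practice, an off-the-shelf example with spaCy looks like this; the general-purpose English pipeline shown is a public model, not one of the specialized medical NER systems discussed.

```python
import spacy

# pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Metformin 500 mg was prescribed at Massachusetts General Hospital in March 2023.")

# Each recognized entity comes with a span and a label (ORG, DATE, etc.).
for ent in doc.ents:
    print(ent.text, ent.label_)
```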


Assuntos
Armazenamento e Recuperação da Informação , Processamento de Linguagem Natural , Armazenamento e Recuperação da Informação/métodos , Humanos , Inteligência Artificial
17.
BMC Bioinformatics ; 24(1): 328, 2023 Sep 02.
Article in English | MEDLINE | ID: mdl-37658330

ABSTRACT

BACKGROUND: Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. RESULTS: We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler "bag of words" or convolutional neural network models. CONCLUSION: When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.
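The "simple machine learning architectures" the conclusion refers to can be made concrete with a bag-of-words baseline like the one below; the report texts and labels are placeholders, not the study's corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data; the study used 14,218 labeled imaging reports.
train_reports = ["Decrease in size of the right upper lobe mass.",
                 "Interval enlargement of hepatic metastases."]
train_labels = [0, 1]   # e.g. 1 = progression documented

# A "bag of words" baseline of the kind the authors compare against BERT models.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=2000),
)
baseline.fit(train_reports, train_labels)
print(baseline.predict(["Further interval growth of the known lung mass."]))
```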


Assuntos
Carcinoma Pulmonar de Células não Pequenas , Neoplasias Pulmonares , Humanos , Carcinoma Pulmonar de Células não Pequenas/diagnóstico por imagem , Neoplasias Pulmonares/diagnóstico por imagem , Progressão da Doença , Fontes de Energia Elétrica , Registros Eletrônicos de Saúde
18.
BMC Bioinformatics ; 24(1): 93, 2023 Mar 14.
Article in English | MEDLINE | ID: mdl-36918766

ABSTRACT

BACKGROUND: Drug-drug interaction (DDI) prediction is vital for pharmacology and clinical application to avoid adverse drug reactions in patients. It is challenging because DDIs are related to multiple factors, such as genes, drug molecular structure, diseases, biological processes, side effects, etc. Knowledge graphs are a crucial technology for representing multiple relations among entities. Recently, several graph-based computational models have been proposed for DDI prediction and achieve good performance. However, challenges remain in learning knowledge graph representations that can extract rich latent features from a drug knowledge graph (KG). RESULTS: In this work, we propose a novel multi-view feature representation and fusion (MuFRF) architecture for DDI prediction. It consists of two views of feature representation and a multi-level latent feature fusion. For the feature representation from the graph view and the KG view, we use a graph isomorphism network to map drug molecular structures and use RotatE to implement the vector representation on the biomedical knowledge graph, respectively. We design concatenate-level and scalar-level strategies in the multi-level latent feature fusion to capture latent features from the drug molecular structure information and semantic features from the biomedical KG. A multi-head attention mechanism optimizes the features for the binary and multi-class classification tasks. We evaluate our proposed method on two open datasets; experiments indicate that MuFRF outperforms classic and state-of-the-art models. CONCLUSIONS: Our proposed model can fully exploit and integrate the latent features from the drug molecular structure graph (graph view) and the rich biomedical knowledge graph (KG view). We find that a multi-view feature representation and fusion model can accurately predict DDIs. It may provide guidance for research into and validation of novel DDIs.
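The two-view fusion idea can be sketched with a generic attention-based combiner as below; it stands in for MuFRF's concatenate-level and scalar-level strategies, which are more elaborate, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class TwoViewFusion(nn.Module):
    """Fuse a molecular-graph embedding with a KG embedding via multi-head attention.

    A sketch of the fusion step only; the graph isomorphism network and RotatE
    encoders that would produce the two embeddings are omitted.
    """

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classify = nn.Linear(dim, 1)

    def forward(self, graph_emb, kg_emb):
        views = torch.stack([graph_emb, kg_emb], dim=1)   # (B, 2, D): one token per view
        fused, _ = self.attn(views, views, views)         # let the views attend to each other
        return self.classify(fused.mean(dim=1))           # binary DDI logit per drug pair

g = torch.randn(8, 128)    # graph-view embeddings for 8 drug pairs
k = torch.randn(8, 128)    # KG-view embeddings
print(TwoViewFusion(128)(g, k).shape)   # torch.Size([8, 1])
```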


Assuntos
Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos , Humanos , Interações Medicamentosas , Conhecimento , Semântica
19.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32591802

ABSTRACT

Biomedical information extraction (BioIE) is an important task. The aim is to analyze biomedical texts and extract structured information such as named entities and the semantic relations between them. In recent years, pre-trained language models have largely improved the performance of BioIE. However, they neglect to incorporate external structural knowledge, which can provide rich factual information to support the underlying understanding and reasoning required for biomedical information extraction. In this paper, we first evaluate current extraction methods, including vanilla neural networks, general language models and pre-trained contextualized language models, on biomedical information extraction tasks, including named entity recognition, relation extraction and event extraction. We then propose to enrich a contextualized language model by integrating a large-scale biomedical knowledge graph (namely, BioKGLM). To effectively encode knowledge, we explore a three-stage training procedure and introduce different fusion strategies to facilitate knowledge injection. Experimental results on multiple tasks show that BioKGLM consistently outperforms state-of-the-art extraction models. A further analysis proves that BioKGLM can capture the underlying relations between biomedical knowledge concepts, which are crucial for BioIE.


Assuntos
Mineração de Dados , Processamento de Linguagem Natural , Redes Neurais de Computação , Semântica
20.
Brief Bioinform ; 22(2): 781-799, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33279995

ABSTRACT

More than 50 000 papers have been published about COVID-19 since the beginning of 2020 and several hundred new papers continue to be published every day. This incredible rate of scientific productivity leads to information overload, making it difficult for researchers, clinicians and public health officials to keep up with the latest findings. Automated text mining techniques for searching, reading and summarizing papers are helpful for addressing information overload. In this review, we describe the many resources that have been introduced to support text mining applications over the COVID-19 literature; specifically, we discuss the corpora, modeling resources, systems and shared tasks that have been introduced for COVID-19. We compile a list of 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature. For each system, we provide a qualitative description and assessment of the system's performance, unique data or user interface features and modeling decisions. Many systems focus on search and discovery, though several systems provide novel features, such as the ability to summarize findings over multiple documents or linking between scientific articles and clinical trials. We also describe the public corpora, models and shared tasks that have been introduced to help reduce repeated effort among community members; some of these resources (especially shared tasks) can provide a basis for comparing the performance of different systems. Finally, we summarize promising results and open challenges for text mining the COVID-19 literature.


Subjects
COVID-19/epidemiology, Data Mining/methods, COVID-19/virology, Humans, SARS-CoV-2/isolation & purification