Results 1 - 20 of 403
1.
Stud Health Technol Inform ; 316: 827-831, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176920

ABSTRACT

Finding relevant information in the biomedical literature increasingly depends on efficient information retrieval (IR) algorithms. Cross-Encoders, Sentence-BERT, and ColBERT are algorithms based on pre-trained language models that use nuanced but computable vector representations of search queries and documents for IR applications. Here we investigate how well these vectorization algorithms estimate relevance labels of biomedical documents for search queries using the OHSUMED dataset. For our evaluation, we compared computed scores to the provided labels using boxplots and Spearman's rank correlations. According to these metrics, we found that Sentence-BERT moderately outperformed the alternative vectorization algorithms and that additional fine-tuning based on a subset of OHSUMED labels yielded little additional benefit. Future research might aim to develop a larger dedicated dataset in order to optimize such methods more systematically, and to evaluate the corresponding functions in IR tools with end-users.
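For orientation, the sketch below scores toy query-document pairs with a generic Sentence-BERT checkpoint and correlates the scores with graded relevance labels via Spearman's rank correlation; the model name, documents, and labels are placeholders, not the OHSUMED setup evaluated in the paper.

```python
# Minimal sketch: bi-encoder relevance scoring + Spearman correlation (placeholder model and toy data).
from sentence_transformers import SentenceTransformer, util
from scipy.stats import spearmanr

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint, not the one evaluated in the paper

query = "antibiotic treatment of community-acquired pneumonia"
documents = [
    "Antibiotic therapy for community-acquired pneumonia in adults.",
    "Long-term outcomes of total hip replacement surgery.",
    "Empirical management of lower respiratory tract infections.",
]
relevance_labels = [2, 0, 1]  # toy graded relevance judgements

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(q_emb, d_emb).squeeze(0).tolist()  # one cosine similarity per document

rho, p_value = spearmanr(scores, relevance_labels)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```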


Subjects
Algorithms, Information Storage and Retrieval, Natural Language Processing, Information Storage and Retrieval/methods, Humans
2.
Stud Health Technol Inform ; 316: 894-898, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176937

ABSTRACT

With the objective of extracting new knowledge about rare diseases from social media messages, we evaluated three models on a Named Entity Recognition (NER) task, consisting of extracting phenotypes and treatments from social media messages. We trained the three models on a dataset with social media messages about Developmental and Epileptic Encephalopathies and more common diseases. This preliminary study revealed that CamemBERT and CamemBERT-bio exhibit similar performance on social media testimonials, slightly outperforming DrBERT. It also highlighted that their performance was lower on this type of data than on structured health datasets. Limitations, including a narrow focus on NER performance and dataset-specific evaluation, call for further research to fully assess model capabilities on larger and more diverse datasets.
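As a rough illustration of the task setup, the following sketch loads CamemBERT with a token-classification head for phenotype/treatment NER; the BIO label set and example sentence are assumptions, and useful predictions would require fine-tuning on the annotated messages described above.

```python
# Sketch: CamemBERT with a token-classification head for phenotype/treatment NER.
# The BIO label set and example are assumptions; fine-tuning is required for meaningful output.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PHENOTYPE", "I-PHENOTYPE", "B-TREATMENT", "I-TREATMENT"]
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "camembert-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

text = "Mon fils fait des crises tonico-cloniques malgré le valproate."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels), untrained head
print(logits.shape)
```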


Subjects
Social Media, France, Humans, Natural Language Processing, Data Mining/methods, Rare Diseases
3.
Stud Health Technol Inform ; 316: 1008-1012, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176961

ABSTRACT

Coding according to the International Classification of Diseases (ICD)-10 and its clinical modifications (CM) is inherently complex and expensive. Natural Language Processing (NLP) assists by simplifying the analysis of unstructured data from electronic health records, thereby facilitating diagnosis coding. This study investigates the suitability of transformer models for ICD-10 classification, considering both encoder and encoder-decoder architectures. The analysis is performed on clinical discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, which contains an extensive collection of electronic health records. Pre-trained models such as BioBERT, ClinicalBERT, ClinicalLongformer, and ClinicalBigBird are adapted for the coding task, incorporating specific preprocessing techniques to enhance performance. The findings indicate that increasing context length improves accuracy, and that the difference in accuracy between encoder and encoder-decoder models is negligible.
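The sketch below illustrates the general recipe of multi-label ICD-10 coding with a long-context encoder; it uses the publicly available base Longformer and a toy label space rather than the clinically pre-trained models and MIMIC-IV data used in the study.

```python
# Sketch: multi-label ICD-10 coding with a long-context encoder (base Longformer and a toy label space,
# not the clinically pre-trained models or MIMIC-IV data used in the study).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

icd_codes = ["I10", "E11.9", "J18.9"]  # tiny illustrative label space
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=len(icd_codes),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss during fine-tuning
)

summary = ("Discharge summary: patient admitted with community-acquired pneumonia; "
           "history of hypertension and type 2 diabetes mellitus ...")
inputs = tokenizer(summary, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits).squeeze(0)
print({code: round(float(p), 2) for code, p in zip(icd_codes, probs)})  # untrained head, illustrative only
```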


Subjects
Electronic Health Records, International Classification of Diseases, Natural Language Processing, Electronic Health Records/classification, Humans, Clinical Coding
4.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39177262

ABSTRACT

The T cell receptor (TCR) repertoire is pivotal to the human immune system, and understanding its nuances can significantly enhance our ability to forecast cancer-related immune responses. However, existing methods often overlook the intra- and inter-sequence interactions of T cell receptors (TCRs), limiting the development of sequence-based cancer-related immune status predictions. To address this challenge, we propose BertTCR, an innovative deep learning framework designed to predict cancer-related immune status from TCRs. BertTCR combines a pre-trained protein large language model with deep learning architectures, enabling it to extract deeper contextual information from TCRs. Compared to three state-of-the-art sequence-based methods, BertTCR improves the AUC on an external validation set for thyroid cancer detection by 21 percentage points. Additionally, the model was trained on over 2,000 publicly available TCR libraries covering 17 types of cancer as well as healthy samples, and it has been validated on multiple public external datasets for its ability to distinguish cancer patients from healthy individuals. Furthermore, BertTCR can accurately classify various cancer types and healthy individuals. Overall, BertTCR advances TCR-based forecasting of cancer-related immune status and offers promising potential for a wide range of immune status prediction tasks.


Subjects
Deep Learning, Neoplasms, T-Cell Antigen Receptors, Humans, T-Cell Antigen Receptors/immunology, T-Cell Antigen Receptors/genetics, T-Cell Antigen Receptors/metabolism, Neoplasms/immunology, Computational Biology/methods, Thyroid Neoplasms/immunology
5.
Front Psychol ; 15: 1430060, 2024.
Article in English | MEDLINE | ID: mdl-39184940

ABSTRACT

A number of high-frequency word lists have been created to help foreign language learners master English vocabulary. Despite their widespread use, these word lists do not take word meaning into consideration, so foreign language learners are unclear about which meanings they should focus on first. To address this issue, we semantically annotated the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC) with high accuracy using a BERT model. From these annotated corpora, we calculated the semantic frequency of different senses and filtered 5,000 senses to create a High-frequency Sense List. Subsequently, we checked the validity of this list and compared it with established influential word lists. The list exhibits three notable characteristics. First, it achieves stable coverage across different corpora. Second, it identifies high-frequency items with greater accuracy: it achieves coverage comparable to lists such as the GSL, NGSL, and New-GSL but with significantly fewer items, and it includes everyday words that previously fell off high-frequency lists, without requiring manual adjustments. Third, it describes clearly which senses are used most frequently and therefore should be the focus for beginning learners. This study represents a pioneering effort in the semantic annotation of large corpora and the creation of a word list based on semantic frequency.

6.
Front Artif Intell ; 7: 1454945, 2024.
Article in English | MEDLINE | ID: mdl-39210937

ABSTRACT

Background: In the field of evidence-based medicine, randomized controlled trials (RCTs) are critically important for writing clinical guidelines and providing guidance to practicing physicians. Currently, evidence from RCTs is extracted largely by hand, which limits the breadth of the data and is inefficient. Objectives: To expand the breadth of data and improve the efficiency of obtaining clinical evidence, we introduce an automated information extraction model for traditional Chinese medicine (TCM) RCT evidence extraction. Methods: We adopt the Evidence-Bidirectional Encoder Representations from Transformers (Evi-BERT) model for automated information extraction, combined with rule-based extraction. Eleven disease types and 48,523 research articles from the China National Knowledge Infrastructure (CNKI), WanFang Data, and VIP databases were selected as the data source. We then constructed a manually annotated dataset of TCM clinical literature to train the model, covering ten evidence elements and 24,244 data points. We chose two models, BERT-CRF and BiLSTM-CRF, as baselines and compared their training results with Evi-BERT and with Evi-BERT combined with rule expressions (RE). Results: Evi-BERT combined with RE achieved the best performance (precision = 0.926, recall = 0.952, F1 score = 0.938) and the best robustness. In total, we summarized 113 rules in the rule-extraction procedure. Our model dramatically expands the amount of data that can be searched and greatly improves efficiency without losing accuracy. Conclusion: Our work provides an intelligent approach to extracting clinical evidence from TCM RCT data. The model can help physicians reduce the time spent reading journals and speed up the screening of clinical trial evidence, helping to generate accurate clinical reference guidelines. Additionally, we hope the structured clinical evidence and structured knowledge extracted in this study will help other researchers build large language models for TCM.

7.
Stud Health Technol Inform ; 316: 834-838, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176922

ABSTRACT

Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.


Subjects
Clinical Trials as Topic, Information Dissemination, Humans, Natural Language Processing, Electronic Health Records/classification
8.
JMIR AI ; 3: e52190, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39190905

ABSTRACT

BACKGROUND: Predicting hospitalization from nurse triage notes has the potential to augment care. However, careful consideration is needed when choosing models for this goal, because health systems have varying degrees of computational infrastructure and budget constraints. OBJECTIVE: To this end, we compared the performance of the deep learning model Bio-Clinical-BERT, which is based on Bidirectional Encoder Representations from Transformers (BERT), with a bag-of-words (BOW) logistic regression (LR) model using term frequency-inverse document frequency (TF-IDF) features. These choices represent different levels of computational requirements. METHODS: A retrospective analysis was conducted using data from 1,391,988 patients who visited emergency departments in the Mount Sinai Health System from 2017 to 2022. The models were trained on data from 4 hospitals and externally validated on data from a fifth hospital. RESULTS: The Bio-Clinical-BERT model achieved higher areas under the receiver operating characteristic curve (0.82, 0.84, and 0.85) than the BOW-LR-TF-IDF model (0.81, 0.83, and 0.84) across training sets of 10,000; 100,000; and ~1,000,000 patients, respectively. Notably, both models proved effective at using triage notes for prediction, despite the modest performance gap. CONCLUSIONS: Our findings suggest that simpler machine learning models such as BOW-LR-TF-IDF could serve adequately in resource-limited settings. Given the potential implications for patient care and hospital resource management, further exploration of alternative models and techniques is warranted to enhance predictive performance in this critical domain. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2023.08.07.23293699.
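For context, a minimal version of the kind of BOW-LR-TF-IDF baseline compared here might look as follows; the triage notes and labels are invented examples, not Mount Sinai data.

```python
# Sketch: a BOW-LR-TF-IDF baseline of the kind compared in the study (invented toy notes and labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

triage_notes = [
    "chest pain radiating to left arm, diaphoretic",
    "ankle sprain after fall, ambulatory, mild swelling",
    "shortness of breath, oxygen saturation 88 percent",
    "medication refill request, no acute complaints",
]
admitted = [1, 0, 1, 0]  # 1 = hospitalized, 0 = discharged (illustrative labels)

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(triage_notes, admitted)
print(pipeline.predict_proba(["elderly patient with acute shortness of breath"])[:, 1])
```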

9.
MethodsX ; 13: 102843, 2024 Dec.
Article in English | MEDLINE | ID: mdl-39101121

ABSTRACT

Disaster events are actively discussed on microblogging platforms like Twitter, which can lead to chaotic situations. In the era of machine learning and deep learning, these chaotic situations can be effectively controlled by developing efficient methods and models that assist in classifying real and fake tweets. In this research article, an efficient method, a BERT-embedding-based CNN model with the RMSProp optimizer, is proposed to classify tweets related to disaster scenarios. Tweet classification is first carried out with popular machine learning algorithms such as logistic regression and decision tree classifiers. Noting the low accuracy of these machine learning models, a Convolutional Neural Network (CNN)-based deep learning model is selected as the primary classification method. The CNN's performance is improved by optimizing its parameters with gradient-based optimizers. To further raise accuracy and capture contextual semantics from the text data, BERT embeddings are included in the proposed model. The proposed method, the BERT-embedding-based CNN model with the RMSProp optimizer, achieved an F1 score of 0.80 and an accuracy of 0.83. The methodology presented in this research article comprises the following key contributions:
• Identification of a suitable text classification model that can effectively capture complex patterns when dealing with large vocabularies or nuanced language structures in disaster management scenarios.
• Exploration of gradient-based optimization techniques, namely the Adam, Stochastic Gradient Descent (SGD), AdaGrad, and RMSprop optimizers, to identify the optimizer best suited to the characteristics of the dataset and the CNN model architecture.
• A "BERT-embedding-based CNN model with RMSProp optimizer" that classifies disaster tweets and captures semantic representations by leveraging BERT embeddings with appropriate feature selection, validated through comparative analysis.
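One plausible reading of the proposed architecture, sketched below, runs frozen BERT token embeddings through a small 1-D CNN trained with RMSprop; the layer sizes, the frozen-backbone choice, and the toy tweets and labels are assumptions rather than the authors' exact configuration.

```python
# Sketch: a 1-D CNN over frozen BERT token embeddings, trained with RMSprop.
# Layer sizes, the frozen backbone, and the toy tweets/labels are assumptions, not the authors' configuration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()  # embeddings are kept frozen in this sketch

class TweetCNN(nn.Module):
    def __init__(self, emb_dim=768, n_filters=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(n_filters, 1)  # single logit: real vs. fake disaster tweet

    def forward(self, emb):  # emb: (batch, seq_len, emb_dim)
        x = torch.relu(self.conv(emb.transpose(1, 2)))
        return self.fc(self.pool(x).squeeze(-1))

model = TweetCNN()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

tweets = ["Forest fire near La Ronge Sask. Canada", "I love the new album, it's fire"]
labels = torch.tensor([[1.0], [0.0]])  # toy labels: 1 = real disaster tweet

with torch.no_grad():
    enc = tokenizer(tweets, padding=True, return_tensors="pt")
    embeddings = bert(**enc).last_hidden_state  # contextual BERT token embeddings

loss = loss_fn(model(embeddings), labels)
loss.backward()
optimizer.step()
print(float(loss))
```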

10.
PeerJ Comput Sci ; 10: e2166, 2024.
Article in English | MEDLINE | ID: mdl-38983236

ABSTRACT

Amid the wave of globalization, cultural amalgamation has become more frequent, bringing to the fore the challenges inherent in cross-cultural communication. To address these challenges, contemporary research has shifted its focus to human-computer dialogue. In the educational paradigm of human-computer dialogue in particular, analysing emotion recognition in user dialogues is especially important: accurately identifying and understanding users' emotional tendencies directly affects the efficiency and experience of human-computer interaction. This study aims to improve the capability of language emotion recognition in human-computer dialogue. It proposes a hybrid model (BCBA) based on bidirectional encoder representations from transformers (BERT), convolutional neural networks (CNN), bidirectional gated recurrent units (BiGRU), and the attention mechanism. The model leverages BERT to extract semantic and syntactic features from the text. Simultaneously, it integrates CNN and BiGRU networks to delve deeper into textual features, enhancing the model's proficiency in nuanced sentiment recognition. Furthermore, by introducing the attention mechanism, the model can assign different weights to words based on their emotional tendencies, enabling it to prioritize words with discernible emotional inclinations for more precise sentiment analysis. The BCBA model achieved remarkable results in emotion recognition and classification tasks through experimental validation on two datasets, significantly improving both accuracy and F1 scores, with an average accuracy of 0.84 and an average F1 score of 0.8. The confusion matrix analysis reveals a minimal classification error rate for this model. Additionally, as the number of iterations increases, the model's recall rate stabilizes at approximately 0.7. These results demonstrate the model's robust capabilities in semantic understanding and sentiment analysis and showcase its advantages in handling emotional characteristics in language expressions within a cross-cultural context. The BCBA model proposed in this study provides effective technical support for emotion recognition in human-computer dialogue, which is of great significance for building more intelligent and user-friendly human-computer interaction systems. In the future, we will continue to optimize the model's structure, improve its capability in handling complex emotions and cross-lingual emotion recognition, and explore applying the model to more practical scenarios to further promote the development and application of human-computer dialogue technology.
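A compact sketch of one way the BCBA stack (BERT, then CNN, then BiGRU, then attention) could be assembled is shown below; the backbone checkpoint and layer sizes are assumptions, not the authors' implementation.

```python
# Sketch: one plausible assembly of the BCBA stack (BERT -> CNN -> BiGRU -> attention).
# Backbone checkpoint and layer sizes are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import AutoModel

class BCBA(nn.Module):
    def __init__(self, n_classes, hidden=128, n_filters=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-multilingual-cased")  # placeholder backbone
        self.conv = nn.Conv1d(self.bert.config.hidden_size, n_filters, kernel_size=3, padding=1)
        self.bigru = nn.GRU(n_filters, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each time step for the attention weights
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h = torch.relu(self.conv(h.transpose(1, 2))).transpose(1, 2)  # (batch, seq_len, n_filters)
        h, _ = self.bigru(h)                                          # (batch, seq_len, 2 * hidden)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=-1)     # per-token attention weights
        context = torch.einsum("bs,bsh->bh", weights, h)              # weighted sum over the sequence
        return self.out(context)
```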

11.
Network ; : 1-34, 2024 Jul 17.
Article in English | MEDLINE | ID: mdl-39015012

ABSTRACT

Social media networks have become an active communication medium for connecting people and delivering new messages, and they can serve as the primary channel through which globalized events or incidents are explored. Earlier models fall short in capturing the temporal and spatial resolution needed to enhance efficacy. Therefore, this proposed model presents a new approach for event detection from social media data. First, the essential data are collected and pre-processed. Next, Bidirectional Encoder Representations from Transformers (BERT) and Term Frequency-Inverse Document Frequency (TF-IDF) are employed to extract features. The two resulting feature sets are then fed to the multi-scale and dilated layers of a GRU and Res-Bi-LSTM detection network, named Multi-scale and Dilated Adaptive Hybrid Deep Learning (MDA-HDL), for event detection. Moreover, the MDA-HDL network's parameters are tuned by the Improved Gannet Optimization Algorithm (IGOA) to enhance performance. Finally, the system is implemented on the Python platform, where it is validated and compared with baseline methodologies. The model achieves accuracy values of 94.96 for dataset 1 and 96.42 for dataset 2. Hence, the recommended model outperforms the baselines in detecting social events.

12.
Diagnostics (Basel) ; 14(13)2024 Jun 27.
Article in English | MEDLINE | ID: mdl-39001255

ABSTRACT

Metastatic breast cancer (MBC) continues to be a leading cause of cancer-related deaths among women. This work introduces an innovative non-invasive breast cancer classification model designed to improve the identification of cancer metastases. While this study marks an initial exploration into predicting MBC, additional investigations are essential to validate the occurrence of MBC. Our approach combines the strengths of large language models (LLMs), specifically the bidirectional encoder representations from transformers (BERT) model, with the powerful capabilities of graph neural networks (GNNs) to predict MBC patients based on their histopathology reports. This paper introduces a BERT-GNN approach for metastatic breast cancer prediction (BG-MBC) that integrates graph information derived from the BERT model. In this model, nodes are constructed from patient medical records, while BERT embeddings are employed to vectorise the words in histopathology reports, capturing semantic information crucial for classification. Three distinct feature-selection approaches (univariate selection, an extra-trees classifier for feature importance, and Shapley values) identify the features with the most significant impact; by selecting the 30 most crucial of the 676 features generated as embeddings during model training, the model further enhances its predictive capabilities. The BG-MBC model achieves outstanding accuracy, with a detection rate of 0.98 and an area under the curve (AUC) of 0.98, in identifying MBC patients. This remarkable performance is credited to the model's use of attention scores generated by the LLM from histopathology reports, effectively capturing pertinent features for classification.

13.
J Am Coll Emerg Physicians Open ; 5(4): e13206, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39056086

ABSTRACT

Objective: Patient violence in emergency departments (EDs) may be prevented with proactive mitigation measures targeting potentially violent patients. We aimed to evaluate the effects of two interventions guided by a validated risk-assessment tool. Methods: A prospective interventional study was conducted among patients ≥10 years who visited two EDs in Michigan, USA, from October 2022 to August 2023. During triage, the ED nurses completed the Aggressive Behavior Risk Assessment Tool for EDs (ABRAT-ED) to identify high-risk patients. Following the baseline observational period, interventions were implemented stepwise for the high-risk patients: phase 1 period with signage posting and phase 2 period with a proactive Behavioral Emergency Response Team (BERT) huddle added to the signage posting. Before ED disposition, any violent events and their severities were documented. The data were retrieved retrospectively after the study was completed. Results: Of 77,424 evaluable patients, 546 had ≥1 violent event. The violent event rates were 0.93%, 0.68%, and 0.62% for baseline, phase 1, and phase 2, respectively. The relative risk of violent events for phase 1 compared to the baseline was 0.73 (95% confidence interval [CI]: 0.59‒0.90; p = 0.003). The relative risk for phase 2 compared to phase 1 was 0.92 (95% CI: 0.76‒1.12; p = 0.418). Conclusion: The use of signage posting as a persistent visual cue for high-risk patients identified by ABRAT-ED appears to be effective in reducing the overall violent event rates. However, adding proactive BERT huddle to signage posting showed no significant reduction in the violent event rates compared to signage posting alone.
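As a quick arithmetic check, the reported relative risks follow directly from the rounded violent-event rates given above.

```python
# Quick check of the reported relative risks using the rounded violent-event rates from the abstract.
baseline, phase1, phase2 = 0.0093, 0.0068, 0.0062

rr_phase1_vs_baseline = phase1 / baseline  # ≈ 0.73, matching the reported relative risk
rr_phase2_vs_phase1 = phase2 / phase1      # ≈ 0.91; the reported 0.92 likely reflects unrounded rates
print(round(rr_phase1_vs_baseline, 2), round(rr_phase2_vs_phase1, 2))
```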

14.
Am J Epidemiol ; 2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39060160

ABSTRACT

Fall-related injuries (FRIs) are a major cause of hospitalizations among older patients, but identifying them in unstructured clinical notes poses challenges for large-scale research. In this study, we developed and evaluated Natural Language Processing (NLP) models to address this issue. We utilized all available clinical notes from Mass General Brigham for 2,100 older adults, identifying 154,949 paragraphs of interest through automatic scanning for FRI-related keywords. Two clinical experts directly labeled 5,000 paragraphs to generate benchmark-standard labels, while 3,689 validated patterns were annotated, indirectly labeling 93,157 paragraphs with validated-standard labels. Five NLP models, including vanilla BERT, RoBERTa, Clinical-BERT, Distil-BERT, and SVM, were trained using 2,000 benchmark paragraphs and all validated paragraphs. BERT-based models were trained in three stages: Masked Language Modeling, General Boolean Question Answering (QA), and QA for FRI. For validation, 500 benchmark paragraphs were used, with the remaining 2,500 used for testing. Performance metrics (precision, recall, F1 scores, and Area Under the ROC [AUROC] and Precision-Recall [AUPR] curves) were used for comparison, with RoBERTa showing the best performance: precision of 0.90 [0.88-0.91], recall within [0.90-0.93], F1 score of 0.90 [0.89-0.92], and AUROC and AUPR of 0.96 [0.95-0.97]. These NLP models accurately identify FRIs from unstructured clinical notes, potentially enhancing the efficiency of clinical notes-based research.

15.
J Hazard Mater ; 476: 135114, 2024 Sep 05.
Article in English | MEDLINE | ID: mdl-38986414

ABSTRACT

Toxicity identification plays a key role in maintaining human health, as it can alert humans to the potential hazards caused by long-term exposure to a wide variety of chemical compounds. Experimental methods for determining toxicity are time-consuming and costly, while computational methods offer an alternative for the early identification of toxicity; for example, some classical machine learning (ML) and deep learning (DL) methods demonstrate excellent performance in toxicity prediction. However, these methods also have defects, such as over-reliance on handcrafted features and a tendency to overfit. Proposing novel models with superior prediction performance therefore remains an urgent task. In this study, we propose a motif-level, graph-based, multi-view pretraining language model, called 3MTox, for toxicity identification. The 3MTox model uses Bidirectional Encoder Representations from Transformers (BERT) as its backbone framework and a motif graph as input. The results of extensive experiments show that the 3MTox model achieved state-of-the-art performance on toxicity benchmark datasets and outperformed the baseline models considered. In addition, the interpretability of the model ensures that it can quickly and accurately identify toxicity sites in a given molecule, thereby contributing to the determination of toxicity status and associated analyses. We consider the 3MTox model to be among the most promising tools currently available for toxicity identification.


Subjects
Chemical Models, Algorithms
16.
J Integr Bioinform ; 21(2)2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38960869

ABSTRACT

Cancer immunology offers a new alternative to traditional cancer treatments, such as radiotherapy and chemotherapy. One notable alternative is the development of personalized vaccines based on cancer neoantigens. Moreover, Transformers are considered a revolutionary development in artificial intelligence with a significant impact on natural language processing (NLP) tasks and have been utilized in proteomics studies in recent years. In this context, we conducted a systematic literature review to investigate how Transformers are applied in each stage of the neoantigen detection process. Additionally, we mapped current pipelines and examined the results of clinical trials involving cancer vaccines.


Subjects
Neoplasm Antigens, Neoplasms, Humans, Neoplasm Antigens/immunology, Neoplasms/immunology, Cancer Vaccines/immunology, Natural Language Processing, Artificial Intelligence
17.
BMC Med Inform Decis Mak ; 24(1): 205, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049015

ABSTRACT

BACKGROUND: Biomedical Relation Extraction (RE) is essential for uncovering complex relationships between biomedical entities within text. However, training RE classifiers is challenging in low-resource biomedical applications with few labeled examples. METHODS: We explore the potential of Shortest Dependency Paths (SDPs) to aid biomedical RE, especially in situations with limited labeled examples. In this study, we suggest various approaches to employ SDPs when creating word and sentence representations under supervised, semi-supervised, and in-context-learning settings. RESULTS: Through experiments on three benchmark biomedical text datasets, we find that incorporating SDP-based representations enhances the performance of RE classifiers. The improvement is especially notable when working with small amounts of labeled data. CONCLUSION: SDPs offer valuable insights into the complex sentence structure found in many biomedical text passages. Our study introduces several straightforward techniques that, as demonstrated experimentally, effectively enhance the accuracy of RE classifiers.
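To make the idea concrete, the sketch below extracts a shortest dependency path between two entity heads using spaCy and networkx, which is one common way to build SDP features; it is not necessarily the paper's implementation, and the sentence and token indices are illustrative.

```python
# Sketch: extracting a shortest dependency path (SDP) between two entity heads with spaCy and networkx.
# Requires the "en_core_web_sm" model; sentence and token indices are illustrative.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Aspirin inhibits platelet aggregation in patients with coronary disease.")

# Treat the dependency tree as an undirected graph over token indices.
edges = [(token.i, child.i) for token in doc for child in token.children]
graph = nx.Graph(edges)

head1, head2 = 0, 3  # token indices of "Aspirin" and "aggregation"
path = nx.shortest_path(graph, source=head1, target=head2)
print([doc[i].text for i in path])  # e.g. ['Aspirin', 'inhibits', 'aggregation']
```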


Subjects
Data Mining, Natural Language Processing, Humans, Data Mining/methods, Machine Learning
18.
PeerJ Comput Sci ; 10: e2058, 2024.
Article in English | MEDLINE | ID: mdl-38855259

ABSTRACT

Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. One effective approach to knowledge graph completion is knowledge graph embedding. However, existing embedding methods usually focus on developing deeper and more complex neural networks, or on leveraging additional information, which inevitably increases computational complexity and is unfriendly to real-time applications. In this article, we propose an effective BERT-enhanced shallow neural network model for knowledge graph completion named ShallowBKGC. Specifically, given an entity pair, we first apply the pre-trained language model BERT to extract text features of the head and tail entities. At the same time, we use an embedding layer to extract structure features of the head and tail entities. The text and structure features are then integrated into one entity-pair representation via an averaging operation followed by a non-linear transformation. Finally, based on the entity-pair representation, we calculate the probability of each relation through multi-label modeling to predict relations for the given entity pair. Experimental results on three benchmark datasets show that our model achieves superior performance in comparison with baseline methods. The source code for this article is available at https://github.com/Joni-gogogo/ShallowBKGC.
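A rough sketch of the described architecture is given below (BERT text features plus entity-embedding structure features, averaged, passed through a non-linear transformation, and scored against all relations); the dimensions and the use of [CLS] vectors are assumptions, and the linked repository contains the authors' actual code.

```python
# Rough sketch of the described model: BERT text features + entity-embedding structure features,
# averaged and non-linearly transformed into an entity-pair representation scored against all relations.
# Dimensions and the use of [CLS] vectors are assumptions; see the linked repository for the real code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ShallowKGC(nn.Module):
    def __init__(self, n_entities, n_relations, dim=768):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.entity_emb = nn.Embedding(n_entities, dim)            # structure features
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())  # non-linear transformation
        self.rel_out = nn.Linear(dim, n_relations)                 # one logit per candidate relation

    def forward(self, head_id, tail_id, head_name, tail_name, tokenizer):
        enc = tokenizer([head_name, tail_name], padding=True, return_tensors="pt")
        text = self.bert(**enc).last_hidden_state[:, 0].mean(dim=0)            # averaged [CLS] text features
        structure = (self.entity_emb(head_id) + self.entity_emb(tail_id)) / 2  # averaged structure features
        pair = self.proj((text + structure) / 2)                               # entity-pair representation
        return torch.sigmoid(self.rel_out(pair))                               # multi-label relation probabilities

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ShallowKGC(n_entities=100, n_relations=10)
probs = model(torch.tensor(3), torch.tensor(7), "Barack Obama", "United States", tokenizer)
print(probs.shape)  # torch.Size([10])
```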

19.
Heliyon ; 10(11): e32279, 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38912449

ABSTRACT

Early cancer detection and treatment depend on the discovery of specific genes that cause cancer. The classification of genetic mutations was initially done manually; however, this process relies on pathologists and can be time-consuming. Therefore, to improve the precision of clinical interpretation, researchers have developed computational algorithms that leverage next-generation sequencing technologies for automated mutation analysis. This paper utilized four deep learning classification models trained on collections of biomedical texts. These models include Bidirectional Encoder Representations from Transformers for Biomedical text mining (BioBERT), a specialized language model implemented for biological contexts; impressive results in multiple tasks, including text classification, language inference, and question answering, can be obtained by simply adding an extra layer to the BioBERT model. In addition, Bidirectional Encoder Representations from Transformers (BERT), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM) models have been leveraged to produce very good results in categorizing genetic mutations based on textual evidence. The dataset used in this work was created by Memorial Sloan Kettering Cancer Center (MSKCC) and contains several mutations; it also poses a major classification challenge in the Kaggle research prediction competitions. In carrying out the work, three challenges were identified: enormous text length, biased representation of the data, and repeated data instances. Based on commonly used evaluation metrics, the experimental results show that the BioBERT model outperforms the other models with an F1 score of 0.87 and an MCC of 0.85, an improvement over similar results in the literature, where an F1 score of 0.70 was achieved with the BERT model.

20.
J Cheminform ; 16(1): 71, 2024 Jun 19.
Article in English | MEDLINE | ID: mdl-38898528

ABSTRACT

Among the various molecular properties and their combinations, obtaining the desired molecular properties through theory or experiment is a costly process. Using machine learning to analyze molecular structure features and predict molecular properties is a potentially efficient alternative for accelerating molecular property prediction. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network to extract molecular structural features and predict molecular properties. A SMILES sequence comprises symbols representing molecular structures. To address the problem that a SMILES sequence differs from actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, such that the model learns to extract the molecular structural information contained in the SMILES sequence. In an experiment, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction. SCIENTIFIC CONTRIBUTION: The 2-encoder pretraining is proposed based on the observation that symbols in a SMILES sequence depend less on their contextual environment than words in a natural language sentence do, and that one compound corresponds to multiple SMILES sequences. The model pretrained with the 2-encoder shows higher robustness in molecular property prediction tasks compared to BERT, which is adapted to natural language.
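For orientation, the sketch below shows plain BERT-style masked-language-model pretraining on SMILES strings; the paper's 2-encoder scheme modifies this recipe, and the English wordpiece tokenizer used here is only a placeholder for a proper atom- or character-level SMILES tokenizer.

```python
# Sketch: vanilla BERT-style masked-language-model pretraining on SMILES strings.
# The paper's 2-encoder scheme modifies this recipe, and the English wordpiece tokenizer below
# is only a placeholder for an atom/character-level SMILES tokenizer.
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4)
model = BertForMaskedLM(config)  # small model trained from scratch

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]
examples = [tokenizer(s, truncation=True, max_length=64) for s in smiles]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)  # randomly masks 15% of tokens
batch = collator(examples)

loss = model(**batch).loss  # cross-entropy over the masked positions
loss.backward()
print(float(loss))
```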
