Search | Fiocruz Virtual Health Library

1.

Recognition of cyanobacteria promoters via Siamese network-based contrastive learning under novel non-promoter generation.

Yang, Guang; Li, Jianing; Hu, Jinlu; Shi, Jian-Yu.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38701419

ABSTRACT

It is a vital step to recognize cyanobacteria promoters on a genome-wide scale. Computational methods are promising to assist in difficult biological identification. When building recognition models, these methods rely on non-promoter generation to cope with the lack of real non-promoters. Nevertheless, the factitious significant difference between promoters and non-promoters causes over-optimistic prediction. Moreover, designed for E. coli or B. subtilis, existing methods cannot uncover novel, distinct motifs among cyanobacterial promoters. To address these issues, this work first proposes a novel non-promoter generation strategy called phantom sampling, which can eliminate the factitious difference between promoters and generated non-promoters. Furthermore, it elaborates a novel promoter prediction model based on the Siamese network (SiamProm), which can amplify the hidden difference between promoters and non-promoters through a joint characterization of global associations, upstream and downstream contexts, and neighboring associations w.r.t. k-mer tokens. The comparison with state-of-the-art methods demonstrates the superiority of our phantom sampling and SiamProm. Both comprehensive ablation studies and feature space illustrations also validate the effectiveness of the Siamese network and its components. More importantly, SiamProm, upon our phantom sampling, finds a novel cyanobacterial promoter motif ('GCGATCGC'), which is palindrome-patterned, content-conserved, but position-shifted.

Subjects

Cyanobacteria , Genetic Promoter Regions , Cyanobacteria/genetics , Computational Biology/methods , Algorithms

2.

MiRAGE: mining relationships for advanced generative evaluation in drug repositioning.

Hassanali Aragh, Aria; Givehchian, Pegah; Moslemi Amirani, Razieh; Masumshah, Raziyeh; Eslahchi, Changiz.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-39038932

ABSTRACT

MOTIVATION: Drug repositioning, the identification of new therapeutic uses for existing drugs, is crucial for accelerating drug discovery and reducing development costs. Some methods rely on heterogeneous networks, which may not fully capture the complex relationships between drugs and diseases. However, integrating diverse biological data sources offers promise for discovering new drug-disease associations (DDAs). Previous evidence indicates that the combination of information would be conducive to the discovery of new DDAs. However, the challenge lies in effectively integrating different biological data sources to identify the most effective drugs for a certain disease based on drug-disease coupled mechanisms. RESULTS: In response to this challenge, we present MiRAGE, a novel computational method for drug repositioning. MiRAGE leverages a three-step framework, comprising negative sampling using hard negative mining, classification employing random forest models, and feature selection based on feature importance. We evaluate MiRAGE on multiple benchmark datasets, demonstrating its superiority over state-of-the-art algorithms across various metrics. Notably, MiRAGE consistently outperforms other methods in uncovering novel DDAs. Case studies focusing on Parkinson's disease and schizophrenia showcase MiRAGE's ability to identify top candidate drugs supported by previous studies. Overall, our study underscores MiRAGE's efficacy and versatility as a computational tool for drug repositioning, offering valuable insights for therapeutic discoveries and addressing unmet medical needs.

Subjects

Algorithms , Data Mining , Drug Repositioning , Drug Repositioning/methods , Data Mining/methods , Humans , Computational Biology/methods , Schizophrenia/drug therapy , Parkinson Disease/drug therapy , Drug Discovery/methods

3.

Predicting microbe-drug associations with structure-enhanced contrastive learning and self-paced negative sampling strategy.

Tian, Zhen; Yu, Yue; Fang, Haichuan; Xie, Weixin; Guo, Maozu.

Brief Bioinform ; 24(2)2023 Mar 19.

Article in English | MEDLINE | ID: mdl-36715986

ABSTRACT

MOTIVATION: Predicting microbe-drug associations (MDAs) between human microbes and drugs is a critical step in drug development and precision medicine. Since discovering these associations through wet-lab experiments is time-consuming and labor-intensive, computational methods have become an effective way to tackle this problem. Recently, graph contrastive learning (GCL) approaches have shown great advantages in learning node embeddings from heterogeneous biological graphs (HBGs). However, most GCL-based approaches do not fully capture the rich structural information in HBGs. Moreover, few MDA prediction methods can screen out the most informative negative samples for effectively training the classifier. The accuracy of MDA prediction therefore still needs to be improved. RESULTS: In this study, we propose a novel approach that employs Structure-enhanced Contrastive learning and a Self-paced negative sampling strategy for Microbe-Drug Association prediction (SCSMDA). First, SCSMDA constructs the similarity networks of microbes and drugs, as well as their different meta-path-induced networks. Then, SCSMDA uses the representations of microbes and drugs learned from the meta-path-induced networks to enhance, through contrastive learning, the embeddings learned from the similarity networks, which yields more discriminative representations. After that, the self-paced negative sampling strategy selects the most informative negative samples to train the MLP classifier. Lastly, SCSMDA predicts potential microbe-drug associations with the trained MLP classifier. Extensive results on three public datasets indicate that SCSMDA significantly outperforms other baseline methods on the MDA prediction task. Case studies for two common drugs further demonstrate the effectiveness of SCSMDA in finding novel MDAs. AVAILABILITY: The source code is publicly available on GitHub at https://github.com/Yue-Yuu/SCSMDA-master.

Subjects

Drug Development , Precision Medicine , Humans , Software

4.

Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data.

Sidorczuk, Katarzyna; Gagat, Przemyslaw; Pietluch, Filip; Kala, Jakub; Rafacz, Dominik; Bakala, Laura; Slowik, Jadwiga; Kolenda, Rafal; Rödiger, Stefan; Fingerhut, Legana C H W; Cooke, Ira R; Mackiewicz, Pawel; Burdukiewicz, Michal.

Brief Bioinform ; 23(5)2022 Sep 20.

Article in English | MEDLINE | ID: mdl-35988923

ABSTRACT

Antimicrobial peptides (AMPs) are a heterogeneous group of short polypeptides that target not only microorganisms but also viruses and cancer cells. Because they select for resistance less strongly than traditional antibiotics, AMPs have attracted ever-growing attention from researchers, including bioinformaticians. Machine learning represents the most cost-effective method for novel AMP discovery, and consequently many computational tools for AMP prediction have recently been developed. In this article, we investigate the impact of negative data sampling on model performance and benchmarking. We generated 660 predictive models using 12 machine learning architectures, a single positive data set and 11 negative data sampling methods; the architectures and methods were defined on the basis of published AMP prediction software. Our results clearly indicate that similar training and benchmark data sets, i.e. those produced by the same or a similar negative data sampling method, positively affect model performance. Consequently, all the benchmark analyses that have been performed for AMP prediction models are significantly biased and, moreover, we do not know which model is the most accurate. To provide researchers with reliable information about the performance of AMP predictors, we also created a web server, AMPBenchmark, for fair model benchmarking. AMPBenchmark is available at http://BioGenies.info/AMPBenchmark.

Subjects

Antimicrobial Peptides , Benchmarking , Anti-Bacterial Agents , Peptides/chemistry

5.

Contrastive Speaker Representation Learning with Hard Negative Sampling for Speaker Recognition.

Go, Changhwan; Lee, Young Han; Kim, Taewoo; Park, Nam In; Chun, Chanjun.

Sensors (Basel) ; 24(19)2024 Sep 25.

Article in English | MEDLINE | ID: mdl-39409253

ABSTRACT

Speaker recognition is a technology that identifies the speaker of an input utterance by extracting speaker-distinguishable features from the speech signal. Speaker recognition is used for system security and authentication; therefore, it is crucial to extract unique features of the speaker to achieve high recognition rates. Representative methods for extracting these features include a classification approach and a contrastive learning approach, which learns the speaker relationship between representations and then uses embeddings extracted from a specific layer of the model. This paper introduces a framework for developing robust speaker recognition models through contrastive learning. The approach aims to minimize the similarity to hard negative samples, that is, genuine negatives whose features are extremely similar to those of the positives and can therefore be mistaken for them. Specifically, our proposed method trains the model by estimating hard negative samples within a mini-batch during contrastive learning, and then utilizes a cross-attention mechanism to determine speaker agreement for pairs of utterances. To demonstrate the effectiveness of our proposed method, we compared a deep learning model trained with a conventional speaker recognition loss function against a deep learning model trained using our proposed method, measured by the equal error rate (EER), an objective performance metric. Our results indicate that, when trained on the VoxCeleb2 dataset, the proposed method achieved an EER of 0.98% on the VoxCeleb1-E dataset and 1.84% on the VoxCeleb1-H dataset.
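
The equal error rate reported in this abstract is the operating point at which the false acceptance rate equals the false rejection rate. The block below is a minimal Python sketch of that definition, not the authors' evaluation code. The function name `equal_error_rate`, the toy scores, and the convention that a trial is accepted when its score is at or above the threshold are assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Assumption: a trial is accepted when score >= threshold.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials that are accepted.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that are rejected.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores for same-speaker and different-speaker pairs.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))  # 0.25
```

With few trials the two error rates rarely cross exactly, so the sketch returns their average at the closest threshold; production toolkits usually interpolate along the ROC curve instead.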

Subjects

Speech , Humans , Speech/physiology , Algorithms , Deep Learning , Automated Pattern Recognition/methods , Speech Recognition Software

6.

Association filtering and generative adversarial networks for predicting lncRNA-associated disease.

Zhong, Hua; Luo, Jing; Tang, Lin; Liao, Shicheng; Lu, Zhonghao; Lin, Guoliang; Murphy, Robert W; Liu, Lin.

BMC Bioinformatics ; 24(1): 234, 2023 Jun 05.

Article in English | MEDLINE | ID: mdl-37277721

ABSTRACT

BACKGROUND: Long non-coding RNAs (lncRNAs) are closely associated with numerous biological processes and with many diseases. Therefore, lncRNA-disease association prediction helps obtain relevant biological information and understand pathogenesis, and thus better diagnose preventable diseases. RESULTS: Herein, we offer the LDAF_GAN method for predicting lncRNA-associated disease based on association filtering and generative adversarial networks. Experiments used two types of data: lncRNA-disease association data without lncRNA sequence features, and data fused with lncRNA sequence features. LDAF_GAN uses a generator and a discriminator, and differs from the original GAN by the addition of a filtering operation and negative sampling. Filtering removes unassociated diseases from the generator output before it is fed into the discriminator. Thus, the results generated by the model focus only on lncRNAs associated with disease. Negative sampling takes a portion of the disease terms with value 0 in the association matrix as negative samples, which are assumed to be unassociated with the lncRNA. A regularization term is added to the loss function to avoid producing a vector with all values equal to 1, which could fool the discriminator. Thus, the model requires that generated positive samples are close to 1 and negative samples are close to 0. The model achieved a superior fitting effect; LDAF_GAN had superior performance in fivefold cross-validation on the two datasets, with AUC values of 0.9265 and 0.9278, respectively. In the case study, LDAF_GAN predicted disease associations for six lncRNAs (H19, MALAT1, XIST, ZFAS1, UCA1, and ZEB1-AS1); 100%, 80%, 90%, 90%, 100%, and 90% of their top ten predictions, respectively, were reported by previous studies. CONCLUSION: LDAF_GAN efficiently predicts the potential associations of existing lncRNAs and of new lncRNAs with diseases. The results of fivefold cross-validation, tenfold cross-validation, and case studies suggest that the model has great predictive potential for lncRNA-disease association prediction.
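
The negative sampling step in this abstract, taking a portion of the zero entries of the association matrix as assumed negatives, can be sketched in Python as follows. This is an illustration only and not the LDAF_GAN code: the function name `sample_negatives`, the `fraction` and `seed` parameters, and the toy matrix are all hypothetical.

```python
import random


def sample_negatives(association, fraction=0.5, seed=0):
    """Draw a fraction of the zero-valued cells as assumed negatives."""
    zero_cells = [
        (i, j)
        for i, row in enumerate(association)
        for j, value in enumerate(row)
        if value == 0
    ]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    count = int(len(zero_cells) * fraction)
    return rng.sample(zero_cells, count)  # sampling without replacement


# Rows are lncRNAs, columns are diseases; 1 marks a known association.
association = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
negatives = sample_negatives(association, fraction=0.5)
print(len(negatives), "negative pairs sampled")
```

A zero here means "no known association", not "confirmed absence", which is why the abstract calls these samples assumed negatives.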

Subjects

Long Noncoding RNA , Long Noncoding RNA/genetics , Algorithms , Computational Biology/methods

7.

A Knowledge-Grounded Task-Oriented Dialogue System with Hierarchical Structure for Enhancing Knowledge Selection.

Lee, Hayoung; Jeong, Okran.

Sensors (Basel) ; 23(2)2023 Jan 06.

Article in English | MEDLINE | ID: mdl-36679481

ABSTRACT

For a task-oriented dialogue system to provide appropriate answers to and services for users' questions, it is necessary for it to be able to utilize knowledge related to the topic of the conversation. Therefore, the system should be able to select the most appropriate knowledge snippet from the knowledge base, where external unstructured knowledge is used to respond to user requests that cannot be solved by the internal knowledge addressed by the database or application programming interface. Therefore, this paper constructs a three-step knowledge-grounded task-oriented dialogue system with knowledge-seeking-turn detection, knowledge selection, and knowledge-grounded generation. In particular, we propose a hierarchical structure of domain-classification, entity-extraction, and snippet-ranking tasks by subdividing the knowledge selection step. Each task is performed through the pre-trained language model with advanced techniques to finally determine the knowledge snippet to be used to generate a response. Furthermore, the domain and entity information obtained because of the previous task is used as knowledge to reduce the search range of candidates, thereby improving the performance and efficiency of knowledge selection and proving it through experiments.

Subjects

Communication , Language , Knowledge Bases , Natural Language Processing , Software

8.

Deep Metric Learning Using Negative Sampling Probability Annealing.

Kertész, Gábor.

Sensors (Basel) ; 22(19)2022 Oct 06.

Article in English | MEDLINE | ID: mdl-36236678

ABSTRACT

Multiple studies have concluded that the selection of input samples is key for deep metric learning. For triplet networks, the selection of the anchor, positive, and negative samples is referred to as triplet mining. The selection of the negatives is considered to be the most complicated task, due to the large number of possibilities. The goal is to select a negative that results in a positive triplet loss; however, there are multiple approaches for this: semi-hard negative mining and hardest mining are well known, in addition to random selection. Since its introduction, semi-hard mining has been shown to outperform other negative mining techniques; however, in recent years, the selection of the so-called hardest negative has shown promising results in different experiments. This paper introduces a novel negative sampling solution based on dynamic policy switching, referred to as negative sampling probability annealing, which aims to exploit the advantages of all approaches. Results are validated on an experimental synthetic dataset using cluster-analysis methods; finally, the discriminative abilities of trained models are measured on real-life data.
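
The two well-known mining policies named in this abstract can be illustrated with the standard triplet loss, max(0, d(a, p) - d(a, n) + margin). The Python sketch below shows only those textbook definitions and does not implement the paper's probability annealing. The function names, the margin of 0.2, and the toy embeddings are assumptions chosen for illustration.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


def hardest_negative(anchor, negatives):
    """Hardest mining: pick the negative closest to the anchor."""
    return min(negatives, key=lambda n: distance(anchor, n))


def semi_hard_negatives(anchor, positive, negatives, margin=0.2):
    """Semi-hard mining: negatives farther than the positive but still
    inside the margin, so d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = distance(anchor, positive)
    return [n for n in negatives if d_ap < distance(anchor, n) < d_ap + margin]


# Toy 2-D embeddings (hypothetical values).
anchor, positive = (0.0, 0.0), (1.0, 0.0)
negatives = [(0.5, 0.0), (1.1, 0.0), (3.0, 0.0)]
print(hardest_negative(anchor, negatives))               # (0.5, 0.0)
print(semi_hard_negatives(anchor, positive, negatives))  # [(1.1, 0.0)]
```

The far negative at (3.0, 0.0) yields a loss of zero and so contributes no gradient, which is why random selection alone tends to be inefficient.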

Subjects

Algorithms , Probability

9.

Predicting Drug-Target Interactions with Electrotopological State Fingerprints and Amphiphilic Pseudo Amino Acid Composition.

Wang, Cheng; Wang, Wenyan; Lu, Kun; Zhang, Jun; Chen, Peng; Wang, Bing.

Int J Mol Sci ; 21(16)2020 Aug 08.

Article in English | MEDLINE | ID: mdl-32784497

ABSTRACT

The task of drug-target interaction (DTI) prediction plays important roles in drug development. The experimental methods in DTIs are time-consuming, expensive and challenging. To solve these problems, machine learning-based methods are introduced, which are restricted by effective feature extraction and negative sampling. In this work, features with electrotopological state (E-state) fingerprints for drugs and amphiphilic pseudo amino acid composition (APAAC) for target proteins are tested. E-state fingerprints are extracted based on both molecular electronic and topological features with the same metric. APAAC is an extension of amino acid composition (AAC), which is calculated based on hydrophilic and hydrophobic characters to construct sequence order information. Using the combination of these feature pairs, the prediction model is established by support vector machines. In order to enhance the effectiveness of features, a distance-based negative sampling is proposed to obtain reliable negative samples. It is shown that the prediction results of area under curve for Receiver Operating Characteristic (AUC) are above 98.5% for all the three datasets in this work. The comparison of state-of-the-art methods demonstrates the effectiveness and efficiency of proposed method, which will be helpful for further drug development.

Subjects

Amino Acids/metabolism , Drug Development , Electrochemistry/methods , Surface-Active Agents/chemistry , Area Under Curve , ROC Curve , Reference Standards

10.

How to balance the bioinformatics data: pseudo-negative sampling.

Zhang, Yongqing; Qiao, Shaojie; Lu, Rongzhao; Han, Nan; Liu, Dingxiang; Zhou, Jiliu.

BMC Bioinformatics ; 20(Suppl 25): 695, 2019 Dec 24.

Article in English | MEDLINE | ID: mdl-31874622

ABSTRACT

BACKGROUND: Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance leads to underestimating the performance on the minority class of positive samples. Therefore, how to balance bioinformatics data becomes a very challenging problem. RESULTS: In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be effectively applied when negative samples greatly outnumber positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond the Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and treat them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples and reduce the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to the positive samples and low redundancy with the negative ones. CONCLUSIONS: To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, MMPCC clearly performs better than other sampling methods in terms of sensitivity, specificity, accuracy and the Matthews correlation coefficient. This reveals that the pseudo-negative samples are particularly helpful for solving the imbalanced dataset problem. Moreover, the gain in sensitivity from the minority samples with pseudo-negative samples grows with the improvement in prediction accuracy on all datasets.
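
The four evaluation metrics named in this abstract follow directly from a binary confusion matrix. The Python sketch below uses the standard textbook formulas; the function name `confusion_metrics` and the counts are hypothetical, chosen to show how accuracy can look strong on an imbalanced set while sensitivity stays poor.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient.

    Assumes both classes are present, so tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 10 positives, 990 negatives.
metrics = confusion_metrics(tp=5, fp=10, tn=980, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these counts accuracy is 0.985 although only half of the positives are found, which is the kind of gap that makes MCC and sensitivity the more informative figures for imbalanced data.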

Subjects

Computational Biology/methods , Sensitivity and Specificity

11.

Towards a better negative sampling strategy for dynamic graphs.

Gao, Kuang; Liu, Chuang; Wu, Jia; Du, Bo; Hu, Wenbin.

Neural Netw ; 173: 106175, 2024 May.

Article in English | MEDLINE | ID: mdl-38387201

ABSTRACT

As dynamic graphs have become indispensable in numerous fields due to their capacity to represent evolving relationships over time, there has been a concomitant increase in the development of Temporal Graph Neural Networks (TGNNs). When training TGNNs for dynamic graph link prediction, the commonly used negative sampling method often produces starkly contrasting samples, which can lead the model to overfit these pronounced differences and compromise its ability to generalize effectively to new data. To address this challenge, we introduce an innovative negative sampling approach named Enhanced Negative Sampling (ENS). This strategy takes into account two pervasive traits observed in dynamic graphs: (1) Historical dependence, indicating that nodes frequently reestablish connections they held in the past, and (2) Temporal proximity preference, which posits that nodes are more inclined to connect with those they have recently interacted with. Specifically, our technique employs a designed scheduling function to strategically control the progression of difficulty of the negative samples throughout the training. This ensures that the training progresses in a balanced manner, becoming incrementally challenging, and thereby enhancing TGNNs' proficiency in predicting links within dynamic graphs. In our empirical evaluation across multiple datasets, we discerned that our ENS, when integrated as a modular component, notably augments the performance of four SOTA baselines. Additionally, we further investigated the applicability of ENS in handling dynamic graphs of varied attributes. Our code is available at https://github.com/qqaazxddrr/ENS.

Subjects

Computer Neural Networks

12.

M²ixKG: Mixing for harder negative samples in knowledge graph.

Che, Feihu; Tao, Jianhua.

Neural Netw ; 177: 106358, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38805795

ABSTRACT

Knowledge graph embedding (KGE) involves mapping entities and relations to low-dimensional dense embeddings, enabling a wide range of real-world applications. The mapping is achieved by distinguishing the positive and negative triplets in knowledge graphs. Therefore, how to design high-quality negative triplets is critical to the effectiveness of KGE models. Existing KGE models face challenges in generating high-quality negative triplets. Some models employ simple static distributions, i.e. the uniform or Bernoulli distribution, and these methods are difficult to train discriminatively because the sampled negative triplets are uninformative. Furthermore, current methods are confined to constructing negative triplets from existing entities within the knowledge graph, limiting their ability to explore harder negatives. We introduce a novel mixing strategy for knowledge graphs called M²ixKG. M²ixKG adopts a mixing operation to generate harder negative samples in two ways: one is mixing among the heads and tails in triplets with the same relation to strengthen the robustness and generalization of the entity embeddings; the other is mixing the negatives with high scores to generate harder negatives. Our experiments, utilizing three datasets and four classical score functions, highlight the exceptional performance of M²ixKG in comparison to previous negative sampling algorithms.

Subjects

Algorithms , Computer Neural Networks , Knowledge , Humans

13.

Hypergraph contrastive attention networks for hyperedge prediction with negative samples evaluation.

Wang, Junbo; Chen, Jianrui; Wang, Zhihui; Gong, Maoguo.

Neural Netw ; 181: 106807, 2024 Oct 19.

Article in English | MEDLINE | ID: mdl-39447434

ABSTRACT

Hyperedge prediction aims to predict common relations among multiple nodes that will occur in the future or remain undiscovered in the current hypergraph. It is traditionally modeled as a classification task, which performs hypergraph feature learning and classifies the target samples as either present or absent. However, these approaches involve two issues: (i) in hyperedge feature learning, they fail to measure the influence of nodes on the hyperedges that include them and the neighboring hyperedges, and (ii) in the binary classification task, the quality of the generated negative samples directly impacts the prediction results. To this end, we propose a Hypergraph Contrastive Attention Network (HCAN) model for hyperedge prediction. Inspired by the brain organization, HCAN considers the influence of hyperedges with different orders through the order propagation attention mechanism. It also utilizes the contrastive mechanism to measure the reliability of attention effectively. Furthermore, we design a negative sample generator to produce three different types of negative samples. We evaluate the impact of various negative samples on the model and analyze the problems of binary classification modeling. The effectiveness of HCAN in hyperedge prediction is validated by experimentally comparing 12 baselines on 9 datasets. Our implementations will be publicly available at https://github.com/jianruichen/HCAN.

14.

Medical prediction from missing data with max-minus negative regularized dropout.

Hu, Lvhui; Cheng, Xiaoen; Wen, Chuanbiao; Ren, Yulan.

Front Neurosci ; 17: 1221970, 2023.

Article in English | MEDLINE | ID: mdl-37521692

ABSTRACT

Missing data is a common problem in medical research. Imputation is a widely used technique to alleviate this problem. Unfortunately, the inherent uncertainty of imputation can make the model overfit the observed data distribution, which has a negative impact on the model's generalization performance. R-Drop is a powerful technique to regularize the training of deep neural networks. However, it fails to differentiate the positive and negative samples, which prevents the model from learning robust representations. To handle this problem, we propose a novel negative regularization enhanced R-Drop scheme to boost performance and generalization ability, particularly in the context of missing data. The negative regularization enhanced R-Drop additionally forces the output distributions of positive and negative samples to be inconsistent with each other. In particular, we design a new max-minus negative sampling technique that subtracts the mini-batch from the maximum in-batch values to yield negative samples that provide sufficient diversity for the model. We test the resulting max-minus negative regularized dropout method on three real-world medical prediction datasets, including both missing and complete cases, to show the effectiveness of the proposed method.
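
The abstract describes max-minus negative sampling only in one sentence, so the Python sketch below is one possible reading rather than the authors' method. It assumes that the maximum is taken per feature column across the mini-batch and that each negative is that maximum minus the sample's own value. The function name `max_minus_negatives` and the toy batch are hypothetical.

```python
def max_minus_negatives(batch):
    """Build one negative per sample as (column maximum - sample value).

    Assumption: the 'maximum in-batch values' are per-feature maxima;
    the abstract does not say whether the maximum is per feature or global.
    """
    column_max = [max(column) for column in zip(*batch)]
    return [
        [maximum - value for maximum, value in zip(column_max, sample)]
        for sample in batch
    ]


# Toy mini-batch of three samples with two features each.
batch = [[1.0, 5.0], [3.0, 2.0], [2.0, 4.0]]
print(max_minus_negatives(batch))  # [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
```

Under this reading, each negative mirrors its sample within the batch range, so large feature values become small and vice versa.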

15.

A data-centric way to improve entity linking in knowledge-based question answering.

Liu, Shuo; Zhou, Gang; Xia, Yi; Wu, Hao; Li, Zhufeng.

PeerJ Comput Sci ; 9: e1233, 2023.

Article in English | MEDLINE | ID: mdl-37346650

ABSTRACT

Entity linking in knowledge-based question answering (KBQA) is intended to construct a mapping relation between a mention in a natural language question and an entity in the knowledge base. Most research in entity linking focuses on long text, but entity linking in open domain KBQA is more concerned with short text. Many recent models have tried to extract the features of raw data by adjusting the neural network structure. However, the models only perform well with several datasets. We therefore concentrate on the data rather than the model itself and created a model DME (Domain information Mining and Explicit expressing) to extract domain information from short text and append it to the data. The entity linking model will be enhanced by training with DME-processed data. Besides, we also developed a novel negative sampling approach to make the model more robust. We conducted experiments using the large Chinese open source benchmark KgCLUE to assess model performance with DME-processed data. The experiments showed that our approach can improve entity linking in the baseline models without the need to change their structure and our approach is demonstrably transferable to other datasets.

16.

Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature.

Xie, Weixin; Fan, Kunjie; Zhang, Shijun; Li, Lang.

J Biomed Semantics ; 14(1): 5, 2023 May 30.

Article in English | MEDLINE | ID: mdl-37248476

ABSTRACT

BACKGROUND: Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of relatively few positive DDI samples among overwhelmingly many negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper. RESULTS: PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keywords query in PubMed, while the unscreened pool includes all the other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In the screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves the precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, the integrated random negative sampling, positive sampling, and similarity sampling improve the precision over uncertainty sampling alone, from 0.72 to 0.81. When we change the SVM to a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened pool and the unscreened pool. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool, and 0.90 vs. 0.81 in the unscreened pool. CONCLUSIONS: By integrating various sampling schemes and deep learning algorithms into AL, the DDI IR analysis from the literature is significantly improved. Random negative sampling and positive sampling are highly effective methods for improving AL analysis where the positive and negative samples are extremely imbalanced.
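
Evaluating precision at a prespecified recall, as this abstract does at 0.95, means lowering the decision threshold until the required share of positives is retrieved and then measuring precision at that point. The Python sketch below illustrates the idea under stated assumptions: the function name `precision_at_recall`, the toy scores, and the choice to ignore tied scores are not taken from the paper.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the first rank cutoff whose recall reaches the target.

    labels are 1 for relevant (positive) abstracts and 0 otherwise.
    Tied scores are not treated specially in this sketch.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for retrieved, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / retrieved
    return 0.0  # unreachable for target_recall <= 1, kept as a safe default


# Hypothetical classifier scores for six abstracts, three of them relevant.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(precision_at_recall(scores, labels, target_recall=0.95))  # 0.75
```

Fixing recall first reflects the retrieval goal of missing as few relevant abstracts as possible, with precision then measuring how much manual review that recall costs.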

Subjects

Deep Learning , Information Storage and Retrieval , Algorithms , Drug Interactions , PubMed

17.

Integrated Random Negative Sampling and Uncertainty Sampling in Active Learning Improve Clinical Drug Safety Drug-Drug Interaction Information Retrieval.

Xie, Weixin; Wang, Limei; Cheng, Qi; Wang, Xueying; Wang, Ying; Bi, Hongyuan; He, Bo; Feng, Weixing.

Front Pharmacol ; 11: 582470, 2020.

Article in English | MEDLINE | ID: mdl-34017245

ABSTRACT

Clinical drug-drug interactions (DDIs) are a major cause of both medical errors and adverse drug events (ADEs). The published literature on DDI clinical toxicity continues to grow significantly, and high-performance DDI information retrieval (IR) text mining methods are in high demand. The effectiveness of IR and its machine learning (ML) algorithm depends on the availability of a large amount of training and validation data that have been manually reviewed and annotated. In this study, we investigated how active learning (AL) might improve ML performance in clinical safety DDI IR analysis. We recognized that a direct application of AL would not address several primary challenges in DDI IR from the literature. For instance, the vast majority of abstracts in PubMed will be negative, existing positive and negative labeled samples do not represent the general sample distributions, and potentially biased samples may arise during uncertainty sampling in an AL algorithm. Therefore, we developed several novel sampling and ML schemes to improve AL performance in DDI IR analysis. In particular, random negative sampling was added as a part of AL because it incurs no manual labeling expense. We also used two ML algorithms in an AL process to differentiate random negative samples from manually labeled negative samples, and updated both the training and validation samples during the AL process to avoid or reduce biased sampling. Two supervised ML algorithms, support vector machine (SVM) and logistic regression (LR), were used to investigate the consistency of our proposed AL algorithm. Because the ultimate goal of clinical safety DDI IR is to retrieve all DDI toxicity-relevant abstracts, a target recall of 0.99 was fixed when developing the AL methods.
When we used our newly proposed AL method with SVM, the precision in differentiating the positive samples from manually labeled negative samples improved from 0.45 in the first round to 0.83 in the second round, and the precision in differentiating the positive samples from random negative samples improved from 0.70 in the first round to 0.82 in the second. When our proposed AL method was used with LR, the improvements in precision followed a similar trend. However, the other AL algorithms tested did not show improved precision, largely because of biased samples caused by the uncertainty sampling or differences between the training and validation data sets.
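
The two ingredients this abstract combines, uncertainty sampling plus random negative sampling, and precision measured at a fixed recall of 0.99, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of a score of 0.5 as the point of greatest uncertainty, and the toy data are all assumptions made for illustration.

```python
import math
import random


def precision_at_recall(scores, labels, target_recall=0.99):
    """Return (precision, threshold) at the strictest score threshold
    whose recall reaches target_recall. labels are 1 (positive) or 0."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Number of positives that must be retrieved (small epsilon guards
    # against floating-point overshoot in the product).
    needed = math.ceil(target_recall * n_pos - 1e-9)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before checking recall.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp >= needed:
            return tp / (tp + fp), threshold
    raise ValueError("target_recall must be at most 1")


def select_next_round(pool_scores, n_uncertain, n_random, seed=0):
    """Pick unlabeled items for the next AL round (hypothetical helper).

    pool_scores maps an abstract id to the current classifier score.
    Returns (ids to label manually, ids used as presumed negatives).
    """
    # Uncertainty sampling: scores closest to 0.5 are least certain (assumed).
    by_uncertainty = sorted(pool_scores, key=lambda k: abs(pool_scores[k] - 0.5))
    to_label = by_uncertainty[:n_uncertain]
    # Random negative sampling: the pool is overwhelmingly negative, so a
    # random draw from the rest is used as negatives without manual review.
    chosen = set(to_label)
    rest = [k for k in pool_scores if k not in chosen]
    rng = random.Random(seed)
    presumed_negative = rng.sample(rest, min(n_random, len(rest)))
    return to_label, presumed_negative


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    labels = [1, 1, 0, 1, 0, 0]
    print(precision_at_recall(scores, labels))
    pool = {"a": 0.51, "b": 0.95, "c": 0.45, "d": 0.05, "e": 0.30}
    print(select_next_round(pool, n_uncertain=2, n_random=2))
```

Fixing recall first and then reporting precision mirrors the stated goal of retrieving essentially all DDI toxicity-relevant abstracts; the random negatives carry some label noise, which is why the abstract treats them separately from manually labeled negatives.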

18.

A Novel Computational Model for Predicting microRNA-Disease Associations Based on Heterogeneous Graph Convolutional Networks.

Li, Chunyan; Liu, Hongju; Hu, Qian; Que, Jinlong; Yao, Junfeng.

Cells ; 8(9), 2019 Aug 26.

Article in English | MEDLINE | ID: mdl-31455028

ABSTRACT

Identifying the interactions between disease and microRNA (miRNA) can accelerate drug development, individualized diagnosis, and treatment for various human diseases. However, experimental methods are time-consuming and costly, so computational approaches to predicting latent miRNA-disease interactions are attracting increasing attention. Most previous studies have focused on designing complicated similarity-based methods to predict latent interactions between miRNAs and diseases. In this study, we propose a novel computational model, termed heterogeneous graph convolutional network for miRNA-disease associations (HGCNMDA), which is based on known human protein-protein interactions (PPIs) and integrates four biological networks: miRNA-disease, miRNA-gene, disease-gene, and PPI. HGCNMDA achieved reliable performance under leave-one-out cross-validation (LOOCV). HGCNMDA was then compared with three state-of-the-art algorithms using five-fold cross-validation, where it achieved an AUC of 0.9626 and an average precision of 0.9660, ahead of the competing algorithms. We further analyze the top-10 unknown interactions between miRNA and disease. In summary, HGCNMDA is a useful computational model for predicting miRNA-disease interactions.

Subjects

Computational Biology/methods , Genetic Predisposition to Disease , MicroRNAs/genetics , Algorithms , Area Under Curve , Computer Simulation , Databases, Genetic , Genetic Association Studies , Humans , Protein Interaction Maps
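
The abstract for this entry reports AUC and average precision under five-fold cross-validation. As a reading aid, the sketch below shows how those two metrics and a five-fold split are commonly computed. It is a generic illustration using only the Python standard library, not the HGCNMDA evaluation code; the function names and toy scores are assumptions.

```python
import random


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values taken at the rank of each positive.
    Tied scores are ranked in input order (no special tie handling)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    if tp == 0:
        raise ValueError("need at least one positive example")
    return total / tp


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    labels = [1, 1, 0, 1, 0, 0]
    print(roc_auc(scores, labels), average_precision(scores, labels))
    print(k_fold_indices(10))
```

In a cross-validated setting each fold is held out once, the model is trained on the remaining folds, and the metrics are computed on the held-out scores and then averaged.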

19.

Training host-pathogen protein-protein interaction predictors.

Basit, Abdul Hannan; Abbasi, Wajid Arshad; Asif, Amina; Gull, Sadaf; Minhas, Fayyaz Ul Amir Afsar.

J Bioinform Comput Biol ; 16(4): 1850014, 2018 Aug.

Article in English | MEDLINE | ID: mdl-30060698

ABSTRACT

Detection of protein-protein interactions (PPIs) plays a vital role in molecular biology. In particular, pathogenic infections are caused by interactions between host and pathogen proteins, so identifying host-pathogen interactions (HPIs) is important for discovering new drugs to counter infectious diseases. Conventional wet-lab PPI detection techniques are limited in terms of cost and large-scale application. Hence, computational approaches are developed to predict PPIs. This study aims to develop machine learning models to predict inter-species PPIs, with a special interest in HPIs. Specifically, we focus on three questions that arise while developing an HPI predictor: (1) How should negative training examples be selected? (2) Does assigning sample weights to individual negative examples based on their similarity to positive examples improve generalization performance? (3) How large should the set of negative samples be relative to the positive samples during training and evaluation? We compare two available methods for negative sampling, random and DeNovo sampling, and our experiments show that DeNovo sampling offers better accuracy. However, our experiments also show that generalization performance can be improved further by a soft DeNovo approach that, during training, assigns negative examples sample weights inversely proportional to their similarity to known positive examples. Based on our findings, we have also developed an HPI predictor called HOPITOR (Host-Pathogen Interaction Predictor) that can predict interactions between human and viral proteins. The HOPITOR web server can be accessed at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#HoPItor .

Subjects

Computational Biology/methods , Host-Pathogen Interactions/physiology , Protein Interaction Mapping/methods , Software , Viral Proteins/metabolism , Area Under Curve , Computer Simulation , Databases, Protein , Internet , Machine Learning , Random Allocation , STAT1 Transcription Factor/metabolism , STAT2 Transcription Factor/metabolism
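
The soft DeNovo idea in this entry's abstract, weighting each negative example inversely to its similarity to known positives, can be sketched as follows. This is an illustration under stated assumptions, not the HOPITOR code: cosine similarity over feature vectors, the smoothing constant `eps`, and the normalization to a maximum weight of 1 are all choices made here, and the abstract does not specify the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_negative_weights(negatives, positives, eps=0.1):
    """Weight each negative inversely to its similarity to the closest positive.

    eps (an assumed smoothing constant) keeps the weight finite when the
    similarity is zero. Weights are rescaled so the largest equals 1.
    """
    if not positives:
        raise ValueError("need at least one positive example")
    raw = []
    for neg in negatives:
        # Similarity to the nearest known positive, clipped at zero.
        sim = max(0.0, max(cosine(neg, pos) for pos in positives))
        raw.append(1.0 / (sim + eps))
    top = max(raw)
    return [w / top for w in raw]


if __name__ == "__main__":
    positives = [[1.0, 0.0]]
    negatives = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # The negative identical to a positive gets the smallest weight,
    # the orthogonal one gets the largest.
    print(soft_negative_weights(negatives, positives))
```

Weights of this kind can be passed as per-sample weights to a classifier's training routine, so that negatives resembling known positives, which are more likely to be unrecorded true interactions, contribute less to the loss.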
