Abstract
Subcellular localization of messenger ribonucleic acid (mRNA) is a universal mechanism for precise and efficient control of translation. Although many computational methods have been developed for predicting mRNA subcellular localization, very few are designed to handle multiple localization annotations, and their generalization performance could be improved. In this study, the prediction model MSlocPRED was constructed to identify multi-label mRNA subcellular localization. First, the preprocessed Dataset 1 and Dataset 2 were transformed into images. The proposed MDNDO-SMDU resampling technique was then used to balance the number of samples in each category of the training dataset. Finally, deep transfer learning was used to construct the predictive model MSlocPRED, which identifies subcellular localization for 16 classes (Dataset 1) and 18 classes (Dataset 2). Comparative tests of different resampling techniques show that the resampling technique proposed in this study is more effective in preprocessing for subcellular localization. Prediction results on datasets constructed by taking different lengths from the NC ends (the 5' and 3' untranslated regions that flank the protein-coding sequence and influence mRNA function without encoding proteins themselves) show that, for both Dataset 1 and Dataset 2, prediction performance is best when 35 nucleotides are taken from the NC end. Both independent testing and five-fold cross-validation comparisons with established prediction tools show that MSlocPRED is significantly better than those tools at identifying multi-label mRNA subcellular localization. Additionally, SHapley Additive exPlanations (SHAP) was used to explain how the MSlocPRED model works during prediction.
The predictive model and associated datasets are available on GitHub: https://github.com/ZBYnb1/MSlocPRED/tree/main.
Subjects
Computational Biology, Deep Learning, Messenger RNA, Messenger RNA/genetics, Messenger RNA/metabolism, Computational Biology/methods, Humans, Software, Algorithms
Abstract
Current methods of molecular image-based drug discovery face two major challenges: (1) working effectively in the absence of labels, and (2) capturing chemical structure from implicitly encoded images. Given that chemical structures are explicitly encoded by molecular graphs (such as nitrogen, benzene rings and double bonds), we leverage self-supervised contrastive learning to transfer chemical knowledge from graphs to images. Specifically, we propose a novel Contrastive Graph-Image Pre-training (CGIP) framework for molecular representation learning, which learns explicit information in graphs and implicit information in images from large-scale unlabeled molecules via carefully designed intra- and inter-modal contrastive learning. We evaluate CGIP in multiple experimental settings (molecular property prediction, cross-modal retrieval and distribution similarity). The results show that CGIP achieves state-of-the-art performance on all 12 benchmark datasets and transfers chemical knowledge from graphs to molecular images, enabling the image encoder to perceive chemical structures in images. We hope this simple and effective framework will inspire people to think about the value of images for molecular representation learning.
Subjects
Benchmarking, Learning, Humans, Drug Discovery
Abstract
Potential miRNA-disease associations (MDA) play an important role in the discovery of complex human disease etiology. Therefore, MDA prediction is an attractive research topic in the field of biomedical machine learning. Recently, several models have been proposed for this task, but their performance is limited by over-reliance on relevant network information with noisy graph structure connections. Moreover, the application of self-supervised graph structure learning to MDA tasks remains unexplored. Our study is the first to use multi-view self-supervised contrastive learning (MSGCL) for MDA prediction. Specifically, we generated a learner view without association labels of miRNAs and diseases as input, and utilized the known association network to generate an anchor view that provides guiding signals for the learner view. The graph structure was optimized by designing a contrastive loss to maximize the consistency between the anchor and learner views. Our model is similar to a pre-trained model that continuously optimizes upstream tasks for high-quality association graph topology, thereby enhancing the latent representation of association predictions. The experimental results show that our proposed method outperforms state-of-the-art methods by 2.79% and 3.20% in area under the receiver operating characteristic curve (AUC) and area under the precision/recall curve (AUPR), respectively.
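The idea of a contrastive loss that maximizes consistency between an anchor view and a learner view can be sketched as follows. This is only an illustration: the abstract does not state the exact loss, so an InfoNCE-style form with cosine similarity is assumed here, and the names `contrastive_loss`, `anchor`, `learner` and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, learner, temperature=0.5):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    Node i in the anchor view is the positive for node i in the learner
    view; every other learner node acts as a negative.
    """
    n = len(anchor)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchor[i], learner[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: two nodes, two dimensions.
anchor_view = [[1.0, 0.0], [0.0, 1.0]]
aligned_learner = [[1.0, 0.0], [0.0, 1.0]]
shuffled_learner = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(anchor_view, aligned_learner))
print(contrastive_loss(anchor_view, shuffled_learner))
```

The loss is lower when the two views agree node by node, which is the consistency that the optimization is described as maximizing.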
Subjects
Machine Learning, MicroRNAs, Humans, Area Under Curve, MicroRNAs/genetics, ROC Curve
Abstract
MOTIVATION: Molecular representation learning plays an indispensable role in crucial tasks such as property prediction and drug design. Despite the notable achievements of molecular pre-training models, current methods often fail to capture both the structural and feature semantics of molecular graphs. Moreover, while graph contrastive learning has opened new prospects, existing augmentation techniques often struggle to retain the core semantics of the graphs they transform. To overcome these limitations, we propose a gradient-compensated encoder parameter perturbation approach, ensuring efficient and stable feature augmentation. By merging enhancement strategies grounded in attribute masking and parameter perturbation, we introduce MoleMCL, a new MOLEcular pre-training model based on multi-level contrastive learning. RESULTS: Experimental results demonstrate that MoleMCL adeptly dissects the structure and feature semantics of molecular graphs, surpassing current state-of-the-art models in molecular prediction tasks and opening a novel avenue for molecular modeling. AVAILABILITY AND IMPLEMENTATION: The code and data underlying this work are available on GitHub at https://github.com/BioSequenceAnalysis/MoleMCL.
Subjects
Drug Design, Semantics
Abstract
MOTIVATION: In the field of drug discovery, accurately and effectively predicting the binding affinity between proteins and ligands is crucial for drug screening and optimization. However, current research primarily uses sequence-based or structure-based representations to predict protein-ligand binding affinity, and protein surface information, which is crucial for protein-ligand interactions, has received relatively little study. Moreover, when dealing with multimodal protein information, traditional approaches typically concatenate features from different modalities directly without considering the heterogeneity among them, and therefore cannot effectively exploit the complementarity between modalities. RESULTS: We introduce a novel multimodal feature extraction (MFE) framework that, for the first time, incorporates information from protein surfaces, 3D structures, and sequences, and uses a cross-attention mechanism for feature alignment between different modalities. Experimental results show that our method achieves state-of-the-art performance in predicting protein-ligand binding affinity. Furthermore, we conduct ablation studies that demonstrate the effectiveness and necessity of protein surface information and multimodal feature alignment within the framework. AVAILABILITY AND IMPLEMENTATION: The source code and data are available at https://github.com/Sultans0fSwing/MFE.
Subjects
Protein Binding, Proteins, Ligands, Proteins/metabolism, Proteins/chemistry, Computational Biology/methods, Drug Discovery/methods, Algorithms, Binding Sites, Protein Databases, Protein Conformation
Abstract
The translated protein undergoes a specific modification process, which involves the formation of covalent bonds on lysine residues and the attachment of small chemical moieties. This process significantly alters the protein's fundamental physicochemical properties, and with them its 3D structure and activity, enabling it to modulate key physiological processes. This modulation encompasses inhibiting cancer cell growth, delaying ovarian aging, regulating metabolic diseases, and ameliorating depression. Consequently, the identification and comprehension of post-translational lysine modifications hold substantial value in biological research and drug development. Post-translational modifications (PTMs) at lysine (K) sites are among the most common protein modifications. However, research on K-PTMs has largely centered on identifying individual modification types, with a relative scarcity of balanced data analysis techniques. In this study, a classification system is developed for the prediction of concurrent multiple modifications at a single lysine residue. Initially, a well-established multi-label position-specific triad amino acid propensity algorithm is used for feature encoding. Subsequently, PreMLS, a novel ClusterCentroids undersampling algorithm based on MiniBatchKMeans, was introduced to eliminate redundant or similar majority-class samples, thereby mitigating class imbalance. A convolutional neural network architecture was specifically constructed for the analysis of biological sequences to predict multiple lysine modification sites. The model, evaluated through five-fold cross-validation and independent testing, was found to significantly outperform existing models such as iMul-kSite and predML-Site. The results presented here aid in prioritizing potential lysine modification sites, facilitating subsequent biological assays and advancing pharmaceutical research.
To enhance accessibility, an open-access predictive script has been crafted for the multi-label predictive model developed in this study.
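The cluster-centroid undersampling idea described in this abstract can be illustrated with a small sketch. It is an assumption-laden toy: plain Lloyd's k-means stands in for the MiniBatchKMeans variant named in the abstract, and the function names `kmeans_centroids` and `cluster_centroid_undersample` are hypothetical rather than taken from the authors' script.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_centroids(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (a stand-in for MiniBatchKMeans)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_centroid_undersample(majority, minority):
    """Replace the majority class by as many centroids as there are minority samples."""
    return kmeans_centroids(majority, k=len(minority)), minority


majority = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
minority = [[5.0, 5.0], [6.0, 6.0]]
reduced, kept = cluster_centroid_undersample(majority, minority)
print(len(reduced), len(kept))
```

Replacing similar majority-class samples with their centroids shrinks the majority class while keeping one representative per cluster, which is how redundant samples are removed in this family of methods.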
Subjects
Algorithms, Computational Biology, Lysine, Post-Translational Protein Processing, Lysine/chemistry, Lysine/metabolism, Computational Biology/methods, Proteins/chemistry, Proteins/metabolism, Humans, Protein Databases, Computer Neural Networks
Abstract
BACKGROUND: Recognition of enhancer-promoter interactions (EPIs) is crucial for human development. EPIs in the genome play a key role in regulating transcription. However, experimental approaches for classifying EPIs are too expensive in terms of effort, time, and resources. Therefore, a growing number of studies develop computational techniques, particularly deep learning and other machine learning techniques, to address this problem. Unfortunately, the majority of current computational methods are based on convolutional neural networks, recurrent neural networks, or a combination of them, which do not take into account contextual details and the long-range interactions between the enhancer and promoter sequences. A new transformer-based model called EPI-Trans is presented in this study to overcome these limitations. The multi-head attention mechanism in the transformer model automatically learns features that represent the long-range interrelationships between enhancer and promoter sequences. Furthermore, a generic model with transferability is created that can be used as a pre-trained model for various cell lines. The parameters of the generic model are then fine-tuned on a particular cell line dataset to improve performance. RESULTS: Based on the results obtained from six benchmark cell lines, the average AUROC for the specific, generic, and best models is 94.2%, 95%, and 95.7%, while the average AUPR is 80.5%, 66.1%, and 79.6%, respectively. CONCLUSIONS: This study proposed a transformer-based deep learning model for EPI prediction. The comparative results on certain cell lines show that EPI-Trans outperforms other cutting-edge techniques and can provide superior performance on the challenge of recognizing EPIs.
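The attention mechanism mentioned above can be sketched with a single head of scaled dot-product attention, the building block of multi-head attention in transformers. This is a generic illustration and not the EPI-Trans implementation; the function names and the toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Every query position attends to every key position, so distant
    positions can influence each other in a single step.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy input: one enhancer-side query over three promoter-side positions.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(attention(queries, keys, values))
```

A multi-head layer runs several such heads on different learned projections of the input and concatenates their outputs.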
Subjects
Deep Learning, Genetic Enhancer Elements, Genetic Promoter Regions, Humans, Computational Biology/methods, Cell Line, Computer Neural Networks
Abstract
The interaction between microribonucleic acid and long non-coding ribonucleic acid plays a very important role in biological processes, and predicting this interaction is of great significance to the study of its mechanism of action. Due to the limitations of traditional biological experiment methods, more and more computational methods are being applied in this field. However, existing methods often have problems, such as inadequate capture of potential sequence features due to simple coding and the need to manually extract features as input. We propose a deep learning model, preMLI, based on rna2vec pre-training and a deep feature mining mechanism. We use rna2vec to train on the ribonucleic acid (RNA) dataset and obtain the RNA word vector representation, then mine the RNA sequence features separately, and finally concatenate the two feature vectors as the input of the prediction task. preMLI performs better than existing methods on benchmark datasets and has cross-species prediction capabilities. Experiments show that both pre-training and the deep feature mining mechanism have a positive impact on the prediction performance of the model. More specifically, pre-training provides more accurate word vector representations, and the deep feature mining mechanism also improves prediction performance. Meanwhile, preMLI needs only the RNA sequence as model input and has better cross-species prediction performance than the most advanced prediction models, which makes it a useful reference for related research.
Subjects
MicroRNAs, Long Noncoding RNA, Computational Biology/methods, MicroRNAs/genetics, Long Noncoding RNA/genetics
Abstract
Drug-target interaction (DTI) prediction is important for the development of modern medicine and pharmacology. Traditional biochemical experiments for identifying DTIs face challenges including long timelines, high cost and high failure rates, which ultimately lead to low drug productivity. Chemogenomic-based computational methods can achieve high-throughput prediction. In this study, we develop a deep collaborative filtering prediction model with multiembeddings, named DCFME, which can jointly utilize multiple kinds of feature information from multiembeddings. Two different representation learning algorithms are first employed to extract heterogeneous network features. DCFME takes the generated low-dimensional dense vectors as input and then models the drug-target relationship from the perspective of both couplings and heterogeneity. In addition, the model employs focal loss, which concentrates the loss on sparse and hard samples during training. Comparative experiments with five baseline methods show that DCFME achieves more significant performance improvement on sparse datasets. Moreover, the model has better robustness and generalization capacity under several harder prediction scenarios.
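How focal loss concentrates training on hard samples can be seen in a short sketch of the standard binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The values gamma = 2.0 and alpha = 0.25 are common defaults assumed here for illustration; the abstract does not report the settings used in DCFME.

```python
import math


def focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    prob  : predicted probability of the positive class
    label : 1 for an interacting drug-target pair, 0 otherwise
    gamma : focusing parameter; larger values down-weight easy samples more
    alpha : weight of the positive class (assumed value, not from the paper)
    """
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_positive = focal_loss(0.9, 1)  # confident and correct
hard_positive = focal_loss(0.1, 1)  # confident and wrong
print(easy_positive, hard_positive)
```

With gamma = 0 and alpha = 0.5 the expression reduces to half of the ordinary cross-entropy, so the factor (1 - p_t)^gamma is what shifts the loss toward hard samples.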
Subjects
Algorithms, Drug Development, Drug Development/methods
Abstract
Posttranslational modification of lysine residues (K-PTM) is one of the most common PTMs. Some lysine residues in proteins can undergo successive or cascaded covalent modifications, such as acetylation, crotonylation, methylation and succinylation. The covalent modification of lysine residues may have special functions relevant to basic research and drug development. Although many computational methods have been developed to predict lysine PTMs, existing K-PTM prediction methods have so far modeled only a single class of K-PTM. This study aims to fill this gap by building a multi-label computational model that can be used directly to predict multiple K-PTMs in proteins. A multi-label prediction model, MLysPRED, is proposed to identify multiple lysine sites using features generated from human protein sequences. In MLysPRED, three kinds of multi-label sequence encoding algorithms (MLDBPB, MLPSDAAP, MLPSTAAP) are proposed and combined with three encoding strategies (CHHAA, DR and Kmer) to convert preprocessed lysine sequences into effective numerical features. A multidimensional normal distribution oversampling technique and a graph-based multi-view clustering under-sampling algorithm are proposed for the first time and incorporated to rebalance the class proportions of the original training samples, and a multi-label nearest neighbor algorithm is used for classification. MLysPRED achieved an Aiming of 92.21%, Coverage of 94.98%, Accuracy of 89.63%, Absolute-True of 81.46% and Absolute-False of 0.0682 on the independent datasets. Additionally, comparison with five existing predictors indicated that MLysPRED is very promising for predicting multiple K-PTMs in proteins. For the convenience of experimental scientists, MLysPRED has been deployed as a user-friendly web server at http://47.100.136.41:8181.
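The five multi-label metrics reported above can be computed as in the sketch below. The abstract does not define them, so the set-based definitions commonly used for multi-label predictors are assumed; the function name and the toy labels are hypothetical.

```python
def multilabel_metrics(true_sets, pred_sets, num_labels):
    """Aiming, Coverage, Accuracy, Absolute-True and Absolute-False.

    true_sets, pred_sets : one set of labels per sample
    num_labels           : total number of possible labels
    """
    n = len(true_sets)
    aiming = coverage = accuracy = absolute_true = absolute_false = 0.0
    for y, z in zip(true_sets, pred_sets):
        y, z = set(y), set(z)
        inter, union = len(y & z), len(y | z)
        aiming += inter / len(z) if z else 0.0       # share of predictions that are correct
        coverage += inter / len(y) if y else 0.0     # share of true labels that are found
        accuracy += inter / union if union else 1.0  # overlap over union
        absolute_true += 1.0 if y == z else 0.0      # exact match of the label set
        absolute_false += (union - inter) / num_labels
    return {
        "aiming": aiming / n,
        "coverage": coverage / n,
        "accuracy": accuracy / n,
        "absolute_true": absolute_true / n,
        "absolute_false": absolute_false / n,
    }


true_labels = [{"acetylation"}, {"acetylation", "methylation"}]
predicted = [{"acetylation"}, {"acetylation"}]
print(multilabel_metrics(true_labels, predicted, num_labels=4))
```

Higher is better for the first four metrics, while Absolute-False is an error rate, which is why it is reported as a small fraction rather than a percentage close to 100.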
Subjects
Lysine, Proteins, Algorithms, Cluster Analysis, Computational Biology/methods, Humans, Lysine/metabolism, Normal Distribution, Post-Translational Protein Processing, Proteins/chemistry
Abstract
Identifying new lead molecules to treat cancer requires more than a decade of dedicated effort. Before selected drug candidates are used in the clinic, their anti-cancer activity is generally validated by in vitro cellular experiments. Therefore, accurate prediction of cancer drug response is a critical and challenging task for anti-cancer drug design and precision medicine. With the development of pharmacogenomics, the combination of efficient drug feature extraction methods and omics data has made it possible to use computational models to assist in drug response prediction. In this study, we propose DeepTTA, a novel end-to-end deep learning model that uses a transformer for drug representation learning and a multilayer neural network for transcriptomic data to predict anti-cancer drug responses. Specifically, DeepTTA uses transcriptomic gene expression data and chemical substructures of drugs for drug response prediction. Compared to existing methods, DeepTTA achieved higher performance in terms of root mean square error, Pearson correlation coefficient and Spearman's rank correlation coefficient on multiple test sets. Moreover, we discovered that the anti-cancer drugs bortezomib and dactinomycin provide a potential therapeutic option with multiple clinical indications. With its excellent performance, DeepTTA is expected to be an effective method in cancer drug design.
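The three evaluation measures named in this abstract can be computed as in the following sketch. It uses the textbook definitions with toy numbers; none of the values come from the DeepTTA experiments, and the helper names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted responses."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient (linear association)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]  # monotone in the measurements but not linear
print(rmse(measured, predicted), pearson(measured, predicted), spearman(measured, predicted))
```

Reporting all three is informative because they capture different things: RMSE measures absolute error, Pearson measures linear agreement, and Spearman measures whether the ranking of responses is preserved.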
Subjects
Antineoplastic Agents, Neoplasms, Antineoplastic Agents/pharmacology, Antineoplastic Agents/therapeutic use, Humans, Neoplasms/drug therapy, Neoplasms/genetics, Computer Neural Networks, Precision Medicine/methods, Transcriptome
Abstract
Accumulation of coal gangue (CG) poses significant risks to both human health and ecological systems, underscoring the urgent need for sustainable methods to utilize this abundant waste material effectively. In this study, we successfully screened and isolated the Bacillus velezensis bacterial strain to enhance the solubilization of potassium (K) and phosphorus (P) nutrients from CG. The study employed XRD, FTIR, SEM, and HPLC techniques to investigate the underlying mechanisms of CG solubilization. Various parameters such as CG particle size, incubation time, initial inoculation ratio, concentration of CG, pH, and temperature were optimized to maximize solubilization efficiency. The Bacillus velezensis bacterium can dissolve minerals of CG by adhering to its surface and secreting various kinds of organic acids, particularly succinic acid. Pot experiments further demonstrated that Bacillus velezensis, in conjunction with CG, promotes alfalfa growth. These combined findings suggest that Bacillus velezensis and coal gangue hold promising potential as mineral soil conditioners, effectively enhancing plant growth. This approach presents a viable alternative for the environmentally responsible utilization of CG, addressing both ecological concerns and agricultural sustainability. IMPORTANCE: Coal gangue piles not only occupy significant amounts of arable land but also cause serious environmental pollution. Therefore, finding sustainable methods for the clean utilization of CG is imperative. Although previous studies have shown that bacteria can promote the solubilization of available phosphorus and available potassium from CG, their impact on promoting plant growth remains understudied. To our knowledge, this study is the first to demonstrate the potential of Bacillus velezensis in enhancing the effectiveness of CG as a mineral fertilizer to support alfalfa growth. The evidence presented in this study provides an ecological strategy for the utilization of CG.
Abstract
The powerful combination of large-scale drug-related interaction networks and deep learning provides new opportunities for accelerating the process of drug discovery. However, chemical structures that play an important role in drug properties and high-order relations that involve a greater number of nodes are not tackled in current biomedical networks. In this study, we present a general hypergraph learning framework, which introduces Drug-Substructures relationship into Molecular interaction Networks to construct the micro-to-macro drug-centric heterogeneous network (DSMN), and develop a multi-branch HyperGraph learning model, called HGDrug, for Drug multi-task predictions. HGDrug achieves highly accurate and robust predictions on 4 benchmark tasks (drug-drug, drug-target, drug-disease, and drug-side-effect interactions), outperforming 8 state-of-the-art task-specific models and 6 general-purpose conventional models. Experimental analysis verifies the effectiveness and rationality of the HGDrug model architecture as well as the multi-branch setup, and demonstrates that HGDrug is able to capture the relations between drugs associated with the same functional groups. In addition, our proposed drug-substructure interaction networks can help improve the performance of existing network models for drug-related prediction tasks.
Subjects
Algorithms, Drug-Related Side Effects and Adverse Reactions, Humans, Benchmarking, Drug Delivery Systems, Drug Discovery
Abstract
Developing selective and sensitive fluorescent probes for detecting glutathione (GSH) concentration and intracellular distribution is of great significance for early diagnosis and treatment of diseases such as liver injury and cancer, since GSH plays irreplaceable roles in regulating intracellular redox homeostasis. Herein, we present a new fluorescent probe that can be specifically activated by GSH through a covalent-assembly approach induced by conjugate addition and hydrolysis, achieving fluorescence off-on sensing with zero background interference. In addition, the probe exhibited prominent selectivity and sensitivity, a low detection limit and low cytotoxicity, thus successfully realizing specific real-time monitoring and tracking of GSH levels in living cells. This work might therefore provide a promising candidate for validating the function of GSH in various physiological and pathological processes, which is beneficial for early diagnosis and therapeutics of related diseases.
Subjects
Fluorescent Dyes, Glutathione, Fluorescent Dyes/chemistry, Fluorescent Dyes/chemical synthesis, Glutathione/analysis, Glutathione/metabolism, Humans, Optical Imaging, HeLa Cells, Molecular Structure, Cell Survival/drug effects, Limit of Detection
Abstract
Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness.
For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
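Two of the feature types named in this abstract, AAC and KMER, can be illustrated with a short sketch. It assumes the usual definitions (amino acid composition as per-residue frequencies, and k-mer frequencies over a sliding window); the function names, the choice k = 2 and the example peptide are illustrative and not taken from the Glypred code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino acid composition: frequency of each residue, a 20-dimensional vector."""
    return [sequence.count(residue) / len(sequence) for residue in AMINO_ACIDS]


def kmer_frequencies(sequence, k=2):
    """Frequency of every possible k-mer over a sliding window (20**k dimensions)."""
    kmers = ["".join(letters) for letters in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1
    for start in range(windows):
        piece = sequence[start:start + k]
        if piece in counts:  # skip windows containing non-standard residues
            counts[piece] += 1
    return [counts[kmer] / windows for kmer in kmers]


peptide = "AAKK"  # hypothetical fragment around a lysine site
features = aac(peptide) + kmer_frequencies(peptide, k=2)
print(len(features))
```

Concatenating encodings in this way produces a long and partly redundant feature vector, which is the situation the feature selection step described above is meant to address.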
Subjects
Lysine, Glycosylation, Lysine/chemistry, Lysine/metabolism, Proteins/chemistry, Proteins/metabolism, Machine Learning, Computational Biology/methods, Humans, Computer Neural Networks, Protein Databases
Abstract
This study investigates biomethane production and the variation in microbial community and coal molecular structures using gas chromatography, 16S rRNA high-throughput sequencing and Fourier transform infrared spectroscopy. Additionally, the factors influencing microbial community structure at a molecular level are discussed. The results demonstrate that bituminous coal exhibits a higher biomethane yield than anthracite coal. In bituminous coal samples, Escherichia and Proteiniphilum are the predominant bacteria at day 0, while Macellibacteroides dominates from days 5 to 35. Methanofollis is the dominant archaeon during days 0 to 15, followed by Methanosarcina on day 35. In anthracite coal samples, Soehngenia is the dominant bacterial genus at day 0; the community then shifts to mainly Soehngenia and Aminobacterium within days 5-15 and to Acetomicrobium on day 35. Methanocorpusculum predominates in archaeal communities during days 0-15, giving way to Methanosarcina on day 35. Alpha diversity analysis reveals that bacterial communities have higher species abundance and diversity than archaeal communities. Redundancy analysis indicates a significant correlation between coal molecular structure and bacterial community composition (P value < 0.05), whereas no significant correlation is found with archaeal community composition (P value > 0.05). The research findings provide theoretical support for revealing the biological gasification mechanisms of coal.
Subjects
Archaea, Bacteria, Biofuels, Coal, Archaea/genetics, Archaea/metabolism, Bacteria/genetics, Bacteria/classification, Bacteria/metabolism, Biofuels/microbiology, 16S Ribosomal RNA/genetics, Microbiota, Methane/metabolism
Abstract
As an efficient nitrogen removal process, heterotrophic-sulfur autotrophic denitrification (HSAD) has attracted extensive attention in wastewater treatment. However, the effects of magnetic powder (Fe3O4) on electron transport activity in the HSAD process remain unclear. Therefore, in this study, a heterotrophic-sulfur autotrophic denitrification system was established to remove nitrogen from marine recirculating aquacultural wastewater and to evaluate the effects of Fe3O4. At the optimal Fe3O4 concentration of 50 mg/L, the nitrogen removal efficiency reached 100% with lower sulfate accumulation, and the start-up time was shortened. Assays of denitrifying enzymes and electron transport system activity showed that Fe3O4 improved the activities of nitrate and nitrite reductases and increased the efficiency of electron transport. Microbial community analysis revealed that Fe3O4 enriched the heterotrophic denitrifier Thauera and the sulfur autotrophic denitrifier Candidatus Thiobios, and thus enhanced denitrification efficiency. This study demonstrated that Fe3O4 is an efficient denitrification accelerator in HSAD for treating marine recirculating aquacultural wastewater.
Abstract
Enhancer-promoter interactions (EPIs) play an important role in transcriptional regulation. Recently, machine learning-based methods have been widely used in the genome-scale identification of EPIs due to their promising predictive performance. In this paper, we propose a novel method, termed EPI-DLMH, for predicting EPIs using DNA sequences only. EPI-DLMH consists of three major steps. First, a two-layer convolutional neural network is used to learn local features, and a bidirectional gated recurrent unit network is used to capture long-range dependencies in the sequences of promoters and enhancers. Second, an attention mechanism is used to focus on relatively important features. Finally, a matching heuristic mechanism is introduced to explore the interaction between enhancers and promoters. We use benchmark datasets to evaluate the proposed method and compare it with existing methods. Comparative results show that our model is superior to existing models in multiple cell lines. Specifically, we found that the matching heuristic mechanism introduced into the proposed model is the main contributor to the improvement in overall accuracy. Additionally, compared with existing models, our model is more efficient with regard to computational speed.
Subjects
Deep Learning, Genetic Enhancer Elements, Genetic Models, Genetic Promoter Regions, Computational Biology, HeLa Cells, Heuristics, Human Umbilical Vein Endothelial Cells, Humans, K562 Cells
Abstract
As the best substitute for antibiotics, antimicrobial peptides (AMPs) have important research significance. Due to the high cost and difficulty of experimental methods for identifying AMPs, a growing body of research focuses on using computational methods to solve this problem. Most existing computational methods can identify AMPs from the sequence itself, but there is still room for improvement in recognition accuracy, and the constructed models do not generalize across datasets. The pre-training strategy has been applied to many tasks in natural language processing (NLP) and has achieved gratifying results. It also has great application prospects in the field of AMP recognition and prediction. In this paper, we apply the pre-training strategy to the training of AMP classifiers and propose a novel recognition algorithm. Our model is built on the BERT model, pre-trained with protein data from UniProt, and then fine-tuned and evaluated on six AMP datasets that differ substantially from one another. Our model is superior to existing methods and achieves accurate identification on datasets with small sample sizes. We try different word segmentation methods for peptide chains and demonstrate the influence of pre-training steps and dataset balancing on recognition performance. We find that pre-training on a large amount of diverse AMP data, followed by fine-tuning on new data, is beneficial for capturing both the specific features of the new data and the common features shared by AMP sequences. Finally, we construct a new AMP dataset, on which we train a general AMP recognition model.
Subjects
Algorithms, Antimicrobial Peptides/chemistry, Computational Biology/methods, Natural Language Processing, Software, Antimicrobial Peptides/pharmacology, Genetic Databases, Reproducibility of Results
Abstract
The increase in biological data and the formation of various biomolecule interaction databases enable us to obtain diverse biological networks. These biological networks provide a wealth of raw material for further understanding of biological systems, the discovery of complex diseases and the search for therapeutic drugs. However, the increase in data also makes biological network analysis more difficult. Therefore, algorithms that can handle large, heterogeneous and complex data are needed to better analyze these network structures and mine their useful information. Deep learning is a branch of machine learning that extracts more abstract features from a larger set of training data. By building an artificial neural network with a hierarchical structure, deep learning can extract and screen the input information layer by layer and has representation learning ability. Improved deep learning algorithms can process complex and heterogeneous graph data structures and are increasingly being applied to the mining of network data. In this paper, we first introduce the deep learning models used for network data. Afterwards, we summarize the application of deep learning to biological networks. Finally, we discuss the future development prospects of this field.