ABSTRACT
Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Predicting gene-disease associations (GDAs) requires time-consuming and expensive experiments on a large number of potential candidate genes. Although methods based on traditional machine learning algorithms and graph neural networks have been proposed to predict associations between genes and diseases, these approaches struggle to capture the deep semantic information within genes and diseases and are heavily dependent on training data. To alleviate this issue, we propose a novel GDA prediction model named FusionGDA, which uses a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. The fusion module generates multi-modal representations that carry rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. Subsequently, a pooling aggregation strategy is adopted to compress the dimensions of the multi-modal representation. In addition, FusionGDA's pre-training phase leverages a contrastive learning loss to extract latent gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively.
The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
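The abstract names a contrastive learning loss over paired gene (protein-sequence) and disease (description) embeddings but does not spell it out. Below is a minimal, hypothetical sketch of a one-directional InfoNCE objective over such pairs, assuming cosine similarity and a temperature hyperparameter; the actual FusionGDA loss may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(gene_vecs, disease_vecs, temperature=0.1):
    """InfoNCE loss: the i-th gene embedding should be most similar to
    the i-th disease embedding among all candidate disease embeddings."""
    n = len(gene_vecs)
    loss = 0.0
    for i in range(n):
        sims = [cosine(gene_vecs[i], d) / temperature for d in disease_vecs]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_denom)  # negative log-likelihood of the true pair
    return loss / n
```

Aligned gene-disease pairs yield a lower loss than mismatched ones, which is the signal contrastive pre-training exploits.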
Subject(s)
Machine Learning, Humans, Computational Biology/methods, Genetic Predisposition to Disease, Semantics, Algorithms, Genetic Association Studies, Neural Networks (Computer)
ABSTRACT
Protein-DNA interaction is critical for life activities such as replication, transcription and splicing. Identifying protein-DNA binding residues is essential for modeling their interaction and for downstream studies. However, developing accurate and efficient computational methods for this task remains challenging. Improvements in this area have the potential to drive novel applications in biotechnology and drug design. In this study, we propose a novel approach called Contrastive Learning And Pre-trained Encoder (CLAPE), which combines a pre-trained protein language model with contrastive learning to predict DNA-binding residues. We trained the CLAPE-DB model on a protein-DNA binding sites dataset and evaluated its performance and generalization ability through various experiments. The results showed that the area under the ROC curve of the CLAPE-DB model on the two benchmark datasets reached 0.871 and 0.881, respectively, indicating superior performance compared to other existing models. CLAPE-DB showed better generalization ability and was specific to DNA-binding sites. In addition, we trained CLAPE on different protein-ligand binding site datasets, demonstrating that CLAPE is a general framework for binding site prediction. For the benefit of the scientific community, the benchmark datasets and code are freely available at https://github.com/YAndrewL/clape.
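CLAPE-DB is evaluated by the area under the ROC curve (the 0.871 and 0.881 figures above). For reference, AUROC can be computed directly from per-residue scores via the rank-sum (Mann-Whitney U) formulation; the function below is a generic illustration, not code from the CLAPE repository.

```python
def auroc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive is scored above a randomly chosen negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```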
Subject(s)
Benchmarking, Learning, Binding Sites, Drug Design, Language, Protein Binding
ABSTRACT
The accurate prediction of the effect of amino acid mutations on protein-protein interactions (PPI $\Delta \Delta G$) is a crucial task in protein engineering, as it offers insight into the biological processes underpinning protein binding and provides a basis for further drug discovery. In this study, we propose MpbPPI, a novel multi-task pre-training-based geometric equivariance-preserving framework to predict PPI $\Delta \Delta G$. Pre-training on a strictly screened pre-training dataset is employed to address the scarcity of protein-protein complex structures annotated with PPI $\Delta \Delta G$ values. MpbPPI employs a multi-task pre-training technique, forcing the framework to learn comprehensive backbone and side-chain geometric regularities of protein-protein complexes at different scales. After pre-training, MpbPPI can generate high-quality representations capturing the effective geometric characteristics of labeled protein-protein complexes for downstream $\Delta \Delta G$ predictions. MpbPPI serves as a scalable framework supporting different sources of mutant-type (MT) protein-protein complexes for flexible application. Experimental results on four benchmark datasets demonstrate that MpbPPI is a state-of-the-art framework for PPI $\Delta \Delta G$ predictions. The data and source code are available at https://github.com/arantir123/MpbPPI.
Subject(s)
Amino Acids, Benchmarking, Mutation, Drug Discovery, Learning
ABSTRACT
Protein-ligand binding affinity (PLBA) prediction is a fundamental task in drug discovery. Recently, various deep learning-based models have predicted binding affinity by incorporating the three-dimensional (3D) structure of protein-ligand complexes as input, achieving remarkable progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still limited. Although a vast amount of affinity data is available in large-scale databases such as ChEMBL, issues such as inconsistent affinity measurement labels (i.e. IC50, Ki, Kd), differing experimental conditions, and the lack of available 3D binding structures complicate the development of high-precision affinity prediction models from these data. To address these issues, we (i) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction; and (ii) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked 3D structures. By introducing multi-task pre-training that treats the prediction of different affinity labels as different tasks and classifies relative rankings between samples from the same bioassay, MBP learns robust and transferable structural knowledge from our new ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the capability of MBP on the structure-based PLBA prediction task. To the best of our knowledge, MBP is the first affinity pre-training model and shows great potential for future development. The MBP web server is freely available at: https://huggingface.co/spaces/jiaxianustc/mbp.
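MBP's pre-training classifies relative rankings between samples from the same bioassay. A hypothetical pure-Python sketch of such an intra-assay pairwise ranking loss follows; the record layout, grouping key and logistic loss form are assumptions for illustration, not MBP's actual implementation.

```python
import math
from collections import defaultdict

def assay_ranking_loss(records):
    """Logistic pairwise ranking loss computed only WITHIN each bioassay,
    so that samples measured under different conditions are never compared.
    Each record is (assay_id, predicted_score, measured_affinity)."""
    by_assay = defaultdict(list)
    for assay_id, pred, label in records:
        by_assay[assay_id].append((pred, label))
    loss, pairs = 0.0, 0
    for items in by_assay.values():
        for i in range(len(items)):
            for j in range(len(items)):
                pi, yi = items[i]
                pj, yj = items[j]
                if yi > yj:  # sample i should be ranked above sample j
                    loss += math.log(1.0 + math.exp(pj - pi))
                    pairs += 1
    return loss / max(pairs, 1)
```

Because pairs are formed only inside one assay, labels measured under different conditions (e.g. IC50 vs. Ki, different labs) are never compared directly, which is how varied and noisy labels become usable.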
Subject(s)
Drug Discovery, Proteins, Ligands, Proteins/chemistry, Protein Binding, Affinity Labels
ABSTRACT
With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases has been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from the extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method based on the pre-training and fine-tuning strategy, aimed at automatically extracting RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). Initially, we constructed a dedicated RD corpus by manual curation from the literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (comprising 500 positive and 500 negative sentences) for training and evaluating RDscan. Subsequently, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible web server that assists users in extracting RD relationships from text. In summary, RDscan represents the first text mining tool specifically designed for RD relationship extraction and is poised to become an invaluable tool for researchers exploring the intricate interactions between RNA and diseases. The RDscan web server is freely available at https://cellknowledge.com.cn/RDscan/.
Subject(s)
Data Mining, RNA, Data Mining/methods, RNA/genetics, Humans, Machine Learning, Disease/genetics, Support Vector Machine, Software
ABSTRACT
A good understanding of protein function and structure in computational biology advances our understanding of human biology. To cope with the limited number of proteins that are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, proteins are usually represented as individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics that exist in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer that learns protein representations with local fragment patterns. Then, a novel framework for the deep pre-training model is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms previous methods. We also provide detailed ablation studies and analyses of our protein tokenizer and training framework.
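The unsupervised protein tokenizer above learns tokens for frequent local fragments. One classic way to do this is byte-pair encoding (BPE) over the amino-acid alphabet; the sketch below shows BPE merge learning in pure Python as an illustration only (SPRoBERTa's actual tokenizer may use a different algorithm).

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    """Learn byte-pair-encoding merges over amino-acid sequences, so that
    frequent local fragments (e.g. recurring motifs) become single tokens."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b
        # Apply the new merge everywhere, left to right.
        for toks in corpus:
            i, out = 0, []
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            toks[:] = out
    return merges, corpus
```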
Subject(s)
Computational Biology, Proteins, Humans, Proteins/chemistry, Computational Biology/methods, Amino Acid Sequence, Protein Secondary Structure, Amino Acids
ABSTRACT
The interaction between microRNA (miRNA) and long non-coding RNA (lncRNA) plays a very important role in biological processes, and predicting such interactions is of great significance for studying their mechanisms of action. Due to the limitations of traditional biological experimental methods, more and more computational methods are being applied to this field. However, existing methods often suffer from problems such as inadequate capture of the potential features of a sequence due to simple encoding, and the need to manually extract features as input. We propose a deep learning model, preMLI, based on rna2vec pre-training and a deep feature mining mechanism. We use rna2vec to train on an RNA dataset to obtain RNA word vector representations, then mine the RNA sequence features separately and finally concatenate the two feature vectors as the input of the prediction task. preMLI performs better than existing methods on benchmark datasets and has cross-species prediction capability. Experiments show that both the pre-training and the deep feature mining mechanisms have a positive impact on the prediction performance of the model. More specifically, pre-training provides more accurate word vector representations, and the deep feature mining mechanism also improves the prediction performance of the model. Meanwhile, preMLI only needs the RNA sequence as input and has better cross-species prediction performance than the most advanced prediction models, which is of reference value for related research.
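rna2vec-style pre-training treats overlapping k-mers as "words" before feeding them to a word2vec model. Below is a minimal k-mer tokenizer, shown as an assumed preprocessing step rather than preMLI's exact code; the default k and stride are illustrative.

```python
def kmer_sentence(seq, k=3, stride=1):
    """Split an RNA sequence into overlapping k-mer 'words' so it can be
    fed to a word2vec-style embedding model (as rna2vec does).
    DNA-style input is normalized to RNA (T -> U)."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
```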
Subject(s)
MicroRNAs, Long Noncoding RNA, Computational Biology/methods, MicroRNAs/genetics, Long Noncoding RNA/genetics
ABSTRACT
Molecular property prediction models based on machine learning algorithms have become important tools for triaging unpromising lead molecules in the early stages of drug discovery. Compared with the mainstream descriptor- and graph-based methods for molecular property prediction, SMILES-based methods can directly extract molecular features from SMILES without human expert knowledge, but they require more powerful algorithms for feature extraction and larger amounts of training data, which makes SMILES-based methods less popular. Here, we show the great potential of pre-training in promoting the prediction of important pharmaceutical properties. By utilizing three pre-training tasks based on atom feature prediction, molecular feature prediction and contrastive learning, we developed K-BERT, a new pre-training method that can extract chemical information from SMILES as chemists do. Results on 15 pharmaceutical datasets show that K-BERT outperforms well-established descriptor-based (XGBoost) and graph-based (Attentive FP and HRGCN+) models. In addition, we found that the contrastive learning pre-training task enables K-BERT to 'understand' SMILES beyond canonical SMILES. Moreover, the general fingerprints K-BERT-FP generated by K-BERT exhibit predictive power comparable to MACCS on the 15 pharmaceutical datasets and can also capture molecular size and chirality information that traditional binary fingerprints cannot. Our results illustrate the great potential of K-BERT for practical applications of molecular property prediction in drug discovery.
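K-BERT's atom- and molecular-feature prediction tasks are BERT-style masked modeling over SMILES. The snippet below builds one masked training example from a pre-tokenized SMILES string; the token list, mask rate and `[MASK]` symbol are illustrative assumptions, not K-BERT's actual pipeline.

```python
import random

def mask_smiles_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Build one masked-modeling training example from tokenized SMILES:
    returns (input_tokens, targets) where targets maps each masked
    position to the original token the model must recover."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets[i] = tok
        else:
            inputs.append(tok)
    return inputs, targets
```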
Subject(s)
Algorithms, Machine Learning, Humans, Knowledge Bases, Pharmaceutical Preparations, Research Design
ABSTRACT
SUMOylation is a protein post-translational modification that plays an essential role in cellular functions. For predicting SUMO sites, numerous researchers have proposed advanced methods based on ordinary machine learning algorithms. These reported methods have shown excellent predictive performance, but there is room for improvement. In this study, we constructed a novel deep neural network, the Residual Pyramid Network (RsFPN), and developed an ensemble deep learning predictor called iSUMO-RsFPN. Initially, three feature extraction methods were employed to extract features from samples. Following this, weak classifiers were trained based on RsFPN for each feature type. Ultimately, the weak classifiers were integrated to construct the final classifier. Moreover, the predictor underwent systematic testing on an independent test dataset, where the results demonstrated a significant improvement over existing state-of-the-art predictors. The code of iSUMO-RsFPN is freely available at https://github.com/454170054/iSUMO-RsFPN.
Subject(s)
Lysine, Sumoylation, Neural Networks (Computer), Machine Learning, Algorithms
ABSTRACT
DNA 4mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms fall short in their ability to represent DNA sequence features. In this paper, we propose a 4mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-trained DNABert-Pruning model and artificial feature encoding. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-trained model DNABert-Pruning. This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy. Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
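The artificial feature encoding module is not specified in detail above. A common hand-crafted encoding for a DNA window combines per-base one-hot vectors with global nucleotide-composition fractions, as in this hypothetical sketch (the specific feature set is an assumption, not DNABert-4mC's actual encoding):

```python
def dna_features(seq):
    """Hand-crafted feature vector for a DNA window: one-hot encoding per
    base followed by global nucleotide-composition fractions, intended to
    complement learned (pre-trained) features."""
    bases = "ACGT"
    seq = seq.upper()
    onehot = []
    for ch in seq:
        onehot.extend([1.0 if ch == b else 0.0 for b in bases])
    comp = [seq.count(b) / len(seq) for b in bases]
    return onehot + comp
```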
Subject(s)
Algorithms, DNA, DNA/genetics, Nucleotides
ABSTRACT
Symbolic music understanding is a critical challenge in artificial intelligence. While traditional symbolic music representations like MIDI capture essential musical elements, they often lack the nuanced expressive detail found in music scores. Leveraging advances in multimodal pre-training, particularly visual-language pre-training, we propose a groundbreaking approach: the Score Images as a Modality (SIM) model. This model integrates music score images alongside MIDI data for enhanced symbolic music understanding. We also introduce novel pre-training tasks, including masked bar-attribute modeling and score-MIDI matching. These tasks enable the SIM model to capture music structures and align visual and symbolic representations effectively. Additionally, we present a meticulously curated dataset of matched score images and MIDI representations optimized for training the SIM model. Through experimental validation, we demonstrate the efficacy of our approach in advancing symbolic music understanding.
ABSTRACT
This paper introduces a novel data-driven self-triggered control approach based on a hierarchical reinforcement learning framework in networked motor control systems. This approach divides the self-triggered control policy into higher and lower layers, with the higher-level policy guiding the lower-level policy in decision-making, thereby reducing the exploration space of the lower-level policy and improving the efficiency of the learning process. The data-driven framework integrates with the dual-actor critic algorithm, using two interconnected neural networks to approximate the hierarchical policies. In this framework, we use recurrent neural networks as the network architecture for the critic, utilizing the temporal dynamics of recurrent neural networks to better capture the dependencies between costs, thus enhancing the critic network's efficiency and accuracy in approximating the multi-time cumulative cost function. Additionally, we have developed a pre-training method for the control policy networks to further improve learning efficiency. The effectiveness of our proposed method is validated through a series of numerical simulations.
ABSTRACT
The echo state network (ESN) is a recurrent neural network that has yielded state-of-the-art results in many areas owing to its rapid learning ability and the fact that the weights of input neurons and hidden neurons are fixed throughout the learning process. However, the setting procedure for initializing the ESN's recurrent structure may lead to difficulties in designing a sound reservoir that matches a specific task. This paper proposes an improved pre-training method to adjust the model's parameters and topology to obtain an adaptive reservoir for a given application. Two strategies, namely global random selection and ensemble training, are introduced to pre-train the randomly initialized ESN model. Specifically, particle swarm optimization is applied to optimize chosen fixed and global weight values within the network, and the reliability and stability of the pre-trained model are enhanced by employing the ensemble training strategy. In addition, we test the feasibility of the model for time series prediction on six benchmarks and two real-life datasets. The experimental results show a clear enhancement in the ESN learning results. Furthermore, the proposed global random selection and ensemble training strategies are also applied to pre-train the extreme learning machine (ELM), which has a similar training process to the ESN model. Numerical experiments are subsequently carried out on the above-mentioned eight datasets. The experimental findings consistently show that the performance of the proposed pre-trained ELM model is also improved significantly. The suggested two strategies can thus enhance the ESN and ELM models' prediction accuracy and adaptability.
ABSTRACT
To address the increasing number of patients with depression, this paper proposes an artificial intelligence method to effectively identify depression through voice signals, with the aim of improving the efficiency of diagnosis and treatment. First, the pre-trained wav2vec 2.0 model is fine-tuned to encode and contextualize speech, thereby obtaining high-quality voice features. This model is applied to the publicly available Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) dataset. The results demonstrate a precision of 93.96%, a recall of 94.87%, and an F1 score of 94.41% for the binary depression recognition task, with an overall classification accuracy of 96.48%. For the four-class task evaluating depression severity, the precision rates are all above 92.59%, the recall rates all above 92.89%, the F1 scores all above 93.12%, and the overall classification accuracy is 94.80%. These findings indicate that the proposed method effectively enhances classification accuracy in scenarios with limited data, exhibiting strong performance in depression identification and severity evaluation. In the future, this method has the potential to serve as a valuable supportive tool for depression diagnosis.
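The precision, recall and F1 figures quoted above follow the standard binary-classification definitions, which can be reproduced from raw predictions as follows (a generic utility, not the paper's evaluation code):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for a binary classification task,
    computed from true-positive, false-positive and false-negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```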
Subject(s)
Artificial Intelligence, Depression, Humans, Depression/diagnosis, Recognition (Psychology), Speech
ABSTRACT
As the best substitute for antibiotics, antimicrobial peptides (AMPs) have important research significance. Due to the high cost and difficulty of experimental methods for identifying AMPs, more and more research focuses on computational methods for this problem. Most existing computational methods can identify AMPs from the sequence itself, but there is still room for improvement in recognition accuracy, and the constructed models often fail to generalize across datasets. The pre-training strategy has been applied to many tasks in natural language processing (NLP) with promising results, and it also has great application prospects in AMP recognition and prediction. In this paper, we apply the pre-training strategy to the training of AMP classifiers and propose a novel recognition algorithm. Our model is built on the BERT architecture, pre-trained with protein data from UniProt, and then fine-tuned and evaluated on six AMP datasets with large differences. Our model outperforms existing methods and achieves accurate identification on datasets with small sample sizes. We try different word segmentation methods for peptide chains and examine the influence of pre-training steps and dataset balancing on recognition performance. We find that pre-training on a large amount of diverse AMP data, followed by fine-tuning on new data, is beneficial for capturing both the new data's specific features and the features common to AMP sequences. Finally, we construct a new AMP dataset, on which we train a general AMP recognition model.
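One of the dataset-balancing strategies examined above can be as simple as random oversampling of the minority class. The sketch below is an assumed illustration of that idea, not the paper's actual pipeline:

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Balance a binary dataset by randomly duplicating minority-class
    samples until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [(s, y) for s, y in zip(samples, labels) if y == 1]
    neg = [(s, y) for s, y in zip(samples, labels) if y == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    balanced = pos + neg + extra
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)
```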
Subject(s)
Algorithms, Antimicrobial Peptides/chemistry, Computational Biology/methods, Natural Language Processing, Software, Antimicrobial Peptides/pharmacology, Genetic Databases, Reproducibility of Results
ABSTRACT
MicroRNA (miRNA) plays an important role in the occurrence, development, diagnosis and treatment of diseases, and more and more researchers are paying attention to the relationship between miRNAs and diseases. Compared with traditional biological experiments, computational methods that integrate heterogeneous biological data to predict potential associations can effectively save time and cost. Considering the limitations of previous computational models, we developed a deep-belief network model for miRNA-disease association prediction (DBNMDA). We constructed feature vectors to pre-train restricted Boltzmann machines for all miRNA-disease pairs and used positive samples plus the same number of selected negative samples to fine-tune the DBN and obtain the final predicted scores. Compared with previous supervised models that use only labeled pairs for training, DBNMDA innovatively utilizes the information of all miRNA-disease pairs during the pre-training process. This step reduces, to some extent, the impact of having too few known associations on prediction accuracy. DBNMDA achieves an AUC of 0.9104 under global leave-one-out cross-validation (LOOCV), an AUC of 0.8232 under local LOOCV and an average AUC of 0.9048 ± 0.0026 under 5-fold cross-validation. These AUCs are better than those of previous models. In addition, three different types of case studies for three diseases were conducted to demonstrate the accuracy of DBNMDA. As a result, 84% (breast neoplasms), 100% (lung neoplasms) and 88% (esophageal neoplasms) of the top 50 predicted miRNAs were verified by recent literature. We therefore conclude that DBNMDA is an effective method for predicting potential miRNA-disease associations.
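Global LOOCV, as used to obtain the 0.9104 AUC above, holds out each known pair in turn, trains on the rest, and scores the held-out pair. A generic driver for that protocol is sketched below; the `train_and_score` callback is a placeholder for any model, not DBNMDA's code.

```python
def loocv_scores(samples, train_and_score):
    """Leave-one-out cross-validation: for each sample, train on all the
    others and return the model's score for the held-out sample."""
    scores = []
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        scores.append(train_and_score(train, samples[i]))
    return scores
```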
Subject(s)
Genetic Predisposition to Disease, MicroRNAs/genetics, Breast Neoplasms, Humans, Lung Neoplasms, Reproducibility of Results
ABSTRACT
The idea of the person re-identification (Re-ID) task is to find the person depicted in a query image among images obtained from different cameras. Algorithms solving this task have important practical applications, such as illegal-action prevention and searching for missing persons through a smart city's video surveillance. In most papers devoted to this problem, the authors propose complex algorithms to achieve better person Re-ID quality; some of these methods cannot be used in practice due to technical limitations. In this paper, we propose several approaches that can be used with almost all popular modern re-identification algorithms to improve solution quality with practically no increase in computational complexity. In real-world data, low-quality images can be fed to the input of a Re-ID algorithm; therefore, we propose a new Filter Module designed to pre-filter input data before feeding it to the main re-identification algorithm. The Filter Module improves the baseline by 2.6% on the Rank1 metric and 3.4% on the mAP metric on the Market-1501 dataset. Furthermore, we propose a fully automated strategy for collecting data from surveillance cameras for self-supervised pre-training, in order to increase the generality of neural networks on real-world data. Self-supervised pre-training on data collected using the proposed strategy improves the quality of cross-domain upper-body Re-ID on the DukeMTMC-reID dataset by 1.0% on both the Rank1 and mAP metrics.
ABSTRACT
Building a good speech recognition system usually requires a large amount of paired data, which poses a big challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it is rarely used for Kazakh and other Central and West Asian languages. In this paper, wav2vec2.0 is improved by integrating a factorized TDNN layer to better preserve the relationship between the voice and the time steps before and after quantization; the resulting model is called wav2vec-F. The unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio data and is applied to the cross-language ASR task, optimized using a noise-contrastive binary classification task. At the same time, speech synthesis is used to boost speech recognition performance. Experiments show that wav2vec-F can effectively utilize unlabeled data from non-target languages, and that multi-language pre-training is clearly better than single-language pre-training. The data augmentation method using speech synthesis brings substantial benefits. Compared with the baseline model, the word error rate on LibriSpeech's test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training using only Kazakh reduced the word error rate by 3.8%. Pre-training on a small amount of Kazakh speech data combined with multi-language TTS-synthesized speech achieved a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to the results of the previous end-to-end model while using 30 times less labeled data.
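The word error rate figures above are computed from word-level edit distance. A self-contained reference implementation of standard WER (a generic utility, not code from the paper):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between reference
    and hypothesis transcripts, divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution or match
        prev = cur
    return prev[len(hyp)] / max(len(ref), 1)
```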
Subject(s)
Speech Perception, Speech, Language, Speech Recognition Software, Noise
ABSTRACT
This study proposes a novel off-screen sound separation method based on audio-visual pre-training. In the field of audio-visual analysis, researchers have leveraged visual information for audio manipulation tasks, such as sound source separation. Although such audio manipulation tasks are based on correspondences between audio and video, these correspondences are not always established. Specifically, sounds coming from outside a screen have no audio-visual correspondences and thus interfere with conventional audio-visual learning. The proposed method separates such off-screen sounds based on their arrival directions using binaural audio, which provides us with three-dimensional sensation. Furthermore, we propose a new pre-training method that can consider the off-screen space and use the obtained representation to improve off-screen sound separation. Consequently, the proposed method can separate off-screen sounds irrespective of the direction from which they arrive. We conducted our evaluation using generated video data to circumvent the problem of difficulty in collecting ground truth for off-screen sounds. We confirmed the effectiveness of our methods through off-screen sound detection and separation tasks.
ABSTRACT
In harsh weather conditions, the infrared modality can supplement or even replace the visible modality. However, the lack of a large-scale dataset for infrared features hinders the generation of a robust pre-training model. Most existing infrared object-detection algorithms rely on pre-training models from the visible modality, which can accelerate network convergence but also limit performance due to modality differences. In order to provide more reliable feature representation for cross-modality object detection and enhance its performance, this paper investigates the impact of various task-relevant features on cross-modality object detection and proposes a knowledge transfer algorithm based on classification and localization decoupling analysis. A task-decoupled pre-training method is introduced to adjust the attributes of various tasks learned by the pre-training model. For the training phase, a task-relevant hyperparameter evolution method is proposed to increase the network's adaptability to attribute changes in pre-training weights. Our proposed method improves the accuracy of multiple modalities in multiple datasets, with experimental results on the FLIR ADAS dataset reaching a state-of-the-art level and surpassing most multi-spectral object-detection methods.