Search | Nursing VHL Search Portal

1.

ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction.

Zhu, Yi-Heng; Liu, Zi; Liu, Yan; Ji, Zhiwei; Yu, Dong-Jun.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38349057

ABSTRACT

Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.

Subject(s)

Data Analysis , Language , Binding Sites , Amino Acid Sequence , Databases, Factual

2.

GMFGRN: a matrix factorization and graph neural network approach for gene regulatory network inference.

Li, Shuo; Liu, Yan; Shen, Long-Chen; Yan, He; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 25(2)2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38261340

ABSTRACT

The recent advances of single-cell RNA sequencing (scRNA-seq) have enabled reliable profiling of gene expression at the single-cell level, providing opportunities for accurate inference of gene regulatory networks (GRNs) on scRNA-seq data. Most methods for inferring GRNs suffer from the inability to eliminate transitive interactions or necessitate expensive computational resources. To address these, we present a novel method, termed GMFGRN, for accurate graph neural network (GNN)-based GRN inference from scRNA-seq data. GMFGRN employs GNN for matrix factorization and learns representative embeddings for genes. For transcription factor-gene pairs, it utilizes the learned embeddings to determine whether they interact with each other. The extensive suite of benchmarking experiments encompassing eight static scRNA-seq datasets alongside several state-of-the-art methods demonstrated mean improvements of 1.9 and 2.5% over the runner-up in area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). In addition, across four time-series datasets, maximum enhancements of 2.4 and 1.3% in AUROC and AUPRC were observed in comparison to the runner-up. Moreover, GMFGRN requires significantly less training time and memory consumption, with time and memory consumed <10% compared to the second-best method. These findings underscore the substantial potential of GMFGRN in the inference of GRNs. It is publicly available at https://github.com/Lishuoyy/GMFGRN.

Subject(s)

Benchmarking , Gene Regulatory Networks , Area Under Curve , Learning , Neural Networks, Computer
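
The GMFGRN abstract above reports its gains in AUROC and AUPRC. As a minimal sketch of how these two ranking metrics can be computed for scored transcription factor-gene pairs, the following uses the rank-sum identity for AUROC and average precision as one common estimator of AUPRC. The function names and toy scores are illustrative assumptions, not taken from the GMFGRN code base.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Fraction of (positive, negative) pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(labels, scores):
    """Average precision: mean of the precision at the rank of each positive.

    This is one common estimator of the area under the precision-recall
    curve; tied scores are ordered arbitrarily by the sort.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


if __name__ == "__main__":
    # Hypothetical toy data: 1 = true regulatory pair, 0 = non-interacting pair.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(auroc(labels, scores), auprc(labels, scores))
```

AUPRC is usually the more informative of the two when true regulatory pairs are rare, because it does not reward correctly ranked negatives.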

3.

TripletCell: a deep metric learning framework for accurate annotation of cell types at the single-cell level.

Liu, Yan; Wei, Guo; Li, Chen; Shen, Long-Chen; Gasser, Robin B; Song, Jiangning; Chen, Dijun; Yu, Dong-Jun.

Brief Bioinform ; 24(3)2023 05 19.

Article in English | MEDLINE | ID: mdl-37080771

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has significantly accelerated the experimental characterization of distinct cell lineages and types in complex tissues and organisms. Cell-type annotation is of great importance in most of the scRNA-seq analysis pipelines. However, manual cell-type annotation heavily relies on the quality of scRNA-seq data and marker genes, and therefore can be laborious and time-consuming. Furthermore, the heterogeneity of scRNA-seq datasets poses another challenge for accurate cell-type annotation, such as the batch effect induced by different scRNA-seq protocols and samples. To overcome these limitations, here we propose a novel pipeline, termed TripletCell, for cross-species, cross-protocol and cross-sample cell-type annotation. We developed a cell embedding and dimension-reduction module for the feature extraction (FE) in TripletCell, namely TripletCell-FE, to leverage the deep metric learning-based algorithm for the relationships between the reference gene expression matrix and the query cells. Our experimental studies on 21 datasets (covering nine scRNA-seq protocols, two species and three tissues) demonstrate that TripletCell outperformed state-of-the-art approaches for cell-type annotation. More importantly, regardless of protocols or species, TripletCell can deliver outstanding and robust performance in annotating different types of cells. TripletCell is freely available at https://github.com/liuyan3056/TripletCell. We believe that TripletCell is a reliable computational tool for accurately annotating various cell types using scRNA-seq data and will be instrumental in assisting the generation of novel biological hypotheses in cell biology.

Subject(s)

Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Cluster Analysis
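
The TripletCell abstract above describes deep metric learning between reference and query cells but does not state the loss function. Assuming a standard triplet margin loss (suggested by the method's name, not confirmed by the abstract), the idea can be sketched as follows: embeddings of cells of the same type are pulled together, embeddings of different types are pushed apart, and a query cell takes the label of its nearest reference embedding. All names and the margin value are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on a single (anchor, positive, negative) triple.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def annotate(query, reference):
    """Label a query embedding with the cell type of its nearest reference embedding.

    `reference` is a list of (embedding, cell_type) pairs.
    """
    return min(reference, key=lambda item: euclidean(query, item[0]))[1]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings for illustration only.
    reference = [([0.0, 0.0], "T cell"), ([5.0, 5.0], "B cell")]
    print(triplet_loss([0, 0], [0, 1], [3, 4]))  # well-separated triple
    print(annotate([0.1, 0.0], reference))
```

In a trained model the embeddings would come from a neural network; here they are fixed vectors so that only the loss and the nearest-neighbour assignment are shown.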

4.

VPatho: a deep learning-based two-stage approach for accurate prediction of gain-of-function and loss-of-function variants.

Ge, Fang; Li, Chen; Iqbal, Shahid; Muhammad, Arif; Li, Fuyi; Thafar, Maha A; Yan, Zihao; Worachartcheewan, Apilak; Xu, Xiaofeng; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 24(1)2023 01 19.

Article in English | MEDLINE | ID: mdl-36528806

ABSTRACT

Determining the pathogenicity and functional impact (i.e. gain-of-function; GOF or loss-of-function; LOF) of a variant is vital for unraveling the genetic level mechanisms of human diseases. To provide a 'one-stop' framework for the accurate identification of pathogenicity and functional impact of variants, we developed a two-stage deep-learning-based computational solution, termed VPatho, which was trained using a total of 9619 pathogenic GOF/LOF and 138 026 neutral variants curated from various databases. A total number of 138 variant-level, 262 protein-level and 103 genome-level features were extracted for constructing the models of VPatho. The development of VPatho consists of two stages: (i) a random under-sampling multi-scale residual neural network (ResNet) with a newly defined weighted-loss function (RUS-Wg-MSResNet) was proposed to predict variants' pathogenicity on the gnomAD_NV + GOF/LOF dataset; and (ii) an XGBOD model was constructed to predict the functional impact of the given variants. Benchmarking experiments demonstrated that RUS-Wg-MSResNet achieved the highest prediction performance with the weights calculated based on the ratios of neutral versus pathogenic variants. Independent tests showed that both RUS-Wg-MSResNet and XGBOD achieved outstanding performance. Moreover, assessed using variants from the CAGI6 competition, RUS-Wg-MSResNet achieved superior performance compared to state-of-the-art predictors. The fine-trained XGBOD models were further used to blind test the whole LOF data downloaded from gnomAD and accordingly, we identified 31 nonLOF variants that were previously labeled as LOF/uncertain variants. As an implementation of the developed approach, a webserver of VPatho is made publicly available at http://csbio.njust.edu.cn/bioinf/vpatho/ to facilitate community-wide efforts for profiling and prioritizing the query variants with respect to their pathogenicity and functional impact.

Subject(s)

Deep Learning , Humans , Gain of Function Mutation , Genome
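
The VPatho abstract above mentions two remedies for class imbalance: random under-sampling and a loss whose weights are based on the ratio of neutral to pathogenic variants. The exact weighting formula and the way the two are combined are not given in the abstract, so the following is only a sketch under the assumption that the minority (pathogenic) class is weighted by the plain neutral:pathogenic ratio. The two techniques are shown independently, and all function names are hypothetical.

```python
import math
import random


def class_weights(n_neutral, n_pathogenic):
    """Assumed scheme: weight pathogenic samples by the neutral:pathogenic ratio."""
    return {0: 1.0, 1: n_neutral / n_pathogenic}


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Class-weighted binary cross-entropy, averaged over samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -weights[y] * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def random_under_sample(samples, labels, seed=0):
    """Randomly drop neutral (label 0) samples until both classes are the same size.

    Assumes the neutral class is the majority class.
    """
    rng = random.Random(seed)
    neutral = [s for s, y in zip(samples, labels) if y == 0]
    pathogenic = [s for s, y in zip(samples, labels) if y == 1]
    kept = rng.sample(neutral, min(len(neutral), len(pathogenic)))
    balanced = [(s, 0) for s in kept] + [(s, 1) for s in pathogenic]
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Variant counts quoted in the abstract: 138 026 neutral, 9619 pathogenic.
    print(class_weights(138026, 9619))
    print(random_under_sample(list(range(10)), [1, 1] + [0] * 8))
```

With the counts quoted in the abstract, this assumed scheme would weight each pathogenic variant roughly 14 times as heavily as a neutral one.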

5.

MINDG: a drug-target interaction prediction method based on an integrated learning algorithm.

Yang, Hailong; Chen, Yue; Zuo, Yun; Deng, Zhaohong; Pan, Xiaoyong; Shen, Hong-Bin; Choi, Kup-Sze; Yu, Dong-Jun.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38483285

ABSTRACT

MOTIVATION: Drug-target interaction (DTI) prediction refers to the prediction of whether a given drug molecule will bind to a specific target and thus exert a targeted therapeutic effect. Although intelligent computational approaches for drug target prediction have received much attention and made many advances, DTI prediction remains a challenging task that requires further research. The main challenges are as follows: (i) most graph neural network-based methods only consider the information of the first-order neighboring nodes (drug and target) in the graph, without learning deeper and richer structural features from the higher-order neighboring nodes; (ii) existing methods do not consider both the sequence and structural features of drugs and targets, so each method works in isolation and cannot combine the advantages of sequence and structural features to improve the interactive learning effect. RESULTS: To address the above challenges, a Multi-view Integrated learning Network that integrates Deep learning and Graph Learning (MINDG) is proposed in this study, which consists of the following parts: (i) a mixed deep network is used to extract sequence features of drugs and targets, (ii) a higher-order graph attention convolutional network is proposed to better extract and capture structural features, and (iii) a multi-view adaptive integrated decision module is used to improve and complement the initial prediction results of the above two networks to enhance the prediction performance. We evaluate MINDG on two datasets and show that it improves DTI prediction performance compared to state-of-the-art baselines. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnuaipr/MINDG.

Subject(s)

Algorithms , Neural Networks, Computer

6.

Prediction of disease-associated nsSNPs by integrating multi-scale ResNet models with deep feature fusion.

Ge, Fang; Zhang, Ying; Xu, Jian; Muhammad, Arif; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 23(1)2022 01 17.

Article in English | MEDLINE | ID: mdl-34953462

ABSTRACT

More than 6000 human diseases have been recorded to be caused by non-synonymous single nucleotide polymorphisms (nsSNPs). Rapid and accurate prediction of pathogenic nsSNPs can improve our understanding of disease mechanisms and the design of new drugs, yet it remains an unresolved challenge. In the present work, a new computational approach, termed MSRes-MutP, is proposed based on ResNet blocks with multi-scale kernel size to predict disease-associated nsSNPs. Feeding MSRes-MutP the serial concatenation of the four extracted feature types does not appreciably improve its performance. To address this, a second model, FFMSRes-MutP, is developed, which utilizes a deep feature fusion strategy and multi-scale 2D-ResNet and 1D-ResNet blocks to extract relevant two-dimensional features and physicochemical properties. FFMSRes-MutP with the concatenated features achieves better performance than with individual features. The performance of FFMSRes-MutP is benchmarked on five different datasets. It achieves Matthews correlation coefficients (MCC) of 0.593 and 0.618 on the PredictSNP and MMP datasets, which are 0.101 and 0.210 higher than those of the existing best method, PredictSNP1. When tested on the HumDiv and HumVar datasets, it achieves MCC of 0.9605 and 0.9507, and area under curve (AUC) of 0.9796 and 0.9748, which are 0.1747 and 0.2669, 0.0853 and 0.1335, respectively, higher than the existing best methods PolyPhen-2 and FATHMM (weighted). In addition, in a blind test using a third-party dataset, FFMSRes-MutP performs as the second-best predictor (with MCC and AUC of 0.5215 and 0.7633, respectively) when compared with the other four predictors. Extensive benchmarking experiments demonstrate that FFMSRes-MutP achieves effective feature fusion and can be explored as a useful approach for predicting disease-associated nsSNPs. The webserver is freely available at http://csbio.njust.edu.cn/bioinf/ffmsresmutp/ for academic use.

Subject(s)

Deep Learning , Disease/genetics , Polymorphism, Single Nucleotide , Algorithms , Area Under Curve , Cellular Microenvironment , Computational Biology/methods , Humans , Mutation , Pharmaceutical Preparations
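
The FFMSRes-MutP abstract above reports results as Matthews correlation coefficients. For readers unfamiliar with the metric, the following minimal sketch computes MCC from the four cells of a binary confusion matrix. The counts in the usage example are made up for illustration and are not results from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when any marginal total is zero,
    which is the usual convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts: 40 true positives, 45 true negatives,
    # 5 false positives, 10 false negatives.
    print(round(mcc(40, 45, 5, 10), 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when disease-associated and neutral variants are unevenly represented.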

7.

MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network.

Han, Ke; Shen, Long-Chen; Zhu, Yi-Heng; Xu, Jian; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 23(1)2022 01 17.

Article in English | MEDLINE | ID: mdl-34664074

ABSTRACT

Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.

Subject(s)

Deep Learning , Binding Sites/genetics , Neural Networks, Computer , Protein Binding , Transcription Factors/metabolism

8.

MDGF-MCEC: a multi-view dual attention embedding model with cooperative ensemble learning for CircRNA-disease association prediction.

Wu, Qunzhuo; Deng, Zhaohong; Pan, Xiaoyong; Shen, Hong-Bin; Choi, Kup-Sze; Wang, Shitong; Wu, Jing; Yu, Dong-Jun.

Brief Bioinform ; 23(5)2022 09 20.

Article in English | MEDLINE | ID: mdl-35907779

ABSTRACT

Circular RNA (circRNA) is closely involved in physiological and pathological processes of many diseases. Discovering the associations between circRNAs and diseases is of great significance. Because verifying circRNA-disease associations by wet-lab experiments is costly, computational approaches for predicting these associations have become a promising research direction. In this paper, we propose a method, MDGF-MCEC, based on a multi-view dual attention graph convolution network (GCN) with cooperative ensemble learning to predict circRNA-disease associations. First, MDGF-MCEC constructs two disease relation graphs and two circRNA relation graphs based on different similarities. Then, the relation graphs are fed into a multi-view GCN for representation learning. In order to learn highly discriminative features, a dual-attention mechanism is introduced to adjust the contribution weights of different features at both the channel level and the spatial level. Based on the learned embedding features of diseases and circRNAs, nine different feature combinations between diseases and circRNAs are treated as new multi-view data. Finally, we construct a multi-view cooperative ensemble classifier to predict the associations between circRNAs and diseases. Experiments conducted on the CircR2Disease database demonstrate that the proposed MDGF-MCEC model achieves a high area under curve of 0.9744 and outperforms the state-of-the-art methods. Promising results are also obtained from experiments on the circ2Disease and circRNADisease databases. Furthermore, the predicted associated circRNAs for hepatocellular carcinoma and gastric cancer are supported by the literature. The code and dataset of this study are available at https://github.com/ABard0/MDGF-MCEC.

Subject(s)

RNA, Circular , Stomach Neoplasms , Humans , Intercellular Signaling Peptides and Proteins , Machine Learning , Stomach Neoplasms/genetics

9.

PScL-2LSAESM: bioimage-based prediction of protein subcellular localization by integrating heterogeneous features with the two-level SAE-SM and mean ensemble method.

Ullah, Matee; Hadi, Fazal; Song, Jiangning; Yu, Dong-Jun.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36413068

ABSTRACT

MOTIVATION: Over the past decades, a variety of in silico methods have been developed to predict protein subcellular localization within cells. However, a common and major challenge in the design and development of such methods is how to effectively utilize the heterogeneous feature sets extracted from bioimages. In this regard, limited efforts have been undertaken. RESULTS: We propose a new two-level stacked autoencoder network (termed 2L-SAE-SM) to improve prediction performance by integrating the heterogeneous feature sets. In particular, in the first level of 2L-SAE-SM, each optimal heterogeneous feature set is fed to train our designed stacked autoencoder network (SAE-SM). All the trained SAE-SMs in the first level can output the decision sets based on their respective optimal heterogeneous feature sets, known as 'intermediate decision' sets. Such intermediate decision sets are then ensembled using the mean ensemble method to generate the 'intermediate feature' set for the second-level SAE-SM. Using the proposed framework, we further develop a novel predictor, referred to as PScL-2LSAESM, to characterize image-based protein subcellular localization. Extensive benchmarking experiments on the latest benchmark training and independent test datasets collected from the human protein atlas databank demonstrate the effectiveness of the proposed 2L-SAE-SM framework for the integration of heterogeneous feature sets. Moreover, performance comparison of the proposed PScL-2LSAESM with current state-of-the-art methods further illustrates that PScL-2LSAESM clearly outperforms the existing state-of-the-art methods for the task of protein subcellular localization. AVAILABILITY AND IMPLEMENTATION: https://github.com/csbio-njust-edu/PScL-2LSAESM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology , Humans , Protein Transport , Computational Biology/methods
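
The PScL-2LSAESM abstract above says that the first-level 'intermediate decision' sets are combined with the mean ensemble method before being passed to the second level. A minimal sketch of that averaging step, assuming each first-level model emits one score per class for every sample, might look like this. The function name and the data layout are assumptions for illustration.

```python
def mean_ensemble(decision_sets):
    """Average per-class scores across first-level models, sample by sample.

    `decision_sets[m][i][c]` is the score that model m assigns to class c
    for sample i. The result has one averaged score vector per sample and
    would serve as the input features of a second-level model.
    """
    n_models = len(decision_sets)
    n_samples = len(decision_sets[0])
    n_classes = len(decision_sets[0][0])
    return [
        [sum(d[i][c] for d in decision_sets) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


if __name__ == "__main__":
    # Two hypothetical first-level models, one sample, two localization classes.
    model_a = [[0.2, 0.8]]
    model_b = [[0.6, 0.4]]
    print(mean_ensemble([model_a, model_b]))
```

Averaging keeps the dimensionality of the intermediate features equal to the number of classes regardless of how many first-level models are trained.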

10.

Interpretable prediction models for widespread m6A RNA modification across cell lines and tissues.

Zhang, Ying; Wang, Zhikang; Zhang, Yiwen; Li, Shanshan; Guo, Yuming; Song, Jiangning; Yu, Dong-Jun.

Bioinformatics ; 39(12)2023 12 01.

Article in English | MEDLINE | ID: mdl-37995291

ABSTRACT

MOTIVATION: RNA N6-methyladenosine (m6A) in Homo sapiens plays vital roles in a variety of biological functions. Precise identification of m6A modifications is thus essential to the elucidation of their biological functions and underlying molecular-level mechanisms. Currently available high-throughput single-nucleotide-resolution m6A modification data have considerably accelerated the identification of RNA modification sites through the development of data-driven computational methods. Nevertheless, existing methods cover only a limited number of single-nucleotide-resolution cell lines and offer poor model interpretability, which limits their applicability. RESULTS: In this study, we present CLSM6A, comprising a set of deep learning-based models designed for predicting single-nucleotide-resolution m6A RNA modification sites across eight different cell lines and three tissues. Extensive benchmarking experiments are conducted on well-curated datasets, in which CLSM6A outperforms current state-of-the-art methods. Furthermore, CLSM6A is capable of interpreting the prediction decision-making process by excavating critical motifs activated by filters and pinpointing highly concerned positions in both forward and backward propagations. CLSM6A exhibits better portability on similar cross-cell line/tissue datasets, reveals a strong association between highly activated motifs and high-impact motifs, and demonstrates complementary attributes of different interpretation strategies. AVAILABILITY AND IMPLEMENTATION: The webserver is available at http://csbio.njust.edu.cn/bioinf/clsm6a. The datasets and code are available at https://github.com/zhangying-njust/CLSM6A/.

Subject(s)

Nucleotides , RNA , Humans , RNA/metabolism , Adenosine/genetics , Adenosine/metabolism , Sequence Analysis, RNA/methods

11.

MLNGCF: circRNA-disease associations prediction with multilayer attention neural graph-based collaborative filtering.

Wu, Qunzhuo; Deng, Zhaohong; Zhang, Wei; Pan, Xiaoyong; Choi, Kup-Sze; Zuo, Yun; Shen, Hong-Bin; Yu, Dong-Jun.

Bioinformatics ; 39(8)2023 08 01.

Article in English | MEDLINE | ID: mdl-37561093

ABSTRACT

MOTIVATION: CircRNAs play a critical regulatory role in physiological processes, and the abnormal expression of circRNAs can mediate the processes of diseases. Therefore, exploring circRNA-disease associations is gradually becoming an important area of research. Due to the high cost of validating circRNA-disease associations using traditional wet-lab experiments, novel computational methods based on machine learning are gaining more and more attention in this field. However, current computational methods suffer from insufficient consideration of latent features in circRNA-disease interactions. RESULTS: In this study, a multilayer attention neural graph-based collaborative filtering (MLNGCF) method is proposed. MLNGCF first enhances multiple types of biological information with an autoencoder to form the initial features of circRNAs and diseases. Then, by constructing a central network of different diseases and circRNAs, a multilayer cooperative attention-based message propagation is performed on the central network to obtain the high-order features of circRNAs and diseases. A neural network-based collaborative filtering is constructed to predict the unknown circRNA-disease associations and update the model parameters. Experiments on the benchmark datasets demonstrate that MLNGCF outperforms state-of-the-art methods, and the prediction results are supported by the literature in the case studies. AVAILABILITY AND IMPLEMENTATION: The source codes and benchmark datasets of MLNGCF are available at https://github.com/ABard0/MLNGCF.

Subject(s)

Neural Networks, Computer , RNA, Circular , Machine Learning , Software , Computational Biology/methods

12.

FCMSTrans: Accurate Prediction of Disease-Associated nsSNPs by Utilizing Multiscale Convolution and Deep Feature Combination within a Transformer Framework.

Zhang, Ming; Gong, Chao; Ge, Fang; Yu, Dong-Jun.

J Chem Inf Model ; 64(4): 1394-1406, 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-38349747

ABSTRACT

Nonsynonymous single-nucleotide polymorphisms (nsSNPs), implicated in over 6000 diseases, necessitate accurate prediction for expedited drug discovery and improved disease diagnosis. In this study, we propose FCMSTrans, a novel nsSNP predictor that innovatively combines the transformer framework and multiscale modules for comprehensive feature extraction. The distinctive attribute of FCMSTrans resides in a deep feature combination strategy. This strategy amalgamates evolutionary-scale modeling (ESM) and ProtTrans (PT) features, providing an understanding of protein biochemical properties, and position-specific scoring matrix, secondary structure, predicted relative solvent accessibility, and predicted disorder (PSPP) features, which are derived from four protein sequences and structure-oriented characteristics. This feature combination offers a comprehensive view of the molecular dynamics involving nsSNPs. Our model employs the transformer's self-attention mechanisms across multiple layers, extracting higher-level and abstract representations. Simultaneously, varied-level features are captured by multiscale convolutions, enriching feature abstraction at multiple echelons. Our comparative analyses with existing methodologies highlight significant improvements made possible by the integrated feature fusion approach adopted in FCMSTrans. This is further substantiated by performance assessments based on diverse data sets, such as PredictSNP, MMP, and PMD, with areas under the curve (AUCs) of 0.869, 0.819, and 0.693, respectively. Furthermore, FCMSTrans shows robustness and superiority by outperforming the current best predictor, PROVEAN, in a blind test conducted on a third-party data set, achieving an impressive AUC score of 0.7838. The Python code of FCMSTrans is available at https://github.com/gc212/FCMSTrans for academic usage.

Subject(s)

Drug Discovery , Electric Power Supplies , Amino Acid Sequence , Area Under Curve , Polymorphism, Single Nucleotide

13.

TM-search: An Efficient and Effective Tool for Protein Structure Database Search.

Liu, Zi; Zhang, Chengxin; Zhang, Qidi; Zhang, Yang; Yu, Dong-Jun.

J Chem Inf Model ; 64(3): 1043-1049, 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38270339

ABSTRACT

The quickly increasing size of the Protein Data Bank is challenging biologists to develop a more scalable protein structure alignment tool for fast structure database search. Although many protein structure search algorithms and programs have been designed and implemented for this purpose, most require a large amount of computational time. We propose a novel protein structure search approach, TM-search, which is based on the pairwise structure alignment program TM-align and a new iterative clustering algorithm. Benchmark tests demonstrate that TM-search is 27 times faster than a TM-align full database search while still being able to identify ~90% of all high TM-score hits, which is 2-10 times more than other existing programs such as Foldseek, Dali, and PSI-BLAST.

Subject(s)

Algorithms , Proteins , Databases, Protein , Sequence Alignment , Proteins/chemistry , Benchmarking , Software

14.

TransEFVP: A Two-Stage Approach for the Prediction of Human Pathogenic Variants Based on Protein Sequence Embedding Fusion.

Yan, Zihao; Ge, Fang; Liu, Yan; Zhang, Yumeng; Li, Fuyi; Song, Jiangning; Yu, Dong-Jun.

J Chem Inf Model ; 64(4): 1407-1418, 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-38334115

ABSTRACT

Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is one of the crucial issues in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture: the first stage is designed to fuse different feature embeddings through a transformer encoder. In the second stage, a support vector machine model is employed to quantify the pathogenicity of SAVs after dimensionality reduction. The prediction performance of TransEFVP on blind test data achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, higher than the existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and codes for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.

Subject(s)

Algorithms , Proteins , Humans , Proteins/chemistry , Amino Acid Sequence , Neural Networks, Computer , Amino Acids

15.

Why can deep convolutional neural networks improve protein fold recognition? A visual explanation by interpretation.

Liu, Yan; Zhu, Yi-Heng; Song, Xiaoning; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33537753

ABSTRACT

As an essential task in protein structure and function prediction, protein fold recognition has attracted increasing attention. The majority of the existing machine learning-based protein fold recognition approaches strongly rely on handcrafted features, which depict the characteristics of different protein folds; however, effective feature extraction methods still represent the bottleneck for further performance improvement of protein fold recognition. As a powerful feature extractor, deep convolutional neural network (DCNN) can automatically extract discriminative features for fold recognition without human intervention, which has demonstrated an impressive performance on protein fold recognition. Despite the encouraging progress, DCNN often acts as a black box, and as such, it is challenging for users to understand what really happens in DCNN and why it works well for protein fold recognition. In this study, we explore the intrinsic mechanism of DCNN and explain why it works for protein fold recognition using a visual explanation technique. More specifically, we first trained a VGGNet-based DCNN model, termed VGGNet-FE, which can extract fold-specific features from the predicted protein residue-residue contact map for protein fold recognition. Subsequently, based on the trained VGGNet-FE, we implemented a new contact-assisted predictor, termed VGGfold, for protein fold recognition; we then visualized what features were extracted by each of the convolutional layers in VGGNet-FE using a deconvolution technique. Furthermore, we visualized the high-level semantic information, termed fold-discriminative region, of a predicted contact map from the localization map obtained from the last convolutional layer of VGGNet-FE. It is visually confirmed that VGGNet-FE could effectively extract distinct fold-discriminative regions for different types of protein folds, thereby accounting for the improved performance of VGGfold for protein fold recognition. 
In summary, this study is of great significance for both understanding the working principle of DCNNs in protein fold recognition and exploring the relationship between the predicted protein contact map and protein tertiary structure. This proposed visualization method is flexible and applicable to address other DCNN-based bioinformatics and computational biology questions. The online web server of VGGfold is freely available at http://csbio.njust.edu.cn/bioinf/vggfold/.

Subject(s)

Computational Biology/methods , Machine Learning , Neural Networks, Computer , Protein Folding , Proteins/chemistry , Data Visualization , Humans , Protein Interaction Maps , Protein Structure, Tertiary , Proteins/metabolism , Semantics

16.

SAResNet: self-attention residual network for predicting DNA-protein binding.

Shen, Long-Chen; Liu, Yan; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33837387

ABSTRACT

Knowledge of the specificity of DNA-protein binding is crucial for understanding the mechanisms of gene expression, regulation and gene therapy. In recent years, deep-learning-based methods for predicting DNA-protein binding from sequence data have achieved significant success. Nevertheless, the current state-of-the-art computational methods have some drawbacks associated with the use of limited datasets with insufficient experimental data. To address this, we propose a novel transfer learning-based method, termed SAResNet, which combines the self-attention mechanism and residual network structure. More specifically, the attention-driven module captures the position information of the sequence, while the residual network structure guarantees that the high-level features of the binding site can be extracted. Meanwhile, the pre-training strategy used by SAResNet improves the learning ability of the network and accelerates the convergence speed of the network during transfer learning. The performance of SAResNet is extensively tested on 690 datasets from the ChIP-seq experiments with an average AUC of 92.0%, which is 4.4% higher than that of the best state-of-the-art method currently available. When tested on smaller datasets, the predictive performance is more clearly improved. Overall, we demonstrate that the superior performance of DNA-protein binding prediction on DNA sequences can be achieved by combining the attention mechanism and residual structure, and a novel pipeline is accordingly developed. The proposed methodology is generally applicable and can be used to address any other sequence classification problems.

Subject(s)

Algorithms , Computational Biology/methods , DNA-Binding Proteins/metabolism , DNA/metabolism , Deep Learning , Neural Networks, Computer , Binding Sites/genetics , DNA/genetics , Humans , Internet , Protein Binding , Reproducibility of Results

17.

PScL-HDeep: image-based prediction of protein subcellular location in human tissue using ensemble learning of handcrafted and deep learned features with two-layer feature selection.

Ullah, Matee; Han, Ke; Hadi, Fazal; Xu, Jian; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-34337652

ABSTRACT

Protein subcellular localization plays a crucial role in characterizing the function of proteins and understanding various cellular processes. Therefore, accurate identification of protein subcellular location is an important yet challenging task. Numerous computational methods have been proposed to predict the subcellular location of proteins. However, most existing methods have limited capability in terms of the overall accuracy, time consumption and generalization power. To address these problems, in this study, we developed a novel computational approach based on Human Protein Atlas (HPA) data, referred to as PScL-HDeep, for accurate and efficient image-based prediction of protein subcellular location in human tissues. We extracted different handcrafted and deep learned (by employing a pretrained deep learning model) features from different viewpoints of the image. The step-wise discriminant analysis (SDA) algorithm was applied to generate the optimal feature set from each original raw feature set. To further obtain a more informative feature subset, the support vector machine-based recursive feature elimination with correlation bias reduction (SVM-RFE + CBR) feature selection algorithm was applied to the integrated feature set. Finally, the classification models, namely support vector machine with radial basis function (SVM-RBF) and support vector machine with linear kernel (SVM-LNR), were learned on the final selected feature set. To evaluate the performance of the proposed method, a new gold standard benchmark training dataset was constructed from the HPA databank. PScL-HDeep achieved the maximum performance on the 10-fold cross-validation test on this dataset and showed better efficacy than existing predictors. Furthermore, we also illustrated the generalization ability of the proposed method by conducting a stringent independent validation test.

Subject(s)

Deep Learning , Proteins/metabolism , Subcellular Fractions/metabolism , Computational Biology/methods , Humans , Support Vector Machine

18.

Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features.

Zhu, Yi-Heng; Hu, Jun; Ge, Fang; Li, Fuyi; Song, Jiangning; Zhang, Yang; Yu, Dong-Jun.

Brief Bioinform ; 22(3)2021 05 20.

Article in English | MEDLINE | ID: mdl-32436937

ABSTRACT

X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors. 
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.
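
As a reading aid for the Matthews correlation coefficient (MCC) figures quoted above, the following is a minimal sketch of how MCC is computed from a binary confusion matrix and how a relative gain in percent can be derived from two MCC values. It is not taken from the DCFCrystal pipeline; the function names and the example counts are hypothetical, and it is only an assumption that the reported 199.7% and 77.8% are relative gains of this kind.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is defined as 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, in percent (hypothetical helper)."""
    return (new - baseline) / abs(baseline) * 100.0


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    print(mcc(tp=40, tn=45, fp=5, fn=10))
    print(relative_gain_percent(new=0.3, baseline=0.1))
```

Because the gain is relative, a large percentage can correspond to a modest absolute change when the baseline MCC is small.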

Subject(s)

Computational Biology/methods , Crystallization/methods , Proteins/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Databases, Protein , Models, Chemical

19.

Improving protein fold recognition using triplet network and ensemble deep learning.

Liu, Yan; Han, Ke; Zhu, Yi-Heng; Zhang, Ying; Shen, Long-Chen; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-34226918

ABSTRACT

Protein fold recognition is a critical step toward protein structure and function prediction, aiming at providing the most likely fold type of the query protein. In recent years, the development of deep learning (DL) techniques has led to massive advances in this important field, and accordingly, the sensitivity of protein fold recognition has been dramatically improved. Most DL-based methods take an intermediate bottleneck layer as the feature representation of proteins with new fold types. However, this strategy is indirect, inefficient and conditional on the assumption that the bottleneck layer's representation is a good representation of proteins with new fold types. To address the above problem, in this work, we develop a new computational framework by combining triplet network and ensemble DL. We first train a DL-based model, termed FoldNet, which employs triplet loss to train the deep convolutional network. FoldNet directly optimizes the protein fold embedding itself, bringing proteins with the same fold type closer to each other than those with different fold types in the new protein embedding space. Subsequently, using the trained FoldNet, we implement a new residue-residue contact-assisted predictor, termed FoldTR, which improves protein fold recognition. Furthermore, we propose a new ensemble DL method, termed FSD_XGBoost, which combines protein fold embedding with the other two discriminative fold-specific features extracted by two DL-based methods SSAfold and DeepFR. The Top 1 sensitivity of FSD_XGBoost increases to 74.8% at the fold level, which is ~9% higher than that of the state-of-the-art method. Together, the results suggest that fold-specific features extracted by different DL methods complement each other, and their combination can further improve fold recognition at the fold level. The implemented web server of FoldTR and benchmark datasets are publicly available at http://csbio.njust.edu.cn/bioinf/foldtr/.
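
The triplet loss mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give FoldNet's exact formulation, so the version below is the generic hinge-style triplet loss on plain (unsquared) Euclidean distances; the margin value, the function names and the toy embeddings are all assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Generic triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    `positive` shares the anchor's fold type, `negative` does not.
    The loss is zero once the negative is farther away than the positive
    by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Toy 2-D embeddings, not real protein features.
    anchor, same_fold, other_fold = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
    print(triplet_loss(anchor, same_fold, other_fold))   # well separated
    print(triplet_loss(anchor, other_fold, same_fold))   # roles swapped, so penalized
```

Minimizing this quantity over many triplets is what pulls same-fold embeddings together and pushes different-fold embeddings apart, which matches the behaviour the abstract describes for the embedding space.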

Subject(s)

Computational Biology/methods , Deep Learning , Models, Molecular , Protein Conformation , Protein Folding , Proteins/chemistry , Algorithms , Databases, Protein , Neural Networks, Computer , Reproducibility of Results , Sensitivity and Specificity

20.

Leveraging the attention mechanism to improve the identification of DNA N6-methyladenine sites.

Zhang, Ying; Liu, Yan; Xu, Jian; Wang, Xiaoyu; Peng, Xinxin; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-34459479

ABSTRACT

DNA N6-methyladenine (6mA) is an important type of DNA modification that plays key roles in multiple biological processes. Despite the recent progress in developing DNA 6mA site prediction methods, several challenges remain to be addressed. For example, although the hand-crafted features are interpretable, they contain redundant information that may bias the model training and have a negative impact on the trained model. Furthermore, although deep learning (DL)-based models can perform feature extraction and classification automatically, they lack the interpretability of the crucial features learned by those models. As such, considerable research efforts have been focused on achieving the trade-off between the interpretability and straightforwardness of DL neural networks. In this study, we develop two new DL-based models for improving the prediction of N6-methyladenine sites, termed LA6mA and AL6mA, which use bidirectional long short-term memory to capture long-range information and a self-attention mechanism to extract the key position information from DNA sequences. The performance of the two proposed methods is benchmarked and evaluated on the two model organisms Arabidopsis thaliana and Drosophila melanogaster. On the two benchmark datasets, LA6mA achieves an area under the receiver operating characteristic curve (AUROC) value of 0.962 and 0.966, whereas AL6mA achieves an AUROC value of 0.945 and 0.941, respectively. Moreover, an in-depth analysis of the attention matrix is conducted to interpret the important information, which is hidden in the sequence and relevant for 6mA site prediction. The two novel pipelines developed for DNA 6mA site prediction in this work will facilitate a better understanding of the underlying principle of DL-based DNA methylation site prediction and its future applications.
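
To make the notion of an attention matrix over a DNA sequence concrete, here is a minimal sketch of scaled dot-product self-attention on a one-hot encoded sequence. It is not the LA6mA or AL6mA architecture: there are no learned query, key or value projections and no LSTM, and the encoding, function names and example sequence are assumptions chosen only to show what the rows of such a matrix represent.

```python
import math
from typing import List

# One-hot encoding of the four nucleotides (an illustrative choice).
ONE_HOT = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
}


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a list of one-hot vectors."""
    return [ONE_HOT[base] for base in seq.upper()]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: List[List[float]]):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights).

    Returns (weights, output): weights[i][j] is how strongly position i
    attends to position j, and output[i] is the weighted sum of all positions.
    """
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(r) for r in scores]
    output = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]
    return weights, output


if __name__ == "__main__":
    weights, _ = self_attention(one_hot("ACGA"))
    for row in weights:
        print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so it can be read as a distribution over sequence positions; inspecting such rows in a trained model is the kind of analysis the abstract refers to when it interprets the attention matrix.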

Subject(s)

Adenosine/analogs & derivatives , Computational Biology/methods , DNA Methylation , DNA/genetics , Epigenomics/methods , DNA/chemistry , Deep Learning
