Search | State Department of Health

1.

ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction.

Zhu, Yi-Heng; Liu, Zi; Liu, Yan; Ji, Zhiwei; Yu, Dong-Jun.

Brief Bioinform ; 25(2), 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38349057

ABSTRACT

Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.

Subjects

Data Analysis , Language , Binding Sites , Amino Acid Sequence , Databases, Factual

2.

MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network.

Han, Ke; Shen, Long-Chen; Zhu, Yi-Heng; Xu, Jian; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 23(1), 2022 01 17.

Article in English | MEDLINE | ID: mdl-34664074

ABSTRACT

Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.

Subjects

Deep Learning , Binding Sites/genetics , Neural Networks, Computer , Protein Binding , Transcription Factors/metabolism

3.

Comparative study on the epidemiological characteristics and hazards of respiratory syncytial virus and influenza virus infections among elderly people.

Yu, Jiangtao; Liu, Na; Zhu, Yiheng; Wang, Wenyu; Fan, Xianquan; Yuan, Xuan; Xu, Juan; Zheng, Benfeng; Luan, Lin.

BMC Infect Dis ; 24(1): 1129, 2024 Oct 09.

Article in English | MEDLINE | ID: mdl-39385082

ABSTRACT

OBJECTIVE: To investigate the epidemiological characteristics of respiratory syncytial virus (RSV) and influenza virus infections in hospitalized elderly patients with respiratory tract infections in Suzhou City, China, and to compare the clinical characteristics and economic burden associated with these two infections. METHODS: In this prospective study, pathogenetic testing and clinical data for hospitalized patients aged 60 years and older with respiratory tract infections were collected in five hospitals through stratified cluster sampling from December 2023 to May 2024. The epidemic characteristics, clinical features and costs of cases infected with RSV alone and with influenza alone were compared. RESULTS: Among the 1,894 cases included, the RSV positivity rate was 5.91% during the 2023-2024 winter-spring season, while the influenza positivity rate was 9.61%. RSV-B was the predominant subtype of RSV, and influenza A (primarily H3N2) was the dominant strain among the influenza-positive cases. Compared with cases infected with influenza virus alone, those infected with RSV alone had a lower frequency of fever (18.8% vs. 35.7%, P = 0.004), a higher frequency of lower respiratory tract infection complications (70.8% vs. 54.8%, P = 0.011), higher direct medical costs ($996.2 vs. $841.1, P = 0.017) and higher total costs ($1019.7 vs. $888.1, P = 0.036). RSV single infection was more common than influenza virus single infection in female cases (P = 0.007) and diabetic cases (P = 0.007). CONCLUSIONS: During the winter and spring months, RSV is the second most common pathogen after influenza virus among older adults hospitalized for respiratory infections in Suzhou, China. Patients infected with RSV are more likely to develop lower respiratory tract infection complications and to incur higher medical costs than those infected with influenza. RSV infection in the elderly deserves greater attention, especially in female and diabetic patients.

Subjects

Influenza, Human , Respiratory Syncytial Virus Infections , Humans , Respiratory Syncytial Virus Infections/epidemiology , Respiratory Syncytial Virus Infections/economics , Influenza, Human/epidemiology , Influenza, Human/economics , Influenza, Human/virology , Aged , Female , Male , China/epidemiology , Prospective Studies , Middle Aged , Aged, 80 and over , Hospitalization/statistics & numerical data , Seasons , Respiratory Syncytial Virus, Human/isolation & purification , Respiratory Tract Infections/epidemiology , Respiratory Tract Infections/virology

4.

Why can deep convolutional neural networks improve protein fold recognition? A visual explanation by interpretation.

Liu, Yan; Zhu, Yi-Heng; Song, Xiaoning; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(5), 2021 09 02.

Article in English | MEDLINE | ID: mdl-33537753

ABSTRACT

As an essential task in protein structure and function prediction, protein fold recognition has attracted increasing attention. The majority of the existing machine learning-based protein fold recognition approaches strongly rely on handcrafted features, which depict the characteristics of different protein folds; however, effective feature extraction methods still represent the bottleneck for further performance improvement of protein fold recognition. As a powerful feature extractor, deep convolutional neural network (DCNN) can automatically extract discriminative features for fold recognition without human intervention, which has demonstrated an impressive performance on protein fold recognition. Despite the encouraging progress, DCNN often acts as a black box, and as such, it is challenging for users to understand what really happens in DCNN and why it works well for protein fold recognition. In this study, we explore the intrinsic mechanism of DCNN and explain why it works for protein fold recognition using a visual explanation technique. More specifically, we first trained a VGGNet-based DCNN model, termed VGGNet-FE, which can extract fold-specific features from the predicted protein residue-residue contact map for protein fold recognition. Subsequently, based on the trained VGGNet-FE, we implemented a new contact-assisted predictor, termed VGGfold, for protein fold recognition; we then visualized what features were extracted by each of the convolutional layers in VGGNet-FE using a deconvolution technique. Furthermore, we visualized the high-level semantic information, termed fold-discriminative region, of a predicted contact map from the localization map obtained from the last convolutional layer of VGGNet-FE. It is visually confirmed that VGGNet-FE could effectively extract distinct fold-discriminative regions for different types of protein folds, thereby accounting for the improved performance of VGGfold for protein fold recognition. 
In summary, this study is of great significance for both understanding the working principle of DCNNs in protein fold recognition and exploring the relationship between the predicted protein contact map and protein tertiary structure. This proposed visualization method is flexible and applicable to address other DCNN-based bioinformatics and computational biology questions. The online web server of VGGfold is freely available at http://csbio.njust.edu.cn/bioinf/vggfold/.

Subjects

Computational Biology/methods , Machine Learning , Neural Networks, Computer , Protein Folding , Proteins/chemistry , Data Visualization , Humans , Protein Interaction Maps , Protein Structure, Tertiary , Proteins/metabolism , Semantics

5.

Improving protein fold recognition using triplet network and ensemble deep learning.

Liu, Yan; Han, Ke; Zhu, Yi-Heng; Zhang, Ying; Shen, Long-Chen; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(6), 2021 11 05.

Article in English | MEDLINE | ID: mdl-34226918

ABSTRACT

Protein fold recognition is a critical step toward protein structure and function prediction, aiming at providing the most likely fold type of the query protein. In recent years, the development of deep learning (DL) techniques has led to massive advances in this important field, and accordingly, the sensitivity of protein fold recognition has been dramatically improved. Most DL-based methods take an intermediate bottleneck layer as the feature representation of proteins with new fold types. However, this strategy is indirect, inefficient and conditional on the hypothesis that the bottleneck layer's output is a good representation of proteins with new fold types. To address the above problem, in this work, we develop a new computational framework by combining triplet network and ensemble DL. We first train a DL-based model, termed FoldNet, which employs triplet loss to train the deep convolutional network. FoldNet directly optimizes the protein fold embedding itself, placing proteins with the same fold type closer to each other than those with different fold types in the new protein embedding space. Subsequently, using the trained FoldNet, we implement a new residue-residue contact-assisted predictor, termed FoldTR, which improves protein fold recognition. Furthermore, we propose a new ensemble DL method, termed FSD_XGBoost, which combines protein fold embedding with the other two discriminative fold-specific features extracted by two DL-based methods SSAfold and DeepFR. The Top 1 sensitivity of FSD_XGBoost increases to 74.8% at the fold level, which is ~9% higher than that of the state-of-the-art method. Together, the results suggest that fold-specific features extracted by different DL methods complement each other, and their combination can further improve fold recognition at the fold level. The implemented web server of FoldTR and benchmark datasets are publicly available at http://csbio.njust.edu.cn/bioinf/foldtr/.

Subjects

Computational Biology/methods , Deep Learning , Models, Molecular , Protein Conformation , Protein Folding , Proteins/chemistry , Algorithms , Databases, Protein , Neural Networks, Computer , Reproducibility of Results , Sensitivity and Specificity
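
The triplet-loss idea that the abstract above attributes to FoldNet can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the margin value, and the toy three-dimensional embeddings are assumptions chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive (same fold) is closer to the
    anchor than the negative (different fold) by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings, not taken from any real protein.
anchor = [0.0, 0.0, 0.0]
same_fold = [0.1, 0.0, 0.0]
other_fold = [3.0, 0.0, 0.0]

satisfied = triplet_loss(anchor, same_fold, other_fold)  # constraint already met
violated = triplet_loss(anchor, other_fold, same_fold)   # roles swapped, so penalized
```

Minimizing this quantity over many triplets pulls same-fold embeddings together and pushes different-fold embeddings apart, which is the behaviour the abstract describes for the embedding space.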

6.

Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features.

Zhu, Yi-Heng; Hu, Jun; Ge, Fang; Li, Fuyi; Song, Jiangning; Zhang, Yang; Yu, Dong-Jun.

Brief Bioinform ; 22(3), 2021 05 20.

Article in English | MEDLINE | ID: mdl-32436937

ABSTRACT

X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors.
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.

Subjects

Computational Biology/methods , Crystallization/methods , Proteins/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Databases, Protein , Models, Chemical
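
The abstract above reports gains as relative increases in the Matthews correlation coefficient (MCC). The sketch below shows how MCC is computed from a binary confusion matrix and how a relative increase is expressed as a percentage. The confusion-matrix counts and the baseline/new MCC values are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_increase_percent(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)

# Hypothetical MCC values: going from 0.1 to 0.3 is a 200% relative increase.
example_gain = relative_increase_percent(new=0.3, baseline=0.1)
```

Read this way, a relative increase of 199.7% means the new MCC is roughly three times the baseline value, not that the MCC itself exceeds 1.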

7.

TGSA: protein-protein association-based twin graph neural networks for drug response prediction with similarity augmentation.

Zhu, Yiheng; Ouyang, Zhenqiu; Chen, Wenbo; Feng, Ruiwei; Chen, Danny Z; Cao, Ji; Wu, Jian.

Bioinformatics ; 38(2): 461-468, 2022 01 03.

Article in English | MEDLINE | ID: mdl-34559177

ABSTRACT

MOTIVATION: Drug response prediction (DRP) plays an important role in precision medicine (e.g. for cancer analysis and treatment). Recent advances in deep learning algorithms make it possible to predict drug responses accurately based on genetic profiles. However, existing methods ignore the potential relationships among genes. In addition, similarity among cell lines/drugs was rarely considered explicitly. RESULTS: We propose a novel DRP framework, called TGSA, to make better use of prior domain knowledge. TGSA consists of Twin Graph neural networks for Drug Response Prediction (TGDRP) and a Similarity Augmentation (SA) module to fuse fine-grained and coarse-grained information. Specifically, TGDRP abstracts cell lines as graphs based on STRING protein-protein association networks and uses Graph Neural Networks (GNNs) for representation learning. SA views DRP as an edge regression problem on a heterogeneous graph and utilizes GNNs to smooth the representations of similar cell lines/drugs. Besides, we introduce an auxiliary pre-training strategy to remedy the identified limitations of scarce data and poor out-of-distribution generalization. Extensive experiments on the GDSC2 dataset demonstrate that our TGSA consistently outperforms all the state-of-the-art baselines under various experimental settings. We further evaluate the effectiveness and contributions of each component of TGSA via ablation experiments. The promising performance of TGSA shows enormous potential for clinical applications in precision medicine. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/violet-sto/TGSA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Neoplasms , Neural Networks, Computer , Humans , Algorithms , Software , Precision Medicine , Proteins

8.

GCmapCrys: Integrating graph attention network with predicted contact map for multi-stage protein crystallization propensity prediction.

Wang, Peng-Hao; Zhu, Yi-Heng; Yang, Xibei; Yu, Dong-Jun.

Anal Biochem ; 663: 115020, 2023 02 15.

Article in English | MEDLINE | ID: mdl-36521558

ABSTRACT

X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we proposed a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction through integrating graph attention network with predicted protein contact map. Experimental results on 1548 proteins with known crystallization records demonstrated that GCmapCrys increased the value of the Matthews correlation coefficient by 37.0% on average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of GCmapCrys lie in the efficiency of the graph attention network with predicted contact map, which effectively associates the residue-interaction knowledge with crystallization pattern. Meanwhile, the four designed sequence-based features are complementary and further enhance crystallization propensity prediction.

Subjects

Computational Biology , Proteins , Crystallization/methods , Proteins/chemistry , Crystallography, X-Ray , Computational Biology/methods

9.

Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction.

Zhu, Yi-Heng; Zhang, Chengxin; Yu, Dong-Jun; Zhang, Yang.

PLoS Comput Biol ; 18(12): e1010793, 2022 12.

Article in English | MEDLINE | ID: mdl-36548439

ABSTRACT

Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.

Subjects

Computational Biology , Proteins , Gene Ontology , Computational Biology/methods , Proteins/genetics , Proteins/metabolism , Neural Networks, Computer , Language

10.

TargetDBP+: Enhancing the Performance of Identifying DNA-Binding Proteins via Weighted Convolutional Features.

Hu, Jun; Rao, Liang; Zhu, Yi-Heng; Zhang, Gui-Jun; Yu, Dong-Jun.

J Chem Inf Model ; 61(1): 505-515, 2021 01 25.

Article in English | MEDLINE | ID: mdl-33410688

ABSTRACT

Protein-DNA interactions exist ubiquitously and play important roles in the life cycles of living cells. The accurate identification of DNA-binding proteins (DBPs) is one of the key steps to understand the mechanisms of protein-DNA interactions. Although many DBP identification methods have been proposed, the current performance is still unsatisfactory. In this study, a new method, called TargetDBP+, is developed to further enhance the performance of identifying DBPs. In TargetDBP+, five convolutional features are first extracted from five feature sources, i.e., amino acid one-hot matrix (AAOHM), position-specific scoring matrix (PSSM), predicted secondary structure probability matrix (PSSPM), predicted solvent accessibility probability matrix (PSAPM), and predicted probabilities of DNA-binding sites (PPDBSs); second, the five features are weightedly and serially combined using the weights of all of the elements learned by the differential evolution algorithm; and finally, the DBP identification model of TargetDBP+ is trained using the support vector machine (SVM) algorithm. To evaluate the developed TargetDBP+ and compare it with other existing methods, a new gold-standard benchmark data set, called UniSwiss, is constructed, which consists of 4881 DBPs and 4881 non-DBPs extracted from the UniprotKB/Swiss-Prot database. Experimental results demonstrate that TargetDBP+ can obtain an accuracy of 85.83% and precision of 88.45% covering 82.41% of all DBP data on the independent validation subset of UniSwiss, with the MCC value (0.718) being significantly higher than those of other state-of-the-art control methods. The web server of TargetDBP+ is accessible at http://csbio.njust.edu.cn/bioinf/targetdbpplus/; the UniSwiss data set and stand-alone program of TargetDBP+ are accessible at https://github.com/jun-csbio/TargetDBPplus.

Subjects

DNA-Binding Proteins , Support Vector Machine , Algorithms , Binding Sites , DNA-Binding Proteins/metabolism , Databases, Protein , Position-Specific Scoring Matrices

11.

SSCpred: Single-Sequence-Based Protein Contact Prediction Using Deep Fully Convolutional Network.

Chen, Ming-Cai; Li, Yang; Zhu, Yi-Heng; Ge, Fang; Yu, Dong-Jun.

J Chem Inf Model ; 60(6): 3295-3303, 2020 06 22.

Article in English | MEDLINE | ID: mdl-32338512

ABSTRACT

There has been a significant improvement in protein residue contact prediction in recent years. Nevertheless, state-of-the-art methods still show deficiencies in the contact prediction of proteins with low-homology information. These top methods depend largely on statistical features that are derived from homologous sequences, but previous studies, along with our analyses, show that they are insufficient for inferring an accurate contact map for nonhomology protein targets. To compensate, we proposed a brand new single-sequence-based contact predictor (SSCpred) that performs prediction through the deep fully convolutional network (Deep FCN) with only the target sequence itself, i.e., without additional homology information. The proposed pipeline makes good use of the target sequence by utilizing the pair-wise encoding technique and Deep FCN. Experimental results demonstrated that SSCpred can produce accurate predictions based on the efficient pipeline. Compared with several most recent methods, SSCpred achieves competitive performance on nonhomology targets. Overall, we explored the possibilities of single-sequence-based contact prediction and designed a novel pipeline without using a complex and redundant feature set. The proposed SSCpred can compensate for current methods' disadvantages and achieves better performance on the nonhomology targets. The web server of SSCpred is freely available at http://csbio.njust.edu.cn/bioinf/sscpred/.

Subjects

Computational Biology , Proteins , Algorithms , Proteins/genetics
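
The abstract above mentions a pair-wise encoding of the target sequence but does not spell it out. One common way to build such an input, sketched here purely as an assumption, is to one-hot encode each residue and concatenate the vectors of residues i and j, giving an L x L x 40 feature map for a sequence of length L. The function names and the 20-letter alphabet ordering are illustrative, not SSCpred's actual code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def one_hot(residue):
    """20-dimensional one-hot vector; unknown residues map to all zeros."""
    vec = [0] * len(AMINO_ACIDS)
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:
        vec[idx] = 1
    return vec


def pairwise_encode(sequence):
    """Build an L x L grid where cell (i, j) concatenates the one-hot
    vectors of residue i and residue j (40 values per cell)."""
    per_residue = [one_hot(r) for r in sequence]
    length = len(sequence)
    return [
        [per_residue[i] + per_residue[j] for j in range(length)]
        for i in range(length)
    ]


# Toy three-residue sequence: the result has shape 3 x 3 x 40.
features = pairwise_encode("ACD")
```

A grid of this shape can be fed to a fully convolutional network that outputs one contact score per residue pair, which matches the single-sequence setting described in the abstract.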

12.

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines.

Zhu, Yi-Heng; Hu, Jun; Song, Xiao-Ning; Yu, Dong-Jun.

J Chem Inf Model ; 59(6): 3057-3071, 2019 06 24.

Article in English | MEDLINE | ID: mdl-30943723

ABSTRACT

Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. 
The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.

Subjects

DNA-Binding Proteins/metabolism , DNA/metabolism , Models, Molecular , Support Vector Machine , DNA/chemistry , DNA-Binding Proteins/chemistry , Nucleic Acid Conformation , Protein Conformation
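
The abstract above says that HD-US selects samples by their distance to the SVM's separating hyperplane, without giving the selection rule. The sketch below computes that distance for a linear hyperplane w·x + b = 0 and then applies one hypothetical rule: keep the majority-class samples closest to the hyperplane. The rule, the function names, and the toy data are assumptions for illustration, not the published algorithm.

```python
import math


def hyperplane_distance(x, w, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def undersample_by_distance(majority, w, b, keep):
    """Hypothetical rule: retain the `keep` majority-class samples that
    lie nearest to the hyperplane and discard the rest."""
    ranked = sorted(majority, key=lambda x: hyperplane_distance(x, w, b))
    return ranked[:keep]


# Toy linear hyperplane x1 = 1 and four negative-class (nonbinding) samples.
w, b = [1.0, 0.0], -1.0
negatives = [[5.0, 0.0], [1.5, 2.0], [0.0, 1.0], [-3.0, 0.0]]

subset = undersample_by_distance(negatives, w, b, keep=2)
```

In the two-stage scheme the abstract describes, a step like this would be repeated to produce several subsets, each used to train one SVM before the ensemble stage.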

13.

Improving Antifreeze Proteins Prediction with Protein Language Models and Hybrid Feature Extraction Networks.

Wu, Jiashun; Liu, Yan; Zhu, Yiheng; Yu, Dong-Jun.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Sep 24.

Article in English | MEDLINE | ID: mdl-39316498

ABSTRACT

Accurate identification of antifreeze proteins (AFPs) is crucial in developing biomimetic synthetic anti-icing materials and low-temperature organ preservation materials. Although numerous machine learning-based methods have been proposed for AFPs prediction, the complex and diverse nature of AFPs limits the prediction performance of existing methods. In this study, we propose AFP-Deep, a new deep learning method to predict antifreeze proteins by integrating embedding from protein sequences with pre-trained protein language models and evolutionary contexts with hybrid feature extraction networks. The experimental results demonstrated that the main advantage of AFP-Deep is its utilization of pre-trained protein language models, which can extract discriminative global contextual features from protein sequences. Additionally, the hybrid deep neural networks designed for protein language models and evolutionary context feature extraction enhance the correlation between embeddings and antifreeze pattern. The performance evaluation results show that AFP-Deep achieves superior performance compared to state-of-the-art models on benchmark datasets, achieving an AUPRC of 0.724 and 0.924, respectively.

14.

A unique cell division protein critical for the assembly of the bacterial divisome.

Chu, Xiao; Wang, Lidong; Zhu, Yiheng; Feng, Zhengshan; Guan, Qingtian; Song, Lei; Luo, Zhaoqing.

Elife ; 12, 2024 Oct 03.

Article in English | MEDLINE | ID: mdl-39361022

ABSTRACT

Identification of unique essential bacterial genes is important not only for understanding their cell biology but also for developing new antimicrobials. Here, we report a previously unrecognized core component of the Acinetobacter baumannii divisome. Our results reveal that the protein, termed Aeg1, interacts with multiple cell division proteins, including FtsN, which is required for components of the divisome to localize to the midcell. We demonstrate that the FtsAE202K and FtsBE65A mutants effectively bypassed the need for Aeg1 in A. baumannii, as did the activation variants FtsWM254I and FtsWS274G. Our results suggest that Aeg1 is a cell division protein that arrives at the division site to initiate cell division by recruiting FtsN, which activates FtsQLB and FtsA to induce the septal peptidoglycan synthase FtsWI. The discovery of this new essential cell division protein provides a new target for the development of antibacterial agents.

Subjects

Acinetobacter baumannii , Bacterial Proteins , Cell Division , Bacterial Proteins/metabolism , Bacterial Proteins/genetics , Acinetobacter baumannii/metabolism , Acinetobacter baumannii/genetics , Cell Cycle Proteins/metabolism , Cell Cycle Proteins/genetics , Membrane Proteins/metabolism , Membrane Proteins/genetics

15.

BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-methyladenosine Sites.

Xia, Yunpeng; Zhang, Ying; Liu, Dian; Zhu, Yi-Heng; Wang, Zhikang; Song, Jiangning; Yu, Dong-Jun.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Jun 24.

Article in English | MEDLINE | ID: mdl-38913512

ABSTRACT

RNA N6-methyladenosine is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, with most of them ignoring the correlations of different encoding strategies and failing to explore the adaptability of various attention mechanisms for methylation identification. To solve the above issues, we proposed an innovative framework for predicting RNA m6A modification site, termed BLAM6A-Merge. Specifically, it utilized a multimodal feature fusion strategy to combine the classification results of four features and Blastn tool. Apart from this, different attention mechanisms were employed for extracting higher-level features on specific features after the screening process. Extensive experiments on 12 benchmarking datasets demonstrated that BLAM6A-Merge achieved superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, the Blastn tool was employed for the first time in the identification of methylation sites. The data and code can be accessed at https://github.com/DoraemonXia/BLAM6A-Merge.

16.

Integrating unsupervised language model with multi-view multiple sequence alignments for high-accuracy inter-chain contact prediction.

Liu, Zi; Zhu, Yi-Heng; Shen, Long-Chen; Xiao, Xuan; Qiu, Wang-Ren; Yu, Dong-Jun.

Comput Biol Med ; 166: 107529, 2023 Sep 20.

Article in English | MEDLINE | ID: mdl-37748220

ABSTRACT

Accurate identification of inter-chain contacts in the protein complex is critical to determine the corresponding 3D structures and understand the biological functions. We proposed a new deep learning method, ICCPred, to deduce the inter-chain contacts from the amino acid sequences of the protein complex. This pipeline was built on the designed deep residual network architecture, integrating the pre-trained language model with three multiple sequence alignments (MSAs) from different biological views. Experimental results on 709 non-redundant benchmarking protein complexes showed that the proposed ICCPred significantly increased inter-chain contact prediction accuracy compared to the state-of-the-art approaches. Detailed data analyses showed that the significant advantage of ICCPred lies in the utilization of pre-trained transformer language models which can effectively extract the complementary co-evolution diversity from three MSAs. Meanwhile, the designed deep residual network enhances the correlation between the co-evolution diversity and the patterns of inter-chain contacts. These results demonstrated a new avenue for high-accuracy deep-learning inter-chain contact prediction that is applicable to large-scale protein-protein interaction annotations from sequence alone.

17.

Spatial and Temporal Evolutionary Characteristics and Its Influencing Factors of Economic Spatial Polarization in the Yangtze River Delta Region.

Zhu, Yiheng; Yang, Shan; Lin, Jinping; Yin, Shanggang.

Int J Environ Res Public Health ; 19(12)2022 06 07.

Article in English | MEDLINE | ID: mdl-35742246

ABSTRACT

Economic spatial polarization is a manifestation of unbalanced urban development. To study the unbalanced development of Chinese cities, this paper selects 41 cities in the Yangtze River Delta (YRD) region, introduces the polarization index and exploratory spatio-temporal analysis to portray their spatio-temporal evolution process, and analyzes the differences in spatial polarization patterns of economic development in three dimensions: economic quantity, quality, and structure. Finally, we use the geographic detector model to explore the driving factors and then propose corresponding policy recommendations. The results show that: (1) the degree of difference in economic development in the YRD region narrowed from 2000 to 2019, and the spatial polarization level of urban economic development showed a fluctuating downward trend, while that of the economic structure dimension has been increasing. (2) In terms of spatial distribution, the "Yangtze River Delta urban agglomeration" has become the peak contiguous zone of economic spatial polarization in the YRD region; the spatial polarization of the economic quantity and quality dimensions has formed a "polycentric" pattern, while the spatial polarization of the economic structure dimension shows a stable "one core, multiple sub-centers" distribution. (3) From the evolution of spatial polarization, most cities have strong spatial locking characteristics without a transition. Spatially positively polarized cities are concentrated in the YRD urban agglomeration, where inter-city neighboring relations are mainly positive synergistic growth, while the negatively polarized cities are mostly distributed in the peripheral areas of the YRD, where neighboring relations are negative synergistic growth. At the same time, the spatially positive polarization effect of the economic quantity dimension and the spatially negative polarization effect of the economic structure dimension among cities are more significant. (4) The economic spatial polarization in the YRD region is mainly dominated by market prosperity and urbanization level, while the driving effect of scientific and technological innovation on the urban economy has also been expanding in recent years. Promoting the reasonable allocation of marketization, urbanization, and technology among cities with positive and negative spatial polarization will contribute to balanced, coordinated, and orderly urban and regional economic development.

Subjects

Rivers , Urbanization , China , Cities , Economic Development

18.

TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble.

Ge, Fang; Hu, Jun; Zhu, Yi-Heng; Arif, Muhammad; Yu, Dong-Jun.

Comb Chem High Throughput Screen ; 25(1): 38-52, 2022.

Article in English | MEDLINE | ID: mdl-33280588

ABSTRACT

AIM AND OBJECTIVE: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. MATERIALS AND METHODS: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. RESULTS: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.

Subjects

Algorithms , Missense Mutation , Humans , Proteins/chemistry , Software

19.

TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction.

Zhu, Yi-Heng; Zhang, Chengxin; Liu, Yan; Omenn, Gilbert S; Freddolino, Peter L; Yu, Dong-Jun; Zhang, Yang.

Genomics Proteomics Bioinformatics ; 20(5): 1013-1027, 2022 10.

Article in English | MEDLINE | ID: mdl-35568117

ABSTRACT

Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.

Subjects

Computational Biology , Proteins , Animals , Mice , Rats , Humans , Proteins/metabolism , Molecular Sequence Annotation , Amino Acid Sequence , Sequence Alignment , Computational Biology/methods

20.

MutTMPredictor: Robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins.

Ge, Fang; Zhu, Yi-Heng; Xu, Jian; Muhammad, Arif; Song, Jiangning; Yu, Dong-Jun.

Comput Struct Biotechnol J ; 19: 6400-6416, 2021.

Article in English | MEDLINE | ID: mdl-34938415

ABSTRACT

Transmembrane proteins have critical biological functions and play a role in a multitude of cellular processes, including cell signaling and the transport of molecules and ions across membranes. Approximately 60% of transmembrane proteins are considered drug targets. Missense mutations in such proteins can lead to many diverse diseases and disorders, such as neurodegenerative diseases and cystic fibrosis. However, there are limited studies on mutations in transmembrane proteins. In this work, we first design a new feature encoding method, termed weight attenuation position-specific scoring matrix (WAPSSM), which builds upon protein evolutionary information. Then, we propose a new mutation prediction algorithm (cascade XGBoost) by leveraging ideas from consensus predictors and gcForest. Multi-level experiments illustrate the effectiveness of the WAPSSM and cascade XGBoost algorithms. Finally, based on WAPSSM and three other types of features, in combination with the cascade XGBoost algorithm, we develop a new transmembrane protein mutation predictor, named MutTMPredictor. We benchmark the performance of MutTMPredictor against several existing predictors on seven datasets. On the 546-mutation dataset, MutTMPredictor achieves an accuracy (ACC) of 0.9661 and a Matthews correlation coefficient (MCC) of 0.8950. On the 67,584 dataset, MutTMPredictor achieves an MCC of 0.7523 and an area under the curve (AUC) of 0.8746, which are 0.1625 and 0.0801 higher, respectively, than those of the best existing predictor (fathmm). In addition, MutTMPredictor outperforms two specific predictors on the Pred-MutHTP datasets. The results suggest that MutTMPredictor can be used as an effective method for predicting and prioritizing missense mutations in transmembrane proteins. The MutTMPredictor webserver and datasets are freely accessible at http://csbio.njust.edu.cn/bioinf/muttmpredictor/ for academic use.
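
For readers unfamiliar with the ACC and MCC figures quoted in this abstract, the sketch below shows how the two metrics are conventionally computed from binary confusion-matrix counts. It is an illustration of the standard definitions only, not the authors' evaluation code. The function names and the example counts are hypothetical, and returning 0.0 when the MCC denominator is zero is a common convention assumed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of predictions that are correct (ACC)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes (for example pathogenic and neutral mutations) are imbalanced.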
