Pesquisa | BVS Educação Profissional em Saúde

MLm5C: A high-precision human RNA 5-methylcytosine sites predictor based on a combination of hybrid machine learning models.

Kurata, Hiroyuki; Harun-Or-Roshid, Md; Mehedi Hasan, Md; Tsukiyama, Sho; Maeda, Kazuhiro; Manavalan, Balachandran.

Methods ; 227: 37-47, 2024 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-38729455

RESUMO

RNA modification serves as a pivotal component in numerous biological processes. Among the prevalent modifications, 5-methylcytosine (m5C) significantly influences mRNA export, translation efficiency and cell differentiation and are also associated with human diseases, including Alzheimer's disease, autoimmune disease, cancer, and cardiovascular diseases. Identification of m5C is critically responsible for understanding the RNA modification mechanisms and the epigenetic regulation of associated diseases. However, the large-scale experimental identification of m5C present significant challenges due to labor intensity and time requirements. Several computational tools, using machine learning, have been developed to supplement experimental methods, but identifying these sites lack accuracy and efficiency. In this study, we introduce a new predictor, MLm5C, for precise prediction of m5C sites using sequence data. Briefly, we evaluated eleven RNA sequence-derived features with four basic machine learning algorithms to generate baseline models. From these 44 models, we ranked them based on their performance and subsequently stacked the Top 20 baseline models as the best model, named MLm5C. The MLm5C outperformed the-state-of-the-art predictors. Notably, the optimization of the sequence length surrounding the modification sites significantly improved the prediction performance. MLm5C is an invaluable tool in accelerating the detection of m5C sites within the human genome, thereby facilitating in the characterization of their roles in post-transcriptional regulation.

Assuntos

5-Metilcitosina , Aprendizado de Máquina , RNA , Humanos , 5-Metilcitosina/metabolismo , 5-Metilcitosina/química , RNA/genética , RNA/química , RNA/metabolismo , Biologia Computacional/métodos , Processamento Pós-Transcricional do RNA , Algoritmos

iACVP: markedly enhanced identification of anti-coronavirus peptides using a dataset-specific word2vec model.

Kurata, Hiroyuki; Tsukiyama, Sho; Manavalan, Balachandran.

Brief Bioinform ; 23(4)2022 07 18.

Artigo em Inglês | MEDLINE | ID: mdl-35772910

RESUMO

The COVID-19 pandemic caused several million deaths worldwide. Development of anti-coronavirus drugs is thus urgent. Unlike conventional non-peptide drugs, antiviral peptide drugs are highly specific, easy to synthesize and modify, and not highly susceptible to drug resistance. To reduce the time and expense involved in screening thousands of peptides and assaying their antiviral activity, computational predictors for identifying anti-coronavirus peptides (ACVPs) are needed. However, few experimentally verified ACVP samples are available, even though a relatively large number of antiviral peptides (AVPs) have been discovered. In this study, we attempted to predict ACVPs using an AVP dataset and a small collection of ACVPs. Using conventional features, a binary profile and a word-embedding word2vec (W2V), we systematically explored five different machine learning methods: Transformer, Convolutional Neural Network, bidirectional Long Short-Term Memory, Random Forest (RF) and Support Vector Machine. Via exhaustive searches, we found that the RF classifier with W2V consistently achieved better performance on different datasets. The two main controlling factors were: (i) the dataset-specific W2V dictionary was generated from the training and independent test datasets instead of the widely used general UniProt proteome and (ii) a systematic search was conducted and determined the optimal k-mer value in W2V, which provides greater discrimination between positive and negative samples. Therefore, our proposed method, named iACVP, consistently provides better prediction performance compared with existing state-of-the-art methods. To assist experimentalists in identifying putative ACVPs, we implemented our model as a web server accessible via the following link: http://kurata35.bio.kyutech.ac.jp/iACVP.

Assuntos

Tratamento Farmacológico da COVID-19 , Pandemias , Antivirais/farmacologia , Humanos , Aprendizado de Máquina , Peptídeos

BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches.

Tsukiyama, Sho; Hasan, Md Mehedi; Deng, Hong-Wen; Kurata, Hiroyuki.

Brief Bioinform ; 23(2)2022 03 10.

Artigo em Inglês | MEDLINE | ID: mdl-35225328

RESUMO

N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, regulation of gene expression. Several experimental methods were used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect the 6mA and complement these shortcomings of experimental methods, we proposed a novel, deep leaning approach called BERT6mA. To compare the BERT6mA with other deep learning approaches, we used the benchmark datasets including 11 species. The BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher and comparable performance with the state-of-the-art models while the BERT6mA showed poor performances in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to the BERT6mA. The pretrained and fine-tuned models on specific species presented higher performances than other models even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the BERT6mA model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.

Assuntos

Aprendizado Profundo , DNA/genética , Metilação de DNA , Software

LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec.

Tsukiyama, Sho; Hasan, Md Mehedi; Fujii, Satoshi; Kurata, Hiroyuki.

Brief Bioinform ; 22(6)2021 11 05.

Artigo em Inglês | MEDLINE | ID: mdl-34160596

RESUMO

Viral infection involves a large number of protein-protein interactions (PPIs) between human and virus. The PPIs range from the initial binding of viral coat proteins to host membrane receptors to the hijacking of host transcription machinery. However, few interspecies PPIs have been identified, because experimental methods including mass spectrometry are time-consuming and expensive, and molecular dynamic simulation is limited only to the proteins whose 3D structures are solved. Sequence-based machine learning methods are expected to overcome these problems. We have first developed the LSTM model with word2vec to predict PPIs between human and virus, named LSTM-PHV, by using amino acid sequences alone. The LSTM-PHV effectively learnt the training data with a highly imbalanced ratio of positive to negative samples and achieved AUCs of 0.976 and 0.973 and accuracies of 0.984 and 0.985 on the training and independent datasets, respectively. In predicting PPIs between human and unknown or new virus, the LSTM-PHV learned greatly outperformed the existing state-of-the-art PPI predictors. Interestingly, learning of only sequence contexts as words is sufficient for PPI prediction. Use of uniform manifold approximation and projection demonstrated that the LSTM-PHV clearly distinguished the positive PPI samples from the negative ones. We presented the LSTM-PHV online web server and support data that are freely available at http://kurata35.bio.kyutech.ac.jp/LSTM-PHV.

Assuntos

Biologia Computacional/métodos , Interações Hospedeiro-Patógeno , Mapeamento de Interação de Proteínas/métodos , Software , Proteínas Virais/metabolismo , Viroses/metabolismo , Viroses/virologia , Algoritmos , Sequência de Aminoácidos , Benchmarking , Bases de Dados de Proteínas , Aprendizado Profundo , Humanos , Domínios e Motivos de Interação entre Proteínas , Mapas de Interação de Proteínas , Reprodutibilidade dos Testes , Navegador

Deepm5C: A deep-learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy.

Hasan, Md Mehedi; Tsukiyama, Sho; Cho, Jae Youl; Kurata, Hiroyuki; Alam, Md Ashad; Liu, Xiaowen; Manavalan, Balachandran; Deng, Hong-Wen.

Mol Ther ; 30(8): 2856-2867, 2022 08 03.

Artigo em Inglês | MEDLINE | ID: mdl-35526094

RESUMO

As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C) plays an essential role in various cellular processes and disease pathogenesis. Therefore, it is important accurately identify m5C modifications in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models have been developed using small training datasets. Hence, their practical application is quite limited in genome-wide detection. To overcome the existing limitations, we propose Deepm5C, a bioinformatics method for identifying RNA m5C sites throughout the human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature-encoding algorithms and a feature derived from word-embedding approaches. Afterward, four variants of deep-learning classifiers and four commonly used conventional classifiers were employed and trained with the four encodings, ultimately obtaining 32 baseline models. A stacking strategy is effectively utilized by integrating the predicted output of the optimal baseline models and trained with a one-dimensional (1D) convolutional neural network. As a result, the Deepm5C predictor achieved excellent performance during cross-validation with a Matthews correlation coefficient and an accuracy of 0.697 and 0.855, respectively. The corresponding metrics during the independent test were 0.691 and 0.852, respectively. Overall, Deepm5C achieved a more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, Deepm5C is expected to assist community-wide efforts in identifying putative m5Cs and to formulate the novel testable biological hypothesis.

Assuntos

Aprendizado Profundo , RNA , Algoritmos , Biologia Computacional/métodos , Humanos , Aprendizado de Máquina , RNA/genética

PredIL13: Stacking a variety of machine and deep learning methods with ESM-2 language model for identifying IL13-inducing peptides.

Kurata, Hiroyuki; Harun-Or-Roshid, Md; Tsukiyama, Sho; Maeda, Kazuhiro.

PLoS One ; 19(8): e0309078, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-39172871

RESUMO

Interleukin (IL)-13 has emerged as one of the recently identified cytokine. Since IL-13 causes the severity of COVID-19 and alters crucial biological processes, it is urgent to explore novel molecules or peptides capable of including IL-13. Computational prediction has received attention as a complementary method to in-vivo and in-vitro experimental identification of IL-13 inducing peptides, because experimental identification is time-consuming, laborious, and expensive. A few computational tools have been presented, including the IL13Pred and iIL13Pred. To increase prediction capability, we have developed PredIL13, a cutting-edge ensemble learning method with the latest ESM-2 protein language model. This method stacked the probability scores outputted by 168 single-feature machine/deep learning models, and then trained a logistic regression-based meta-classifier with the stacked probability score vectors. The key technology was to implement ESM-2 and to select the optimal single-feature models according to their absolute weight coefficient for logistic regression (AWCLR), an indicator of the importance of each single-feature model. Especially, the sequential deletion of single-feature models based on the iterative AWCLR ranking (SDIWC) method constructed the meta-classifier consisting of the top 16 single-feature models, named PredIL13, while considering the model's accuracy. The PredIL13 greatly outperformed the-state-of-the-art predictors, thus is an invaluable tool for accelerating the detection of IL13-inducing peptide within the human genome.

Assuntos

Aprendizado Profundo , Interleucina-13 , Peptídeos , Humanos , Biologia Computacional/métodos , COVID-19/imunologia , Interleucina-13/metabolismo , Modelos Logísticos , Aprendizado de Máquina , Software

CNN6mA: Interpretable neural network model based on position-specific CNN and cross-interactive network for 6mA site prediction.

Tsukiyama, Sho; Hasan, Md Mehedi; Kurata, Hiroyuki.

Comput Struct Biotechnol J ; 21: 644-654, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-36659917

RESUMO

N6-methyladenine (6mA) plays a critical role in various epigenetic processing including DNA replication, DNA repair, silencing, transcription, and diseases such as cancer. To understand such epigenetic mechanisms, 6 mA has been detected by high-throughput technologies on a genome-wide scale at single-base resolution, together with conventional methods such as immunoprecipitation, mass spectrometry and capillary electrophoresis, but these experimental approaches are time-consuming and laborious. To complement these problems, we have developed a CNN-based 6 mA site predictor, named CNN6mA, which proposed two new architectures: a position-specific 1-D convolutional layer and a cross-interactive network. In the position-specific 1-D convolutional layer, position-specific filters with different window sizes were applied to an inquiry sequence instead of sharing the same filters over all positions in order to extract the position-specific features at different levels. The cross-interactive network explored the relationships between all the nucleotide patterns within the inquiry sequence. Consequently, CNN6mA outperformed the existing state-of-the-art models in many species and created the contribution score vector that intelligibly interpret the prediction mechanism. The source codes and web application in CNN6mA are freely accessible at https://github.com/kuratahiroyuki/CNN6mA.git and http://kurata35.bio.kyutech.ac.jp/CNN6mA/, respectively.

Cross-attention PHV: Prediction of human and virus protein-protein interactions using cross-attention-based neural networks.

Tsukiyama, Sho; Kurata, Hiroyuki.

Comput Struct Biotechnol J ; 20: 5564-5573, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36249566

RESUMO

Viral infections represent a major health concern worldwide. The alarming rate at which SARS-CoV-2 spreads, for example, led to a worldwide pandemic. Viruses incorporate genetic material into the host genome to hijack host cell functions such as the cell cycle and apoptosis. In these viral processes, protein-protein interactions (PPIs) play critical roles. Therefore, the identification of PPIs between humans and viruses is crucial for understanding the infection mechanism and host immune responses to viral infections and for discovering effective drugs. Experimental methods including mass spectrometry-based proteomics and yeast two-hybrid assays are widely used to identify human-virus PPIs, but these experimental methods are time-consuming, expensive, and laborious. To overcome this problem, we developed a novel computational predictor, named cross-attention PHV, by implementing two key technologies of the cross-attention mechanism and a one-dimensional convolutional neural network (1D-CNN). The cross-attention mechanisms were very effective in enhancing prediction and generalization abilities. Application of 1D-CNN to the word2vec-generated feature matrices reduced computational costs, thus extending the allowable length of protein sequences to 9000 amino acid residues. Cross-attention PHV outperformed existing state-of-the-art models using a benchmark dataset and accurately predicted PPIs for unknown viruses. Cross-attention PHV also predicted human-SARS-CoV-2 PPIs with area under the curve values >0.95. The Cross-attention PHV web server and source codes are freely available at https://kurata35.bio.kyutech.ac.jp/Cross-attention_PHV/ and https://github.com/kuratahiroyuki/Cross-Attention_PHV, respectively.

ICAN: Interpretable cross-attention network for identifying drug and target protein interactions.

Kurata, Hiroyuki; Tsukiyama, Sho.

PLoS One ; 17(10): e0276609, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-36279284

RESUMO

Drug-target protein interaction (DTI) identification is fundamental for drug discovery and drug repositioning, because therapeutic drugs act on disease-causing proteins. However, the DTI identification process often requires expensive and time-consuming tasks, including biological experiments involving large numbers of candidate compounds. Thus, a variety of computation approaches have been developed. Of the many approaches available, chemo-genomics feature-based methods have attracted considerable attention. These methods compute the feature descriptors of drugs and proteins as the input data to train machine and deep learning models to enable accurate prediction of unknown DTIs. In addition, attention-based learning methods have been proposed to identify and interpret DTI mechanisms. However, improvements are needed for enhancing prediction performance and DTI mechanism elucidation. To address these problems, we developed an attention-based method designated the interpretable cross-attention network (ICAN), which predicts DTIs using the Simplified Molecular Input Line Entry System of drugs and amino acid sequences of target proteins. We optimized the attention mechanism architecture by exploring the cross-attention or self-attention, attention layer depth, and selection of the context matrixes from the attention mechanism. We found that a plain attention mechanism that decodes drug-related protein context features without any protein-related drug context features effectively achieved high performance. The ICAN outperformed state-of-the-art methods in several metrics on the DAVIS dataset and first revealed with statistical significance that some weighted sites in the cross-attention weight matrix represent experimental binding sites, thus demonstrating the high interpretability of the results. The program is freely available at https://github.com/kuratahiroyuki/ICAN.

Assuntos

Descoberta de Drogas , Proteínas , Simulação por Computador , Proteínas/metabolismo , Descoberta de Drogas/métodos , Sequência de Aminoácidos , Reposicionamento de Medicamentos , Interações Medicamentosas

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA