Results 1 - 20 of 35
1.
BMC Genomics ; 25(1): 531, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38816689

ABSTRACT

Non-coding RNAs (ncRNAs) are recognized as pivotal players in the regulation of essential physiological processes such as nutrient homeostasis, development, and stress responses in plants. Common methods for predicting ncRNAs are susceptible to significant effects of experimental conditions and computational methods, resulting in the need for significant investment of time and resources. Therefore, we constructed an ncRNA predictor (MFPINC) to predict potential ncRNAs in plants, based on the PINC tool proposed in our previous studies. Specifically, sequence features were carefully refined using variance thresholding and F-test methods, while deep features were extracted and feature fusion was performed by applying the GRU model. The comprehensive evaluation on multiple standard datasets shows that MFPINC not only achieves more comprehensive and accurate identification of gene sequences, but also significantly improves the expressive and generalization performance of the model, and that MFPINC significantly outperforms the existing competing methods in ncRNA identification. The tool is available on GitHub (https://github.com/Zhenj-Nie/MFPINC), where the data and source code can be downloaded for free.


Subjects
Computational Biology , Plant RNA , Untranslated RNA , Untranslated RNA/genetics , Plant RNA/genetics , Computational Biology/methods , Software , Plants/genetics , Algorithms , RNA Sequence Analysis/methods
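
The variance-thresholding step named in the MFPINC abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the threshold value and the toy feature matrix are assumptions made for illustration, and the F-test step is not shown.

```python
from statistics import pvariance


def variance_threshold(rows, threshold=0.0):
    """Return indices of feature columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per feature column
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def select_columns(rows, keep):
    """Keep only the feature columns listed in `keep`."""
    return [[row[j] for j in keep] for row in rows]


# Hypothetical feature matrix: 4 sequences x 3 features.
X = [
    [0.10, 1.0, 0.50],
    [0.40, 1.0, 0.52],
    [0.90, 1.0, 0.49],
    [0.30, 1.0, 0.51],
]

# Column 1 is constant and column 2 barely varies, so only column 0 survives.
kept = variance_threshold(X, threshold=0.01)
X_reduced = select_columns(X, kept)
print(kept)  # [0]
```

Low-variance columns carry little information for separating classes, which is why such a filter is often applied before a supervised test such as the F-test.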
2.
Microrna ; 13(2): 155-165, 2024.
Article in English | MEDLINE | ID: mdl-38591194

ABSTRACT

BACKGROUND: Long non-coding RNA (lncRNA) plays a crucial role in various biological processes, and mutations or imbalances of lncRNAs can lead to several diseases, including cancer, Prader-Willi syndrome, autism, Alzheimer's disease, cartilage-hair hypoplasia, and hearing loss. Understanding lncRNA-protein interactions (LPIs) is vital for elucidating basic cellular processes, human diseases, viral replication, transcription, and plant pathogen resistance. Despite the development of several LPI calculation methods, predicting LPI remains challenging, with the selection of variables and deep learning structure being the focus of LPI research. METHODS: We propose a deep learning framework called AR-LPI, which extracts sequence and secondary structure features of proteins and lncRNAs. The framework utilizes an auto-encoder for feature extraction and employs SE-ResNet for prediction. Additionally, we apply transfer learning to the deep neural network SE-ResNet for predicting small-sample datasets. RESULTS: Through comprehensive experimental comparison, we demonstrate that the AR-LPI architecture performs better in LPI prediction. Specifically, the accuracy of AR-LPI increases by 2.86% to 94.52%, while the F-value of AR-LPI increases by 2.71% to 94.73%. CONCLUSION: Our experimental results show that the overall performance of AR-LPI is better than that of other LPI prediction tools.


Subjects
Deep Learning , Long Noncoding RNA , Long Noncoding RNA/genetics , Humans , Computer Neural Networks , Computational Biology/methods
3.
J Biol Chem ; 300(4): 107140, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38447795

ABSTRACT

RNA modification, a posttranscriptional regulatory mechanism, significantly influences RNA biogenesis and function. The accurate identification of modification sites is paramount for investigating their biological implications. Methods for encoding RNA sequence into numerical data play a crucial role in developing robust models for predicting modification sites. However, existing techniques suffer from limitations, including inadequate information representation, challenges in effectively integrating positional and sequential information, and the generation of irrelevant or redundant features when combining multiple approaches. These deficiencies hinder the effectiveness of machine learning models in addressing the performance challenges associated with predicting RNA modification sites. Here, we introduce a novel RNA sequence feature representation method, named BiPSTP, which utilizes bidirectional trinucleotide position-specific propensities. We employ the parameter ξ to denote the interval between the current nucleotide and its adjacent forward or backward dinucleotide, enabling the extraction of positional and sequential information from RNA sequences. Leveraging the BiPSTP method, we have developed the prediction model mRNAPred using a support vector machine classifier to identify multiple types of RNA modification sites. We evaluate the performance of our BiPSTP method and mRNAPred model across 12 distinct RNA modification types. Our experimental results demonstrate the superiority of the mRNAPred model compared to state-of-the-art models in the domain of RNA modification site identification. Importantly, our BiPSTP method enhances the robustness and generalization performance of prediction models. Notably, it can be applied to feature extraction from DNA sequences to predict other biological modification sites.


Subjects
Post-Transcriptional RNA Processing , RNA , Support Vector Machine , Computational Biology/methods , RNA/chemistry , RNA/genetics , RNA/metabolism , RNA Sequence Analysis/methods , Nucleotides/chemistry , Nucleotides/metabolism
4.
BMC Bioinformatics ; 25(1): 39, 2024 Jan 23.
Article in English | MEDLINE | ID: mdl-38262923

ABSTRACT

BACKGROUND: Drug-drug interactions (DDI) are prevalent in combination therapy, making it important to identify and predict potential DDI. While various artificial intelligence methods can predict and identify potential DDI, they often overlook the sequence information of drug molecules and fail to comprehensively consider the contribution of molecular substructures to DDI. RESULTS: In this paper, we propose a novel model for DDI prediction based on sequence and substructure features (SSF-DDI) to address these issues. Our model integrates drug sequence features and structural features from the drug molecule graph, providing enhanced information for DDI prediction and enabling a more comprehensive and accurate representation of drug molecules. CONCLUSION: The results of experiments and case studies have demonstrated that SSF-DDI significantly outperforms state-of-the-art DDI prediction models across multiple real datasets and settings. SSF-DDI performs better in predicting DDI involving unknown drugs, resulting in a 5.67% improvement in accuracy compared to state-of-the-art methods.


Subjects
Artificial Intelligence , Deep Learning , Drug Interactions
5.
Front Microbiol ; 14: 1277099, 2023.
Article in English | MEDLINE | ID: mdl-37937221

ABSTRACT

5-Methyluridine (m5U) is one of the most common post-transcriptional RNA modifications, which is involved in a variety of important biological processes and disease development. The precise identification of the m5U sites allows for a better understanding of the biological processes of RNA and contributes to the discovery of new RNA functional and therapeutic targets. Here, we present m5U-GEPred, a prediction framework that combines sequence characteristics and graph embedding-based information for m5U identification. The graph embedding approach was introduced to extract the global information of training data that complemented the local information represented by conventional sequence features, thereby enhancing the prediction performance of m5U identification. m5U-GEPred outperformed the state-of-the-art m5U predictors built on two independent species, with an average AUROC of 0.984 and 0.985 tested on human and yeast transcriptomes, respectively. To further validate the performance of our newly proposed framework, the experimentally validated m5U sites identified from Oxford Nanopore Technology (ONT) were collected as independent testing data, and in this project, m5U-GEPred achieved reasonable prediction performance with an accuracy (ACC) of 91.84%. We hope that m5U-GEPred will serve as a useful computational alternative for m5U identification.

6.
Math Biosci Eng ; 20(8): 14734-14755, 2023 07 07.
Article in English | MEDLINE | ID: mdl-37679156

ABSTRACT

Protein interactions are the foundation of all metabolic activities of cells, such as apoptosis, the immune response, and metabolic pathways. In order to optimize the performance of protein interaction prediction, a coding method based on normalized difference sequence characteristics (NDSF) of amino acid sequences is proposed. By using the positional relationships between amino acids in the sequences and the correlation characteristics between sequence pairs, NDSF is jointly encoded. Using principal component analysis (PCA) and local linear embedding (LLE) dimensionality reduction methods, the coded 174-dimensional human protein sequence vector is extracted using sequence features. This study compares the classification performance of four ensemble learning methods (AdaBoost, Extra trees, LightGBM, XGBoost) applied to PCA and LLE features. Cross-validation and grid search methods are used to find the best combination of parameters. The results show that the accuracy of NDSF is generally higher than that of the sequence matrix-based (MOS) coding method, and the loss and coding time can be greatly reduced. The bar chart of feature extraction shows that the classification accuracy is significantly higher when using the linear dimensionality reduction method, PCA, compared to the nonlinear dimensionality reduction method, LLE. After classification with XGBoost, the model accuracy reaches 99.2%, which provides the best performance among all models. This study suggests that NDSF combined with PCA and XGBoost may be an effective strategy for classifying different human protein interactions.


Subjects
Amino Acids , Research Design , Humans , Amino Acid Sequence , Apoptosis , Computer Systems
7.
BMC Plant Biol ; 23(1): 81, 2023 Feb 07.
Article in English | MEDLINE | ID: mdl-36750954

ABSTRACT

BACKGROUND: The cytoplasmic male sterility (CMS) of rice is caused by chimeric mitochondrial DNA (mtDNA) that is maternally inherited in the majority of multicellular organisms. Wild rice (Oryza rufipogon Griff.) has been regarded as the ancestral progenitor of Asian cultivated rice (Oryza sativa L.). To investigate the distribution of original CMS source, and explore the origin of gametophytic CMS gene, a total of 427 individuals with seventeen representative populations of O. rufipogon were collected from Dongxiang of Jiangxi Province to Sanya of Hainan Province, China, for the PCR amplification of atp6, orfH79 and B-atp6-orfH79, respectively. RESULTS: The B-atp6-orfH79 and its variants (B-atp6-GSV) were detected in five among seventeen populations (i.e. HK, GZ, PS, TL and YJ) through PCR amplification, which could be divided into three haplotypes, i.e., BH1, BH2, and BH3. The BH2 haplotype was identical to B-atp6-orfH79, while the BH1 and BH3 were the novel haplotypes of B-atp6-GSV. Combined with the high-homology sequences in GenBank, a total of eighteen haplotypes have been revealed, only with ten haplotypes in orfH79 and its variants (GSV) that belong to three species (i.e. O. rufipogon, Oryza nivara and Oryza sativa). Enough haplotypes clearly demonstrated the uniform structural characteristics of the B-atp6-orfH79 as follows: except for the conserved sequence (671 bp) composed of B-atp6 (619 bp) and the downstream followed the B-atp6 (52 bp, DS), and GSV sequence, a rich variable sequence (VS, 176 bp) lies between the DS and GSV with five insertions or deletions and more than 30 single nucleotide polymorphisms. Maximum likelihood analysis showed that eighteen haplotypes formed three clades with high support rate. The hierarchical analysis of molecular variance (AMOVA) indicated the occurrence of variation among all populations (FST = 1; P < 0.001), which implied that the chimeric structure occurred independently. Three haplotypes (i.e., H1, H2 and H3) were detected by the primer of orfH79, which were identical to the GSV in B-atp6-GSV structure, respectively. All seventeen haplotypes of the orfH79 belonged to six species based on our results and the existing references. Seven single nucleotide polymorphisms in the GSV section can be translated into eleven various amino acid sequences. CONCLUSIONS: Generally, this study, indicating that orfH79 was always accompanied by the B-atp6, not only provides two original CMS sources for rice breeding, but also confirms the uniform structure of B-atp6-orfH79, which contributes to revealing the origin of rice gametophytic CMS genes and the reasons for frequent recombination of mitochondrial DNA.


Subjects
Oryza , Mitochondrial DNA/genetics , Mitochondrial DNA/metabolism , Mitochondria/metabolism , Oryza/genetics , Plant Breeding
8.
Micromachines (Basel) ; 14(1)2023 Jan 14.
Article in English | MEDLINE | ID: mdl-36677278

ABSTRACT

Accurately recognizing the hardness and type of different objects by tactile sensors is of great significance in human-machine interaction. In this paper, a novel porous graphene flexible tactile sensor array with great performance is designed and fabricated, and it is mounted on a two-finger mechanical actuator. This is used to detect various tactile sequence features from different objects by slightly squeezing them by 2 mm. A Residual Network (ResNet) model, with excellent adaptivity and feature extraction ability, is constructed to realize the recognition of 4 hardness categories and 12 object types, based on the tactile time sequence signals collected by the novel sensor array; the average accuracies of hardness and type recognition are 100% and 99.7%, respectively. To further verify the classification ability of the ResNet model for the tactile feature information detected by the sensor array, the Multilayer Perceptron (MLP), LeNet, Multi-Channel Deep Convolutional Neural Network (MCDCNN), and ENCODER models are built based on the same dataset used for the ResNet model. The average recognition accuracies of the 4 hardness categories, based on those four models, are 93.6%, 98.3%, 93.3%, and 98.1%. Meanwhile, the average recognition accuracies of the 12 object types, based on the four models, are 94.7%, 98.9%, 85.0%, and 96.4%. All of the results demonstrate that the novel porous graphene tactile sensor array has excellent perceptual performance and the ResNet model can very effectively and precisely complete the hardness and type recognition of objects for the flexible tactile sensor array.

9.
Genes (Basel) ; 13(9)2022 08 25.
Article in English | MEDLINE | ID: mdl-36140696

ABSTRACT

Identification of miRNA-mRNA interactions is critical to understand the new paradigms in gene regulation. Existing methods show suboptimal performance owing to inappropriate feature selection and limited integration of intuitive biological features of both miRNAs and mRNAs. The present regularized least square-based method, mintRULS, employs features of miRNAs and their target sites using pairwise similarity metrics based on free energy, sequence and repeat identities, and target site accessibility to predict miRNA-target site interactions. We hypothesized that miRNAs sharing similar structural and functional features are more likely to target the same mRNA, and conversely, mRNAs with similar features can be targeted by the same miRNA. Our prediction model achieved an impressive AUC of 0.93 and 0.92 in LOOCV and LmiTOCV settings, respectively. In comparison, other popular tools such as miRDB, TargetScan, MBSTAR, RPmirDIP, and STarMir scored AUCs at 0.73, 0.77, 0.55, 0.84, and 0.67, respectively, in LOOCV setting. Similarly, mintRULS outperformed other methods using metrics such as accuracy, sensitivity, specificity, and MCC. Our method also demonstrated high accuracy when validated against experimentally derived data from condition- and cell-specific studies and expression studies of miRNAs and target genes, both in human and mouse.


Subjects
MicroRNAs , Animals , Gene Expression Regulation , Humans , Least-Squares Analysis , Mice , MicroRNAs/metabolism , Messenger RNA/metabolism
10.
Phytochemistry ; 200: 113222, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35561852

ABSTRACT

In crops, RNA editing is one of the most important post-transcriptional processes in which specific cytidines (C) in virtually all mitochondrial protein-coding genes are converted to uridines (U). Despite extensive recent research in RNA editing, exploring all of the C-to-U editing events efficiently on the genomic scale remains challenging. Developing accurate prediction methods for the detection of RNA editing sites would dramatically reduce experimental determination. Therefore, we propose a novel method, iPReditor-CMG (improved predictive RNA editor for crop mitochondrial genomes), to predict crop mitochondrial editing sites using genome sequence and an optimised support vector machine (SVM). We first selected three mitochondrial genomes with known RNA editing sites from Arabidopsis thaliana, Brassica napus and Oryza sativa, released by NCBI, as the training and test sets. The genes and their transcripts from self-sequenced tobacco mitochondrial ATPase were selected as the validation set. The iPReditor-CMG first coded the genome sequences as numerical vectors and then performed an efficient feature selection on the high-dimensional feature space, where the SVM was employed in feature selection and following modelling. The average independent prediction accuracy of intraspecific editing sites across three species was 0.85, and up to 0.91 in A. thaliana, which outperformed the reference models. For the interspecific independent prediction, the prediction accuracy between dicotyledons was 0.78 and the accuracy between dicotyledons and monocotyledons was 0.56, which implies that there might be similarity in the C-to-U editing mechanism in close relatives. Finally, the best model was identified with an independent test accuracy of 0.91 and an AUC of 0.88, which suggested that five unreported feature sequences, i.e. TGACA, ACAAC, GTAGA, CCGTT and TAACA, are closely associated with the editing phenomenon. Multiple tests supported that the iPReditor-CMG could be effectively applied to predict editing sites in crop mitochondria, which may further contribute to understanding the mechanisms of site editing and post-transcriptional events in crop mitochondria.


Subjects
Arabidopsis , Mitochondrial Genome , Arabidopsis/genetics , Arabidopsis/metabolism , Mitochondrial Genome/genetics , Genomics , RNA/genetics , RNA/metabolism , RNA Editing , Support Vector Machine
11.
Interdiscip Sci ; 14(2): 409-420, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35192174

ABSTRACT

Long non-coding RNAs play a crucial role in many cellular processes, such as genetic markers, RNA splicing, signaling, and protein regulation. Considering that identifying lncRNA's localization in the cell through experimental methods is complicated, hard to reproduce, and expensive, we propose a novel method named IDDLncLoc in this paper, which adopts an ensemble model to solve the problem of the subcellular localization. In the proposed model, dinucleotide-based auto-cross covariance features, k-mer nucleotide composition features, and composition, transition, and distribution features are introduced to encode a raw RNA sequence into a vector. To screen out reliable features, feature selection through binomial distribution and recursive feature elimination is employed. Furthermore, strategies of oversampling in mini-batch, random sampling, and stacking ensemble are customized to overcome the problem of data imbalance on the benchmark dataset. Finally, compared with the latest methods, IDDLncLoc achieves an accuracy of 94.96% on the benchmark dataset, which is 2.59% higher than the best method, and the results further demonstrate that IDDLncLoc is excellent on the subcellular localization of lncRNA. Besides, a user-friendly web server is available at http://lncloc.club.


Subjects
Long Noncoding RNA , Base Sequence , Computational Biology/methods , Nucleotides , Proteins/genetics , Long Noncoding RNA/genetics , Long Noncoding RNA/metabolism
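
Of the feature families listed in the IDDLncLoc abstract above, k-mer nucleotide composition is the simplest to illustrate. The sketch below is a generic version under stated assumptions, not the published implementation: the function name, the choice of k = 2, the ACGU alphabet and the example sequence are all hypothetical.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Encode a sequence as the relative frequency of every possible k-mer.

    The vector has len(alphabet) ** k entries, ordered as itertools.product
    enumerates them. Windows containing other symbols (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Hypothetical 9-nt sequence: 8 overlapping dinucleotides, "AU" occurs twice.
vec = kmer_composition("AUGGCUAUG", k=2)
print(len(vec))  # 16
```

Because every sequence maps to a vector of the same length regardless of its own length, this encoding can be fed directly to standard classifiers.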
12.
Front Immunol ; 13: 1053617, 2022.
Article in English | MEDLINE | ID: mdl-36618397

ABSTRACT

Introduction: Antibody-mediated immunity is an essential part of the immune system in vertebrates. The ability to specifically bind to antigens allows antibodies to be widely used in the therapy of cancers and other critical diseases. A key step in antibody therapeutics is the experimental identification of antibody-antigen interactions, which is generally time-consuming, costly, and laborious. Although some computational methods have been proposed to screen potential antibodies, the dependence on 3D structures still limits the application of these methods. Methods: Here, we developed a deep learning-assisted prediction method (i.e., AbAgIntPre) for fast identification of antibody-antigen interactions that only relies on amino acid sequences. A Siamese-like convolutional neural network architecture was established with the amino acid composition encoding scheme for both antigens and antibodies. Results and Discussion: The generic model of AbAgIntPre achieved satisfactory performance with the Area Under Curve (AUC) of 0.82 on a high-quality generic independent test dataset. Besides, this approach also showed competitive performance on the more specific SARS-CoV dataset. We expect that AbAgIntPre can serve as an important complement to traditional experimental methods for antibody screening and effectively reduce the workload of antibody design. The web server of AbAgIntPre is freely available at http://www.zzdlab.com/AbAgIntPre.


Subjects
Deep Learning , Animals , Computer Neural Networks , Antibodies , Amino Acid Sequence , Antigens
13.
Interdiscip Sci ; 13(4): 693-702, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34143353

ABSTRACT

Transmembrane proteins play a vital role in cell life activities. There are several techniques to determine transmembrane protein structures and X-ray crystallography is the primary methodology. However, due to the special properties of transmembrane proteins, it is still hard to determine their structures by X-ray crystallography. To reduce experimental consumption and improve experimental efficiency, it is of great significance to develop computational methods for predicting the crystallization propensity of transmembrane proteins. In this work, we proposed a sequence-based machine learning method, namely Prediction of TransMembrane protein Crystallization propensity (PTMC), to predict the propensity of transmembrane protein crystallization. First, we obtained several general sequence features and the specific encoded features of relative solvent accessibility and hydrophobicity. Second, feature selection was employed to filter out redundant and irrelevant features, and the optimal feature subset is composed of hydrophobicity, amino acid composition and relative solvent accessibility. Finally, we chose extreme gradient boosting after comparing it with several other machine learning methods. Comparative results on the independent test set indicate that PTMC outperforms state-of-the-art sequence-based methods in terms of sensitivity, specificity, accuracy, Matthews Correlation Coefficient (MCC) and Area Under the receiver operating characteristic Curve (AUC). In comparison with two competitors, Bcrystal and TMCrys, PTMC achieves an improvement of 0.132 and 0.179 for sensitivity, 0.014 and 0.127 for specificity, 0.037 and 0.192 for accuracy, 0.128 and 0.362 for MCC, and 0.027 and 0.125 for AUC, respectively.


Subjects
Computational Biology , Membrane Proteins , Crystallization , X-Ray Crystallography , Hydrophobic and Hydrophilic Interactions
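
Several abstracts in this list, including the PTMC one above, report sensitivity, specificity, accuracy and MCC. As a reminder of how these are derived from a binary confusion matrix, here is a minimal sketch; the function name and the toy labels are assumptions for illustration and are unrelated to any dataset cited here.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, accuracy and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    # conventionally set to 0 when the denominator is 0.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Toy labels: 3 TP, 1 FN, 3 TN, 1 FP.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
m = binary_metrics(y_true, y_pred)
print(m["mcc"])  # 0.5
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced.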
14.
Proteins ; 89(7): 745-761, 2021 07.
Article in English | MEDLINE | ID: mdl-33580578

ABSTRACT

Cysteine (Cys) is the most reactive amino acid participating in a wide range of biological functions. In-silico predictions complement the experiments to meet the need of functional characterization. Algorithms that predict multiple Cys functions are scarce, in contrast to specific function prediction algorithms. Here we present a deep neural network-based multiple Cys function prediction, available as a web server (DeepCys) (https://deepcys.herokuapp.com/). The DeepCys model was trained and tested on two independent datasets curated from protein crystal structures. This prediction method requires three inputs, namely, PDB identifier (ID), chain ID and residue ID for a given Cys, and outputs the probabilities of four cysteine functions, namely, disulphide, metal-binding, thioether and sulphenylation, and predicts the most probable Cys function. The algorithm exploits local and global protein properties, such as sequence and secondary structure motifs, buried fractions, microenvironments and protein/enzyme class. DeepCys outperformed most of the multiple and specific Cys function algorithms. This method can predict the maximum number of cysteine functions. Moreover, it explicitly predicts thioether function for the first time. This tool was used to elucidate the cysteine functions on domains of unknown functions belonging to cytochrome C oxidase subunit-II like transmembrane domains. Apart from the web server, a standalone program is also available on GitHub (https://github.com/vam-sin/deepcys).


Subjects
Cysteine/chemistry , Deep Learning , Disulfides/chemistry , Electron Transport Complex IV/chemistry , Post-Translational Protein Processing , Software , Amino Acid Sequence , Divalent Cations/chemistry , Divalent Cations/metabolism , Cysteine/metabolism , Disulfides/metabolism , Electron Transport Complex IV/metabolism , Glutathione/chemistry , Glutathione/metabolism , Molecular Models , Nitroso Compounds/chemistry , Nitroso Compounds/metabolism , Protein Domains , Secondary Protein Structure , Structure-Activity Relationship , Sulfides/chemistry , Sulfides/metabolism , Sulfinic Acids/chemistry , Sulfinic Acids/metabolism , Sulfonic Acids/chemistry , Sulfonic Acids/metabolism
15.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33634313

ABSTRACT

Three-dimensional (3D) architecture of the chromosomes is of crucial importance for transcription regulation and DNA replication. Various high-throughput chromosome conformation capture-based methods have revealed that CTCF-mediated chromatin loops are a major component of 3D architecture. However, CTCF-mediated chromatin loops are cell type specific, and most chromatin interaction capture techniques are time-consuming and labor-intensive, which restricts their usage on a very large number of cell types. Genomic sequence-based computational models are sophisticated enough to capture important features of chromatin architecture and help to identify chromatin loops. In this work, we develop Deep-loop, a convolutional neural network model, to integrate k-tuple nucleotide frequency component, nucleotide pair spectrum encoding, position conservation, position scoring function and natural vector features for the prediction of chromatin loops. In a series of cross-validation experiments, Deep-loop shows excellent performance in the identification of chromatin loops from different cell types. The source code of Deep-loop is freely available at the repository https://github.com/linDing-group/Deep-loop.


Subjects
CCCTC-Binding Factor/genetics , Chromatin/metabolism , Human Genome , Computer Neural Networks , CCCTC-Binding Factor/metabolism , Chromatin/ultrastructure , Datasets as Topic , Gene Expression Regulation , Humans , K562 Cells , MCF-7 Cells , Molecular Conformation , Nucleotide Motifs , Software
16.
Anal Biochem ; 612: 113955, 2021 01 01.
Article in English | MEDLINE | ID: mdl-32949607

ABSTRACT

Phosphorylation is a ubiquitous type of post-translational modification (PTM) that occurs in both eukaryotic and prokaryotic cells, in which a phosphate group binds with amino acid residues. These specific residues, i.e., serine (S), threonine (T), and tyrosine (Y), exhibit diverse functions at the molecular level. Recent studies have determined that some diseases such as cancer, diabetes, and neurodegenerative diseases are caused by abnormal phosphorylation. Based on its potential applications in biological research and drug development, the large-scale identification of phosphorylation sites has attracted interest. Existing wet-lab technologies for targeting phosphorylation sites are expensive and time consuming. Thus, computational algorithms that can efficiently accelerate the annotation of phosphorylation sites from massive protein sequences are needed. Numerous machine learning-based methods have been implemented for phosphorylation site prediction. However, despite extensive efforts, existing computational approaches continue to have inadequate performance, particularly in terms of overall ACC, MCC, and AUC. In this paper, we report a novel deep learning-based predictor to overcome these performance hurdles, DeepPPSite, which was constructed using a stacked long short-term memory recurrent network for predicting phosphorylation sites. The proposed technique expediently learns the protein representations from conjoint protein descriptors. The experimental results indicated that our model achieved superior performance on the training dataset for S, T and Y, with MCC values of 0.608, 0.602, and 0.558, respectively, using a 10-fold cross-validation test. We further determined the generalization efficacy of the proposed predictor DeepPPSite by conducting a rigorous independent test. The predictive MCC values were 0.358, 0.356, and 0.350 for the S, T, and Y phosphorylation sites, respectively. Rigorous cross-validation and independent validation tests for the three types of phosphorylation sites demonstrated that the designed DeepPPSite tool significantly outperforms state-of-the-art methods.


Subjects
Computational Biology/methods , Post-Translational Protein Processing , Protein Sequence Analysis/methods , Amino Acid Sequence , Animals , Protein Databases , Deep Learning , Statistical Models , Phosphorylation , ROC Curve , Serine/chemistry , Serine/metabolism , Threonine/chemistry , Threonine/metabolism , Tyrosine/chemistry , Tyrosine/metabolism
17.
Methods Mol Biol ; 2254: 179-194, 2021.
Article in English | MEDLINE | ID: mdl-33326076

ABSTRACT

Ribosome profiling shows potential for studying the function of long noncoding RNAs (lncRNAs). We introduce a bioinformatics pipeline for detecting ribosome-associated lncRNAs (ribo-lncRNAs) from ribosome profiling data. Further, we describe a machine-learning approach for the characterization of ribo-lncRNAs based on their sequence features. Scripts for ribo-lncRNA analysis can be accessed at https://ribolnc.hamadalab.com/.


Subjects
Computational Biology/methods , Long Noncoding RNA/genetics , Ribosomes/metabolism , Gene Expression Profiling , Humans , Machine Learning , Long Noncoding RNA/metabolism , Software
18.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33253385

ABSTRACT

Coenzyme A-associated proteins (CAPs) are a category of functionally important proteins involved in multiple biological processes through interactions with coenzyme A (CoA). To date, unfortunately, the specific differences between CAPs and other proteins have yet to be systematically investigated. Moreover, there are no computational methods that can be used specifically to predict these proteins. Herein, we characterized CAPs from multifaceted viewpoints and revealed their specific preferences. Compared with other proteins, CAPs were more likely to possess binding regions for CoA and its derivatives, were evolutionarily highly conserved, exhibited ordered and hydrophobic structural conformations, and tended to be densely located in protein-protein interaction networks. Based on these biological insights, we built seven classifiers using predicted CoA-binding residue distributions, word embedding vectors, remote homolog numbers, evolutionary conservation, amino acid composition, predicted structural features and network properties. These classifiers could effectively identify CAPs in Homo sapiens, Mus musculus and Arabidopsis thaliana. The complementarity among the individual classifiers prompted us to build a two-layer stacking model named CAPE for improving prediction performance. We applied CAPE to identify some high-confidence candidates in the three species, which were tightly associated with the known functions of CAPs. Finally, we extended our algorithm to cross-species prediction, thereby developing a generic CAP prediction model. In summary, this work provides a comprehensive survey and an effective predictor for CAPs, which can help uncover the interplay between CoA and functionally relevant proteins.


Subjects
Arabidopsis Proteins , Arabidopsis , Coenzyme A , Databases, Protein , Protein Interaction Maps , Sequence Analysis, Protein , Animals , Arabidopsis/genetics , Arabidopsis/metabolism , Arabidopsis Proteins/genetics , Arabidopsis Proteins/metabolism , Coenzyme A/genetics , Coenzyme A/metabolism , Humans , Mice
19.
BMC Genomics ; 21(Suppl 11): 893, 2020 Dec 29.
Article in English | MEDLINE | ID: mdl-33372605

ABSTRACT

BACKGROUND: Horizontal Gene Transfer (HGT) refers to the sharing of genetic material between distant species that are not in a parent-offspring relationship. HGT insertion sites are important for understanding HGT mechanisms. Recent studies of the main agents of HGT, such as transposons and plasmids, demonstrate that insertion sites usually hold specific sequence features. This motivates us to find a method to infer HGT insertion sites according to sequence features. RESULTS: In this paper, we propose a deep residual network, DeepHGT, to recognize HGT insertion sites. To train DeepHGT, we extracted about 1.55 million sequence segments as training instances from 262 metagenomic samples, where the ratio between positive instances and negative instances is about 1:1. These segments are randomly partitioned into three subsets: 80% of them as the training set, 10% as the validation set, and the remaining 10% as the test set. The training loss of DeepHGT is 0.4163 and the validation loss is 0.423. On the test set, DeepHGT achieved an area under the curve (AUC) value of 0.8782. To further evaluate the generalization of DeepHGT, we constructed an independent test set containing 689,312 sequence segments from another 147 gut metagenomic samples. DeepHGT achieved an AUC value of 0.8428, which approaches the previous test AUC value. As a comparison, the gradient boosting classifier model implemented in PyFeat achieves AUC values of 0.694 and 0.686 on the above two test sets, respectively. Furthermore, DeepHGT could learn discriminant sequence features; for example, DeepHGT has learned a sequence pattern of palindromic subsequences as a significant (P-value = 0.0182) local feature. Hence, DeepHGT is a reliable model to recognize HGT insertion sites. CONCLUSION: DeepHGT is the first deep learning model that can accurately recognize HGT insertion sites on genomes according to the sequence pattern.
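
The random 80%/10%/10% partition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `split_80_10_10`, the fixed seed, and the placeholder segment labels are assumptions made only for illustration.

```python
import random


def split_80_10_10(items, seed=0):
    """Randomly partition items into training, validation and test lists.

    Hypothetical helper: 80% training, 10% validation, remainder as test.
    """
    shuffled = list(items)
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder labels standing in for real sequence segments.
segments = [f"segment_{i}" for i in range(1000)]
train, val, test = split_80_10_10(segments)
print(len(train), len(val), len(test))
```

Integer arithmetic is used for the subset sizes so that any remainder from rounding falls into the test set and no segment is dropped or duplicated.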


Subjects
Deep Learning , Gene Transfer, Horizontal , Base Sequence , Genome , Metagenomics , Phylogeny
20.
Mol Ther Nucleic Acids ; 22: 1043-1050, 2020 Dec 04.
Article in English | MEDLINE | ID: mdl-33294291

ABSTRACT

Transcription factors play key roles in cell-fate decisions by regulating 3D genome conformation and gene expression. The traditional view is that methylation of DNA hinders transcription factors from binding to it, but recent research has shown that many transcription factors prefer to bind to methylated DNA. Therefore, identifying such transcription factors and understanding their functions is a stepping-stone for studying methylation-mediated biological processes. In this paper, a two-step discrimination method was proposed to recognize transcription factors and their preference for methylated DNA based only on sequence information. In the first step, the proposed model was used to discriminate transcription factors from non-transcription factors. The areas under the curve (AUCs) are 0.9183 and 0.9116, respectively, for the 5-fold cross-validation test and independent dataset test. Subsequently, for the classification of transcription factors that prefer methylated DNA and transcription factors that prefer non-methylated DNA, our model could produce the AUCs of 0.7744 and 0.7356, respectively, for the 5-fold cross-validation test and independent dataset test. Based on the proposed model, a user-friendly web server called TFPred was built, which can be freely accessed at http://lin-group.cn/server/TFPred/.
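
The two-step structure described in the abstract (first transcription factor versus non-transcription factor, then methylated versus non-methylated preference) can be sketched as a simple cascade. This is only an illustration of the control flow, not the TFPred model: the function names, the 0.5 threshold, and both toy scoring functions are hypothetical stand-ins with no biological meaning.

```python
def two_step_predict(sequence, tf_score, methyl_score, threshold=0.5):
    """Apply two classifiers in sequence; step 2 runs only if step 1 passes.

    tf_score and methyl_score are callables returning a score in [0, 1].
    """
    # Step 1: discriminate transcription factors from other proteins.
    if tf_score(sequence) < threshold:
        return "non-TF"
    # Step 2: only predicted transcription factors reach this classifier.
    if methyl_score(sequence) >= threshold:
        return "TF, prefers methylated DNA"
    return "TF, prefers non-methylated DNA"


# Arbitrary placeholder scorers; a real system would use trained models.
def toy_tf_score(seq):
    return (seq.count("K") + seq.count("R")) / len(seq)


def toy_methyl_score(seq):
    return seq.count("C") / len(seq)


for seq in ["AAAAAA", "KRKRAA", "KRCCCK"]:
    print(seq, "->", two_step_predict(seq, toy_tf_score, toy_methyl_score))
```

In a cascade of this kind, errors made in the first step propagate to the second, which is one reason each step is usually evaluated separately, as the abstract does.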
