Results 1 - 20 of 48
1.
Brief Funct Genomics ; 2024 Jun 23.
Article in English | MEDLINE | ID: mdl-38912767

ABSTRACT

RNA interference (RNAi) technology is widely used in the biological prevention and control of terrestrial insects. One of the main factors affecting the application of RNAi in insects is the difference in RNAi efficiency, which may vary not only among insects, but also among genes of the same insect, and even among double-stranded RNAs (dsRNAs) of the same gene. This work focuses on the last question and establishes a bioinformatics software that can help researchers screen for the most efficient dsRNA against a target gene. Among insects, the red flour beetle (Tribolium castaneum) is known to be one of the most sensitive to RNAi. From iBeetle-Base, we extracted 12 027 efficient dsRNA sequences with a lethality rate of ≥20% or with experimentation-induced phenotypic changes and processed these data to correspond to specific silencing efficiencies. Based on this first compiled benchmark dataset, we specifically designed a deep neural network to identify and characterize efficient dsRNA for RNAi in insects. The dna2vec word embedding model was trained to extract distributed feature representations, and three modules, namely a convolutional neural network, a bidirectional long short-term memory network, and a self-attention mechanism, were integrated to form our predictor model to characterize the extracted dsRNAs and their silencing efficiencies for T. castaneum. Our model dsRNAPredictor showed reliable performance in multiple independent tests based on different species, including both T. castaneum and Aedes aegypti. This indicates that dsRNAPredictor can facilitate prescreening when designing high-efficiency dsRNA against insect target genes.

2.
Methods ; 227: 48-57, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38734394

ABSTRACT

Studies have shown that protein glycosylation in cells reflects the real-time dynamics of biological processes, and the occurrence and development of many diseases are closely related to protein glycosylation. Abnormal protein glycosylation can be used as a potential diagnostic and prognostic marker of a disease, as well as a therapeutic target and a new breakthrough point for exploring pathogenesis. To address the issue of significant differences in the prediction results of previous models for different species, we constructed a hybrid deep learning model N-GlycoPred on the basis of dual-layer convolution, a paired attention mechanism and BiLSTM for accurate identification of N-glycosylation sites. By adopting one-hot encoding or the AAindex, we specifically selected the optimum combination of features and deep learning frameworks for human and mouse to refine the models. Based on six independent test datasets, our N-GlycoPred model achieved an average AUC of 0.9553, which is 0.23% higher than MusiteDeep. The comparison results indicate that our model can serve as a powerful tool for N-glycosylation site prescreening for biological researchers.


Subject(s)
Deep Learning; Glycosylation; Humans; Animals; Mice
3.
IEEE J Biomed Health Inform ; 28(7): 4325-4335, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38578862

ABSTRACT

Circular RNAs (circRNAs) exist in vivo and are a class of noncoding RNA molecules. They have a single-stranded, closed, annular structure. Many studies have shown that circRNAs and diseases are linked. Therefore, it is critical to build a reliable and accurate predictor to find the circRNA-disease association. In this paper, we presented a meta-learning model named MAMLCDA to identify the circRNA-disease association, which is based on model-agnostic meta-learning (MAML) combined with CNN classification. Specifically, similarities between diseases and circRNAs are extracted and integrated to characterize their relationships, and k-means is used to cluster majority samples and select a certain number of samples from each cluster to obtain the same number of negative samples as the positive samples. To further reduce the dimension of the features and save operation time, we applied probabilistic principal component analysis (PPCA) to compact the integrated circRNA and disease similarity network feature vectors. The feature vectors are converted into images. At this time, the prediction problem is transformed into the 2-way 1-shot problem of the image and input into the model with MAML as the meta-learner and CNN as the base-learner. Comparison results of five-fold cross-validation on two benchmark datasets illustrate that MAMLCDA outperforms several state-of-the-art approaches with the best accuracies of 95.33% and 98%. Therefore, MAMLCDA can help to understand the pathogenesis of complex diseases at the circRNA level.


Subject(s)
Neural Networks, Computer; RNA, Circular; RNA, Circular/genetics; Humans; Computational Biology/methods; Machine Learning; Algorithms; Principal Component Analysis
4.
J Chem Inf Model ; 64(7): 2393-2404, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37799091

ABSTRACT

Antimicrobial peptides (AMPs) are small molecular polypeptides that can be widely used in the prevention and treatment of microbial infections. Although many computational models have been proposed to help identify AMPs, a high-performance and interpretable model is still lacking. In this study, new benchmark data sets are collected and processed, and a stacking deep architecture named AMPpred-MFA is carefully designed to discover and identify AMPs. Multiple features and a multihead attention mechanism are utilized on the basis of a bidirectional long short-term memory (LSTM) network and a convolutional neural network (CNN). The effectiveness of AMPpred-MFA is verified through five independent tests conducted in batches. Experimental results show that AMPpred-MFA achieves a state-of-the-art performance. The visualization interpretability analyses and ablation experiments offer a further understanding of the model behavior and performance, validating the importance of our feature representation and stacking architecture, especially the multihead attention mechanism. Therefore, AMPpred-MFA can be considered a reliable and efficient approach to understanding and predicting AMPs.


Subject(s)
Antimicrobial Peptides; Benchmarking; Neural Networks, Computer
5.
J Proteome Res ; 23(1): 95-106, 2024 01 05.
Article in English | MEDLINE | ID: mdl-38054441

ABSTRACT

O-linked β-N-acetylglucosamine (O-GlcNAc) is a post-translational modification (i.e., O-GlcNAcylation) on serine/threonine residues of proteins, regulating a plethora of physiological and pathological events. As a dynamic process, O-GlcNAc functions in a site-specific manner. However, the experimental identification of the O-GlcNAc sites remains challenging in many scenarios. Herein, by leveraging the recent progress in cataloguing experimentally identified O-GlcNAc sites and advanced deep learning approaches, we establish an ensemble model, O-GlcNAcPRED-DL, a deep learning-based tool, for the prediction of O-GlcNAc sites. In brief, to make a benchmark O-GlcNAc data set, we extracted the information on O-GlcNAc from the recently constructed database O-GlcNAcAtlas, which contains thousands of experimentally identified and curated O-GlcNAc sites on proteins from multiple species. To overcome the imbalance between positive and negative data sets, we selected five groups of negative data sets in humans and mice to construct an ensemble predictor based on connection of a convolutional neural network and bidirectional long short-term memory. By taking into account three types of sequence information, we constructed four network frameworks, with the systematically optimized parameters used for the models. The thorough comparison analysis on two independent data sets of humans and mice and six independent data sets from other species demonstrated remarkably increased sensitivity and accuracy of the O-GlcNAcPRED-DL models, outperforming other existing tools. Moreover, a user-friendly Web server for O-GlcNAcPRED-DL has been constructed, which is freely available at http://oglcnac.org/pred_dl.


Subject(s)
Deep Learning; Humans; Animals; Mice; Proteins/metabolism; Protein Processing, Post-Translational; Acetylglucosamine/chemistry; N-Acetylglucosaminyltransferases/metabolism
6.
Math Biosci Eng ; 20(9): 15809-15829, 2023 07 31.
Article in English | MEDLINE | ID: mdl-37919990

ABSTRACT

Transcription factors (TFs) are important factors that regulate gene expression. Revealing the mechanism affecting the binding specificity of TFs is the key to understanding gene regulation. Most previous studies focus on TF-DNA binding sites at the sequence level and seldom utilize the contextual features of DNA sequences. In this paper, we develop an integrated spatiotemporal context-aware neural network framework, named GNet, for predicting TF-DNA binding signal at single nucleotide resolution by achieving three tasks: single nucleotide resolution signal prediction, identification of binding regions at the sequence level, and TF-DNA binding motif prediction. GNet extracts implicit spatial contextual information with a gated highway neural mechanism, which captures large-context multi-level patterns using linear shortcut connections; this idea permeates the encoder and decoder parts of GNet. An improved dual external attention mechanism learns implicit relationships both within and among samples and improves the performance of the model. Experimental results on 53 human TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that GNet outperforms the state-of-the-art methods in the three tasks, and the results of cross-species studies on 15 human and 18 mouse TF datasets of the corresponding TF families indicate that GNet also shows the best performance in cross-species prediction over the competitive methods.


Subject(s)
Nucleotides; Transcription Factors; Humans; Animals; Mice; Nucleotides/metabolism; Protein Binding; Transcription Factors/genetics; Chromatin; DNA
7.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37609923

ABSTRACT

The formation of biomolecular condensates by liquid-liquid phase separation (LLPS) has become a universal mechanism for spatiotemporal coordination of biological activities in cells and has been widely observed to directly regulate the key cellular processes involved in cancer cell pathology. However, the complexity of protein sequences and the diversity of their inherently disordered conformations pose great challenges for computational and experimental research on LLPS proteins. Herein, we proposed a novel predictor named PredLLPS_PSSM for LLPS protein identification based only on sequence evolution information. Because finding real and reliable samples is the cornerstone of building predictors, we newly collected and collated the LLPS proteins from the latest versions of three databases. After comparing the performance of the position-specific score matrix (PSSM) and word embedding, PredLLPS_PSSM combined PSSM-based information and two deep learning frameworks. Independent tests using three existing independent test datasets and two newly constructed independent test datasets demonstrated the superiority of PredLLPS_PSSM compared with state-of-the-art methods. Furthermore, we tested PredLLPS_PSSM on nine experimentally identified LLPS proteins from three insects that were not included in any of the databases. In addition, the Shapley Additive exPlanation algorithm and heatmaps were applied to find the most critical amino acids relevant to LLPS.


Subject(s)
Neural Networks, Computer; Proteins; Proteins/chemistry; Algorithms; Amino Acids/chemistry; Amino Acid Sequence
8.
Front Genet ; 14: 1226905, 2023.
Article in English | MEDLINE | ID: mdl-37576553

ABSTRACT

Neuropeptides contain more chemical information than other classical neurotransmitters and have multiple receptor recognition sites. These characteristics allow neuropeptides to have a correspondingly higher selectivity for nerve receptors and fewer side effects. Traditional experimental methods, such as mass spectrometry and liquid chromatography technology, still need the support of a complete neuropeptide precursor database and the basic characteristics of neuropeptides. Incomplete neuropeptide precursor and information databases will lead to false-positives or reduce the sensitivity of recognition. In recent years, studies have proven that machine learning methods can rapidly and effectively predict neuropeptides. In this work, we have made a systematic attempt to create an ensemble tool based on four convolution neural network models. These baseline models were separately trained on one-hot encoding, AAIndex, G-gap dipeptide encoding and word2vec and integrated using Gaussian Naive Bayes (NB) to construct our predictor designated NeuroCNN_GNB. Both 5-fold cross-validation tests using benchmark datasets and independent tests showed that NeuroCNN_GNB outperformed other state-of-the-art methods. Furthermore, this novel framework provides essential interpretations that aid the understanding of model success by leveraging the powerful Shapley Additive exPlanation (SHAP) algorithm, thereby highlighting the most important features relevant for predicting neuropeptides.
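
One of the encodings named in the abstract above, G-gap dipeptide encoding, can be sketched as follows. This is a minimal illustration of the commonly used formulation (frequencies of residue pairs separated by `gap` intervening positions), not the NeuroCNN_GNB implementation; the function name, the default `gap`, and the handling of non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide(seq, gap=1):
    """Return a 400-dimensional g-gap dipeptide frequency vector for seq."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    # Pair residue i with the residue (gap + 1) positions downstream.
    for i in range(len(seq) - gap - 1):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

With `gap=0` the function reduces to ordinary adjacent dipeptide composition.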

9.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37291763

ABSTRACT

BACKGROUND: Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for synthesizing the gene-encoded products by bacteria to grow and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed specifically for a particular species. To date, only a few predictors are available for identifying general bacterial promoters with limited predictive performance. RESULTS: In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with the attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/.


Subject(s)
Bacteria; Neural Networks, Computer; Bacteria/genetics; Bacteria/metabolism; DNA-Directed RNA Polymerases/genetics; DNA-Directed RNA Polymerases/metabolism; Base Sequence; Promoter Regions, Genetic
10.
Math Biosci Eng ; 20(4): 6853-6865, 2023 02 07.
Article in English | MEDLINE | ID: mdl-37161131

ABSTRACT

Phased small interfering RNAs (phasiRNAs) are plant secondary small interfering RNAs that are typically generated by the convergence of miRNAs and polyadenylated mRNAs. A growing number of studies have shown that miRNA-initiated phasiRNAs play crucial roles in regulating plant growth and stress responses. Experimental verification of miRNA-initiated phasiRNA loci may take considerable time, energy and labor. Therefore, computational methods capable of processing high-throughput data have been proposed in succession. In this work, we proposed a predictor (DIGITAL) for identifying miRNA-initiated phasiRNAs in plants, which combined a multi-scale residual network with a bidirectional long short-term memory network. The negative dataset was constructed from the positive data by randomly replacing 60% of the nucleotides in each positive sample. Our predictor achieved accuracies of 98.48% and 94.02%, respectively, on two independent test datasets with different sequence lengths. These independent testing results indicate the effectiveness of our model. Furthermore, DIGITAL is robust and generalizes well, and thus can be easily extended and applied to miRNA target recognition in other species. We provide the source code of DIGITAL, which is freely available at https://github.com/yuanyuanbu/DIGITAL.


Subject(s)
Deep Learning; MicroRNAs; MicroRNAs/genetics; Plant Development; RNA, Messenger; Software
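
The negative-set construction described in the DIGITAL abstract above (randomly replacing 60% of the nucleotides of each positive sample) could look roughly like the sketch below. It is not taken from the DIGITAL repository: the function name, the `ACGU` alphabet, and the choice to always substitute a different base are assumptions made for illustration.

```python
import random

NUCLEOTIDES = "ACGU"  # assumption: RNA alphabet; use "ACGT" for DNA-style data


def make_negative(seq, fraction=0.6, rng=None):
    """Return a copy of seq with `fraction` of its positions mutated."""
    rng = rng or random.Random()
    n_replace = round(fraction * len(seq))
    # Pick distinct positions so exactly n_replace sites are changed.
    positions = rng.sample(range(len(seq)), n_replace)
    chars = list(seq)
    for pos in positions:
        # Assumption: the substitute always differs from the original base.
        alternatives = [b for b in NUCLEOTIDES if b != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)
```

Passing a seeded `random.Random` instance makes the generated negatives reproducible across runs.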
11.
Brief Funct Genomics ; 22(3): 274-280, 2023 05 18.
Article in English | MEDLINE | ID: mdl-36528813

ABSTRACT

Antiviral defenses are one of the significant roles of RNA interference (RNAi) in plants. It has been reported that the host RNAi mechanism machinery can target viral RNAs for destruction because virus-derived small interfering RNAs (vsiRNAs) are found in infected host cells. Therefore, the recognition of plant vsiRNAs is the key to understanding the functional mechanisms of vsiRNAs and developing antiviral plants. In this work, we introduce a deep learning-based stacking ensemble approach, named computational prediction of plant exclusive virus-derived small interfering RNAs (COPPER), for plant vsiRNA prediction. COPPER used word2vec and fastText to generate sequence features and a hybrid deep learning framework, including a convolutional neural network, multiscale residual network and bidirectional long short-term memory network with a self-attention mechanism to enable precise predictions of plant vsiRNAs. Extensive benchmarking experiments with different sequence homology thresholds and ablation studies illustrated the comparative predictive performance of COPPER. In addition, the performance comparison with PVsiRNAPred conducted on an independent test dataset showed that COPPER significantly improved the predictive performance for plant vsiRNAs compared with other state-of-the-art methods. The datasets and source codes are publicly available at https://github.com/yuanyuanbu/COPPER.


Subject(s)
Deep Learning; Plant Viruses; RNA, Small Interfering/genetics; Copper; RNA Interference; Plants/genetics; Plant Viruses/genetics; Antiviral Agents
12.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36341591

ABSTRACT

Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracies of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.


Subject(s)
Cell Nucleus; Proteins; RNA, Messenger/genetics; Cell Nucleus/genetics; Computational Biology/methods; Databases, Protein
13.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35021193

ABSTRACT

Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.


Subject(s)
Drosophila melanogaster; Eukaryota; Animals; Computational Biology/methods; Drosophila melanogaster/genetics; Eukaryotic Cells; Mice; Prokaryotic Cells; Promoter Regions, Genetic
14.
Math Biosci Eng ; 19(1): 775-791, 2022 01.
Article in English | MEDLINE | ID: mdl-34903012

ABSTRACT

As one of the most significant protein post-translational modifications (PTMs) in eukaryotes, ubiquitylation plays an essential role in regulating diverse cellular functions, such as apoptosis, cell division, DNA repair and replication, intracellular transport and immune reactions. Traditional experimental methods are time-consuming, costly and labor-intensive. Therefore, it is highly desirable to develop automated computational methods that can recognize potential ubiquitylation sites rapidly and accurately. In this study, we propose a novel predictor, named UPFPSR, for predicting lysine ubiquitylation sites in plants. UPFPSR is developed using multiple physicochemical properties of amino acids and sequence-based statistical information. To find a suitable classification algorithm, four traditional algorithms and two deep learning networks are compared, and the random forest, which performed best, is ultimately selected. Extensive benchmarking shows that UPFPSR outperforms the most advanced ubiquitylation prediction tool on each measurement indicator, with an accuracy of 77.3%, precision of 75%, recall of 81.7%, F1-score of 0.7824, and AUC of 0.84 on the independent test dataset. The results indicate that UPFPSR can provide new guidance for further experimental study on ubiquitylation. The data sets and source code used in this study are freely available at https://github.com/ysw-sunshine/UPFPSR.


Subject(s)
Lysine; Software; Algorithms; Computational Biology/methods; Lysine/chemistry; Lysine/metabolism; Protein Processing, Post-Translational; Ubiquitination
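
The F1-score quoted in the UPFPSR abstract above is the harmonic mean of precision and recall. The short sketch below (the function name is illustrative) shows the relationship; plugging in the rounded values from the abstract gives roughly 0.782, which agrees with the reported 0.7824 up to rounding of the inputs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall as reported in the abstract.
reported_f1 = f1_score(0.75, 0.817)
```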
15.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34729589

ABSTRACT

Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.


Subject(s)
Algorithms; Computational Biology; Computational Biology/methods; Supervised Machine Learning
16.
J Bioinform Comput Biol ; 20(1): 2150029, 2022 02.
Article in English | MEDLINE | ID: mdl-34806952

ABSTRACT

O-glycosylation is a protein posttranslational modification important in regulating almost all cells. It is related to a large number of physiological and pathological phenomena. Recognizing O-glycosylation sites is the key to further investigating the molecular mechanism of protein posttranslational modification. This study aimed to collect a reliable dataset on Homo sapiens and develop an O-glycosylation predictor for Homo sapiens, named Captor, through multiple features. A random undersampling method and a synthetic minority oversampling technique were employed to deal with imbalanced data. In addition, the Kruskal-Wallis (K-W) test was adopted to optimize feature vectors and improve the performance of the model. A support vector machine, due to its optimal performance, was used to train and optimize the final prediction model after a comprehensive comparison of various classifiers in traditional machine learning methods and deep learning. On the independent test set, Captor outperformed the existing O-glycosylation tool, suggesting that Captor could provide more instructive guidance for further experimental research on O-glycosylation. The source code and datasets are available at https://github.com/YanZhu06/Captor/.


Subject(s)
Computational Biology; Support Vector Machine; Computational Biology/methods; Glycosylation; Humans; Machine Learning; Software
17.
Math Biosci Eng ; 19(12): 13294-13305, 2022 09 13.
Article in English | MEDLINE | ID: mdl-36654047

ABSTRACT

Regulatory elements in DNA sequences, such as promoters, enhancers, terminators and so on, are essential for gene expression in physiological and pathological processes. A promoter is the specific DNA sequence that is located upstream of the coding gene and acts as the "switch" for gene transcriptional regulation. Lots of promoter predictors have been developed for different bacterial species, but only a few are designed for Pseudomonas aeruginosa, a widespread Gram-negative conditional pathogen in nature. In this work, an ensemble model named SPREAD is proposed for the recognition of promoters in Pseudomonas aeruginosa. In SPREAD, the DNA sequence autoencoder model LSTM is employed to extract potential sequence information, and the mean output probability value of CNN and RF is applied as the final prediction. Compared with G4PromFinder, the only state-of-the-art classifier for promoters in Pseudomonas aeruginosa, SPREAD improves the prediction performance significantly, with an accuracy of 0.98, recall of 0.98, precision of 0.98, specificity of 0.97 and F1-score of 0.98.


Subject(s)
Bacteria; Pseudomonas aeruginosa; Pseudomonas aeruginosa/genetics; Pseudomonas aeruginosa/metabolism; Promoter Regions, Genetic; DNA
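
The final step described in the SPREAD abstract above, taking the mean output probability of the CNN and RF models, can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the published code: the function name is hypothetical, and the 0.5 decision threshold is an assumption that the abstract does not state.

```python
def ensemble_predict(p_cnn, p_rf, threshold=0.5):
    """Average two models' promoter probabilities and threshold the mean."""
    if len(p_cnn) != len(p_rf):
        raise ValueError("both models must score the same sequences")
    mean_probs = [(a + b) / 2 for a, b in zip(p_cnn, p_rf)]
    # Assumption: a sequence is called a promoter when the mean >= threshold.
    labels = [int(p >= threshold) for p in mean_probs]
    return mean_probs, labels
```

Averaging probabilities (soft voting) lets a confident model outweigh an uncertain one, which a plain vote between two classifiers cannot do.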
18.
Mol Ther Nucleic Acids ; 26: 1027-1034, 2021 Dec 03.
Article in English | MEDLINE | ID: mdl-34786208

ABSTRACT

5-Methylcytosine (m5C) is an important post-transcriptional modification that has been extensively found in multiple types of RNAs. Many studies have shown that m5C plays vital roles in many biological functions, such as RNA structure stability and metabolism. Computational approaches act as an efficient way to identify m5C sites from high-throughput RNA sequence data and help interpret the functional mechanism of this important modification. This study proposed a novel species-specific computational approach, Staem5, to accurately predict RNA m5C sites in Mus musculus and Arabidopsis thaliana. Staem5 was developed by employing feature fusion tactics to leverage informatic sequence profiles, and a stacking ensemble learning framework combined five popular machine learning algorithms. Extensive benchmarking tests demonstrated that Staem5 outperformed state-of-the-art approaches in both cross-validation and independent tests. We provide the source code of Staem5, which is publicly available at https://github.com/Cxd-626/Staem5.git.

19.
IEEE/ACM Trans Comput Biol Bioinform ; 18(5): 1937-1945, 2021.
Article in English | MEDLINE | ID: mdl-31804942

ABSTRACT

Lysine formylation is a reversible type of protein post-translational modification and has been found to be involved in a myriad of biological processes, including modulation of chromatin conformation and gene expression in histones and other nuclear proteins. Accurate identification of lysine formylation sites is essential for elucidating the underlying molecular mechanisms of formylation. Traditional experimental methods are time-consuming and expensive. As such, it is desirable and necessary to develop computational methods for accurate prediction of formylation sites. In this study, we propose a novel predictor, termed Formator, for identifying lysine formylation sites from sequence information. Formator is developed using the ensemble learning (EL) strategy based on four individual support vector machine classifiers combined via a voting system. Moreover, the most distant undersampling and Safe-Level-SMOTE oversampling techniques were integrated to deal with the data imbalance problem of the training dataset. Four effective feature extraction methods, namely bi-profile Bayes (BPB), k-nearest neighbor (KNN), amino acid physicochemical properties (AAindex), and composition and transition (CTD), were employed to encode the surrounding sequence features of potential formylation sites. Extensive empirical studies show that Formator achieved accuracies of 87.24 and 74.96 percent on the jackknife test and the independent test, respectively. Performance comparison results on the independent test indicate that Formator outperforms the existing prediction tool LFPred, suggesting that it has great potential to serve as a useful tool for identifying novel lysine formylation sites and facilitating hypothesis-driven experimental efforts.


Subject(s)
Histones; Lysine; Protein Processing, Post-Translational/genetics; Sequence Analysis, Protein/methods; Algorithms; Bayes Theorem; Computational Biology; Histones/chemistry; Histones/genetics; Histones/metabolism; Lysine/chemistry; Lysine/genetics; Lysine/metabolism; Support Vector Machine
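
The Formator abstract above combines four support vector machine classifiers through a voting system but does not say how votes are counted. The sketch below shows one plausible reading, a hard majority vote over binary predictions; the function name and the tie-breaking rule (needed because four voters can split 2-2) are assumptions, not details from the paper.

```python
def majority_vote(votes, tie_label=0):
    """Combine binary votes (0 or 1); return tie_label on an even split."""
    positives = sum(votes)
    negatives = len(votes) - positives
    if positives > negatives:
        return 1
    if negatives > positives:
        return 0
    # Assumption: ties default to the negative class unless told otherwise.
    return tie_label
```

For example, `majority_vote([1, 1, 1, 0])` calls the site formylated, while a 2-2 split falls back to `tie_label`.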
20.
Brief Bioinform ; 22(2): 2073-2084, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32227075

ABSTRACT

The development of deep sequencing technologies has led to the discovery of novel transcripts. Many in silico methods have been developed to assess the coding potential of these transcripts to further investigate their functions. Existing methods perform well on distinguishing majority long noncoding RNAs (lncRNAs) and coding RNAs (mRNAs) but poorly on RNAs with small open reading frames (sORFs). Here, we present DeepCPP (deep neural network for coding potential prediction), a deep learning method for RNA coding potential prediction. Extensive evaluations on four previous datasets and six new datasets constructed in different species show that DeepCPP outperforms other state-of-the-art methods, especially on sORF type data, which overcomes the bottleneck of sORF mRNA identification by improving more than 4.31, 37.24 and 5.89% on its accuracy for newly discovered human, vertebrate and insect data, respectively. Additionally, we also revealed that discontinuous k-mer, and our newly proposed nucleotide bias and minimal distribution similarity feature selection method play crucial roles in this classification problem. Taken together, DeepCPP is an effective method for RNA coding potential prediction.


Subject(s)
Deep Learning; Neural Networks, Computer; Animals; Humans; Open Reading Frames; RNA, Long Noncoding/genetics; RNA, Messenger/genetics