Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 82
Filtrar
1.
Brief Bioinform ; 24(6)2023 09 22.
Artigo em Inglês | MEDLINE | ID: mdl-37874948

RESUMO

Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualized, developed, tested and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at http://prosperousplus.unimelb-biotools.cloud.edu.au/.


Assuntos
Aprendizado de Máquina , Peptídeo Hidrolases , Peptídeo Hidrolases/metabolismo , Especificidade por Substrato , Algoritmos
2.
Brief Bioinform ; 24(3)2023 05 19.
Artigo em Inglês | MEDLINE | ID: mdl-37150785

RESUMO

A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.


Assuntos
Drosophila melanogaster , Edição de RNA , Animais , Camundongos , Humanos , Drosophila melanogaster/genética , Drosophila melanogaster/metabolismo , RNA/genética , Adenosina/genética , Adenosina/metabolismo , Inosina/genética , Inosina/metabolismo
3.
Brief Bioinform ; 24(4)2023 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-37291763

RESUMO

BACKGROUND: Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for synthesizing the gene-encoded products by bacteria to grow and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed specifically for a particular species. To date, only a few predictors are available for identifying general bacterial promoters with limited predictive performance. RESULTS: In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with the attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/.


Assuntos
Bactérias , Redes Neurais de Computação , Bactérias/genética , Bactérias/metabolismo , RNA Polimerases Dirigidas por DNA/genética , RNA Polimerases Dirigidas por DNA/metabolismo , Sequência de Bases , Regiões Promotoras Genéticas
4.
Brief Bioinform ; 24(2)2023 03 19.
Artigo em Inglês | MEDLINE | ID: mdl-36880172

RESUMO

Lysine 2-hydroxyisobutylation (Khib), which was first reported in 2014, has been shown to play vital roles in a myriad of biological processes including gene transcription, regulation of chromatin functions, purine metabolism, pentose phosphate pathway and glycolysis/gluconeogenesis. Identification of Khib sites in protein substrates represents an initial but crucial step in elucidating the molecular mechanisms underlying protein 2-hydroxyisobutylation. Experimental identification of Khib sites mainly depends on the combination of liquid chromatography and mass spectrometry. However, experimental approaches for identifying Khib sites are often time-consuming and expensive compared with computational approaches. Previous studies have shown that Khib sites may have distinct characteristics for different cell types of the same species. Several tools have been developed to identify Khib sites, which exhibit high diversity in their algorithms, encoding schemes and feature selection techniques. However, to date, there are no tools designed for predicting cell type-specific Khib sites. Therefore, it is highly desirable to develop an effective predictor for cell type-specific Khib site prediction. Inspired by the residual connection of ResNet, we develop a deep learning-based approach, termed ResNetKhib, which leverages both the one-dimensional convolution and transfer learning to enable and improve the prediction of cell type-specific 2-hydroxyisobutylation sites. ResNetKhib is capable of predicting Khib sites for four human cell types, mouse liver cell and three rice cell types. Its performance is benchmarked against the commonly used random forest (RF) predictor on both 10-fold cross-validation and independent tests. The results show that ResNetKhib achieves the area under the receiver operating characteristic curve values ranging from 0.807 to 0.901, depending on the cell type and species, which performs better than RF-based predictors and other currently available Khib site prediction tools. We also implement an online web server of the proposed ResNetKhib algorithm together with all the curated datasets and trained model for the wider research community to use, which is publicly accessible at https://resnetkhib.erc.monash.edu/.


Assuntos
Lisina , Processamento de Proteína Pós-Traducional , Animais , Camundongos , Humanos , Lisina/metabolismo , Proteínas/metabolismo , Algoritmos , Aprendizado de Máquina
5.
Brief Bioinform ; 24(1)2023 01 19.
Artigo em Inglês | MEDLINE | ID: mdl-36528806

RESUMO

Determining the pathogenicity and functional impact (i.e. gain-of-function; GOF or loss-of-function; LOF) of a variant is vital for unraveling the genetic level mechanisms of human diseases. To provide a 'one-stop' framework for the accurate identification of pathogenicity and functional impact of variants, we developed a two-stage deep-learning-based computational solution, termed VPatho, which was trained using a total of 9619 pathogenic GOF/LOF and 138 026 neutral variants curated from various databases. A total number of 138 variant-level, 262 protein-level and 103 genome-level features were extracted for constructing the models of VPatho. The development of VPatho consists of two stages: (i) a random under-sampling multi-scale residual neural network (ResNet) with a newly defined weighted-loss function (RUS-Wg-MSResNet) was proposed to predict variants' pathogenicity on the gnomAD_NV + GOF/LOF dataset; and (ii) an XGBOD model was constructed to predict the functional impact of the given variants. Benchmarking experiments demonstrated that RUS-Wg-MSResNet achieved the highest prediction performance with the weights calculated based on the ratios of neutral versus pathogenic variants. Independent tests showed that both RUS-Wg-MSResNet and XGBOD achieved outstanding performance. Moreover, assessed using variants from the CAGI6 competition, RUS-Wg-MSResNet achieved superior performance compared to state-of-the-art predictors. The fine-trained XGBOD models were further used to blind test the whole LOF data downloaded from gnomAD and accordingly, we identified 31 nonLOF variants that were previously labeled as LOF/uncertain variants. As an implementation of the developed approach, a webserver of VPatho is made publicly available at http://csbio.njust.edu.cn/bioinf/vpatho/ to facilitate community-wide efforts for profiling and prioritizing the query variants with respect to their pathogenicity and functional impact.


Assuntos
Aprendizado Profundo , Humanos , Mutação com Ganho de Função , Genoma
6.
Brief Bioinform ; 24(4)2023 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-37369638

RESUMO

Antimicrobial peptides (AMPs) are short peptides that play crucial roles in diverse biological processes and have various functional activities against target organisms. Due to the abuse of chemical antibiotics and microbial pathogens' increasing resistance to antibiotics, AMPs have the potential to be alternatives to antibiotics. As such, the identification of AMPs has become a widely discussed topic. A variety of computational approaches have been developed to identify AMPs based on machine learning algorithms. However, most of them are not capable of predicting the functional activities of AMPs, and those predictors that can specify activities only focus on a few of them. In this study, we first surveyed 10 predictors that can identify AMPs and their functional activities in terms of the features they employed and the algorithms they utilized. Then, we constructed comprehensive AMP datasets and proposed a new deep learning-based framework, iAMPCN (identification of AMPs based on CNNs), to identify AMPs and their related 22 functional activities. Our experiments demonstrate that iAMPCN significantly improved the prediction performance of AMPs and their corresponding functional activities based on four types of sequence features. Benchmarking experiments on the independent test datasets showed that iAMPCN outperformed a number of state-of-the-art approaches for predicting AMPs and their functional activities. Furthermore, we analyzed the amino acid preferences of different AMP activities and evaluated the model on datasets of varying sequence redundancy thresholds. To facilitate the community-wide identification of AMPs and their corresponding functional types, we have made the source codes of iAMPCN publicly available at https://github.com/joy50706/iAMPCN/tree/master. We anticipate that iAMPCN can be explored as a valuable tool for identifying potential AMPs with specific functional activities for further experimental validation.


Assuntos
Peptídeos Catiônicos Antimicrobianos , Aprendizado Profundo , Peptídeos Catiônicos Antimicrobianos/farmacologia , Peptídeos Antimicrobianos , Antibacterianos , Algoritmos
7.
Brief Bioinform ; 23(4)2022 07 18.
Artigo em Inglês | MEDLINE | ID: mdl-35649392

RESUMO

RNA binding proteins (RBPs) are critical for the post-transcriptional control of RNAs and play vital roles in a myriad of biological processes, such as RNA localization and gene regulation. Therefore, computational methods that are capable of accurately identifying RBPs are highly desirable and have important implications for biomedical and biotechnological applications. Here, we propose a two-stage deep transfer learning-based framework, termed RBP-TSTL, for accurate prediction of RBPs. In the first stage, the knowledge from the self-supervised pre-trained model was extracted as feature embeddings and used to represent the protein sequences, while in the second stage, a customized deep learning model was initialized based on an annotated pre-training RBPs dataset before being fine-tuned on each corresponding target species dataset. This two-stage transfer learning framework can enable the RBP-TSTL model to be effectively trained to learn and improve the prediction performance. Extensive performance benchmarking of the RBP-TSTL models trained using the features generated by the self-supervised pre-trained model and other models trained using hand-crafting encoding features demonstrated the effectiveness of the proposed two-stage knowledge transfer strategy based on the self-supervised pre-trained models. Using the best-performing RBP-TSTL models, we further conducted genome-scale RBP predictions for Homo sapiens, Arabidopsis thaliana, Escherichia coli, and Salmonella and established a computational compendium containing all the predicted putative RBPs candidates. We anticipate that the proposed RBP-TSTL approach will be explored as a useful tool for the characterization of RNA-binding proteins and exploration of their sequence-structure-function relationships.


Assuntos
Proteínas de Ligação a RNA , RNA , Sítios de Ligação/genética , Genoma , Humanos , Aprendizado de Máquina , RNA/química , Proteínas de Ligação a RNA/metabolismo , Análise de Sequência de RNA/métodos
8.
Brief Bioinform ; 23(2)2022 03 10.
Artigo em Inglês | MEDLINE | ID: mdl-35176756

RESUMO

Protein secretion has a pivotal role in many biological processes and is particularly important for intercellular communication, from the cytoplasm to the host or external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. The non-classical secretion pathway has recently received increasing attention among these secretion pathways, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins lacking signal peptides and motifs. Several NCSP predictors have been proposed to identify NCSPs and most of them employed the whole amino acid sequence of NCSPs to construct the model. However, the sequence length of different proteins varies greatly. In addition, not all regions of the protein are equally important and some local regions are not relevant to the secretion. The functional regions of the protein, particularly in the N- and C-terminal regions, contain important determinants for secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model; 5-fold cross-validation and independent tests demonstrate that ASPIRER achieves superior performance than existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequences information and prioritization of candidate proteins for follow-up experimental validation.


Assuntos
Aprendizado Profundo , Sequência de Aminoácidos , Biologia Computacional , Redes Neurais de Computação , Proteínas/química , Software
9.
Brief Bioinform ; 23(2)2022 03 10.
Artigo em Inglês | MEDLINE | ID: mdl-35021193

RESUMO

Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.


Assuntos
Drosophila melanogaster , Eucariotos , Animais , Biologia Computacional/métodos , Drosophila melanogaster/genética , Células Eucarióticas , Camundongos , Células Procarióticas , Regiões Promotoras Genéticas
10.
Brief Bioinform ; 23(6)2022 11 19.
Artigo em Inglês | MEDLINE | ID: mdl-36341591

RESUMO

Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding of the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Despite several computational methods that have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for the multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated Clarion achieved competitive predictive performance and the weighted series method plays a crucial role in securing superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracy of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion is freely available at http://monash.bioweb.cloud.edu.au/Clarion/.


Assuntos
Núcleo Celular , Proteínas , RNA Mensageiro/genética , Núcleo Celular/genética , Biologia Computacional/métodos , Bases de Dados de Proteínas
11.
Brief Bioinform ; 23(1)2022 01 17.
Artigo em Inglês | MEDLINE | ID: mdl-34729589

RESUMO

Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.


Assuntos
Algoritmos , Biologia Computacional , Biologia Computacional/métodos , Aprendizado de Máquina Supervisionado
12.
J Chem Inf Model ; 64(4): 1407-1418, 2024 Feb 26.
Artigo em Inglês | MEDLINE | ID: mdl-38334115

RESUMO

Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is one of the crucial issues in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture: the first stage is designed to fuse different feature embeddings through a transformer encoder. In the second stage, a support vector machine model is employed to quantify the pathogenicity of SAVs after dimensionality reduction. The prediction performance of TransEFVP on blind test data achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, higher than the existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and codes for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.


Assuntos
Algoritmos , Proteínas , Humanos , Proteínas/química , Sequência de Aminoácidos , Redes Neurais de Computação , Aminoácidos
13.
Environ Sci Technol ; 58(10): 4662-4669, 2024 Mar 12.
Artigo em Inglês | MEDLINE | ID: mdl-38422482

RESUMO

Since the mass production and extensive use of chloroquine (CLQ) would lead to its inevitable discharge, wastewater treatment plants (WWTPs) might play a key role in the management of CLQ. Despite the reported functional versatility of ammonia-oxidizing bacteria (AOB) that mediate the first step for biological nitrogen removal at WWTP (i.e., partial nitrification), their potential capability to degrade CLQ remains to be discovered. Therefore, with the enriched partial nitrification sludge, a series of dedicated batch tests were performed in this study to verify the performance and mechanisms of CLQ biodegradation under the ammonium conditions of mainstream wastewater. The results showed that AOB could degrade CLQ in the presence of ammonium oxidation activity, but the capability was limited by the amount of partial nitrification sludge (∼1.1 mg/L at a mixed liquor volatile suspended solids concentration of 200 mg/L). CLQ and its biodegradation products were found to have no significant effect on the ammonium oxidation activity of AOB while the latter would promote N2O production through the AOB denitrification pathway, especially at relatively low DO levels (≤0.5 mg-O2/L). This study provided valuable insights into a more comprehensive assessment of the fate of CLQ in the context of wastewater treatment.


Assuntos
Amônia , Compostos de Amônio , Amônia/metabolismo , Esgotos/microbiologia , Bactérias/metabolismo , Reatores Biológicos/microbiologia , Oxirredução , Óxido Nitroso/análise , Nitrificação , Compostos de Amônio/metabolismo
14.
Nucleic Acids Res ; 50(W1): W434-W447, 2022 07 05.
Artigo em Inglês | MEDLINE | ID: mdl-35524557

RESUMO

The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.


Assuntos
Biologia Computacional , Ligantes , Software , Proteínas
15.
Angew Chem Int Ed Engl ; 63(21): e202401189, 2024 May 21.
Artigo em Inglês | MEDLINE | ID: mdl-38506220

RESUMO

This study introduces a novel approach for synthesizing Benzoxazine-centered Polychiral Polyheterocycles (BPCPHCs) via an innovative asymmetric carbene-alkyne metathesis-triggered cascade. Overcoming challenges associated with intricate stereochemistry and multiple chiral centers, the catalytic asymmetric Carbene Alkyne Metathesis-mediated Cascade (CAMC) is employed using dirhodium catalyst/Brønsted acid co-catalysis, ensuring precise stereo control as validated by X-ray crystallography. Systematic substrate scope evaluation establishes exceptional diastereo- and enantioselectivities, creating a unique library of BPCPHCs. Pharmacological exploration identifies twelve BPCPHCs as potent Nav ion channel blockers, notably compound 8 g. In vivo studies demonstrate that intrathecal injection of 8 g effectively reverses mechanical hyperalgesia associated with chemotherapy-induced peripheral neuropathy (CIPN), suggesting a promising therapeutic avenue. Electrophysiological investigations unveil the inhibitory effects of 8 g on Nav1.7 currents. Molecular docking, dynamics simulations and surface plasmon resonance (SPR) assay provide insights into the stable complex formation and favorable binding free energy of 8 g with C5aR1. This research represents a significant advancement in asymmetric CAMC for BPCPHCs and unveils BPCPHC 8 g as a promising, uniquely acting pain blocker, establishing a C5aR1-Nav1.7 connection in the context of CIPN.


Assuntos
Alcinos , Benzoxazinas , Metano , Metano/análogos & derivados , Metano/química , Metano/farmacologia , Alcinos/química , Benzoxazinas/química , Benzoxazinas/farmacologia , Benzoxazinas/síntese química , Compostos Heterocíclicos/química , Compostos Heterocíclicos/farmacologia , Compostos Heterocíclicos/síntese química , Humanos , Estereoisomerismo , Analgésicos/química , Analgésicos/farmacologia , Analgésicos/síntese química , Estrutura Molecular , Catálise , Descoberta de Drogas , Animais
16.
Brief Bioinform ; 22(4)2021 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-33227813

RESUMO

A promoter is a region in the DNA sequence that defines where the transcription of a gene by RNA polymerase initiates, which is typically located proximal to the transcription start site (TSS). How to correctly identify the gene TSS and the core promoter is essential for our understanding of the transcriptional regulation of genes. As a complement to conventional experimental methods, computational techniques with easy-to-use platforms as essential bioinformatics tools can be effectively applied to annotate the functions and physiological roles of promoters. In this work, we propose a deep learning-based method termed Depicter (Deep learning for predicting promoter), for identifying three specific types of promoters, i.e. promoter sequences with the TATA-box (TATA model), promoter sequences without the TATA-box (non-TATA model), and indistinguishable promoters (TATA and non-TATA model). Depicter is developed based on an up-to-date, species-specific dataset which includes Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana promoters. A convolutional neural network coupled with capsule layers is proposed to train and optimize the prediction model of Depicter. Extensive benchmarking and independent tests demonstrate that Depicter achieves an improved predictive performance compared with several state-of-the-art methods. The webserver of Depicter is implemented and freely accessible at https://depicter.erc.monash.edu/.


Assuntos
Bases de Dados de Ácidos Nucleicos , Redes Neurais de Computação , Regiões Promotoras Genéticas , Análise de Sequência de DNA , Software , Transcrição Gênica , Animais , Arabidopsis , Biologia Computacional , Drosophila melanogaster , Humanos , Camundongos
17.
Brief Bioinform ; 22(4)2021 07 20.
Artigo em Inglês | MEDLINE | ID: mdl-33316035

RESUMO

Anti-cancer peptides (ACPs) are known as potential therapeutics for cancer. Due to their unique ability to target cancer cells without affecting healthy cells directly, they have been extensively studied. Many peptide-based drugs are currently evaluated in the preclinical and clinical trials. Accurate identification of ACPs has received considerable attention in recent years; as such, a number of machine learning-based methods for in silico identification of ACPs have been developed. These methods promote the research on the mechanism of ACPs therapeutics against cancer to some extent. There is a vast difference in these methods in terms of their training/testing datasets, machine learning algorithms, feature encoding schemes, feature selection methods and evaluation strategies used. Therefore, it is desirable to summarize the advantages and disadvantages of the existing methods, provide useful insights and suggestions for the development and improvement of novel computational tools to characterize and identify ACPs. With this in mind, we firstly comprehensively investigate 16 state-of-the-art predictors for ACPs in terms of their core algorithms, feature encoding schemes, performance evaluation metrics and webserver/software usability. Then, comprehensive performance assessment is conducted to evaluate the robustness and scalability of the existing predictors using a well-prepared benchmark dataset. We provide potential strategies for the model performance improvement. Moreover, we propose a novel ensemble learning framework, termed ACPredStackL, for the accurate identification of ACPs. ACPredStackL is developed based on the stacking ensemble strategy combined with SVM, Naïve Bayesian, lightGBM and KNN. Empirical benchmarking experiments against the state-of-the-art methods demonstrate that ACPredStackL achieves a comparative performance for predicting ACPs. The webserver and source code of ACPredStackL is freely available at http://bigdata.biocie.cn/ACPredStackL/ and https://github.com/liangxiaoq/ACPredStackL, respectively.


Assuntos
Antineoplásicos , Aprendizado de Máquina , Neoplasias , Software , Antineoplásicos/química , Antineoplásicos/uso terapêutico , Humanos , Neoplasias/tratamento farmacológico , Neoplasias/genética , Peptídeos/química , Peptídeos/genética , Peptídeos/uso terapêutico
18.
Brief Bioinform ; 22(3)2021 05 20.
Artigo em Inglês | MEDLINE | ID: mdl-32459334

RESUMO

In recent years, high-throughput experimental techniques have significantly enhanced the accuracy and coverage of protein-protein interaction identification, including human-pathogen protein-protein interactions (HP-PPIs). Despite this progress, experimental methods are, in general, expensive in terms of both time and labour costs, especially considering that there are enormous amounts of potential protein-interacting partners. Developing computational methods to predict interactions between human and bacteria pathogen has thus become critical and meaningful, in both facilitating the detection of interactions and mining incomplete interaction maps. In this paper, we present a systematic evaluation of machine learning-based computational methods for human-bacterium protein-protein interactions (HB-PPIs). We first reviewed a vast number of publicly available databases of HP-PPIs and then critically evaluate the availability of these databases. Benefitting from its well-structured nature, we subsequently preprocess the data and identified six bacterium pathogens that could be used to study bacterium subjects in which a human was the host. Additionally, we thoroughly reviewed the literature on 'host-pathogen interactions' whereby existing models were summarized that we used to jointly study the impact of different feature representation algorithms and evaluate the performance of existing machine learning computational models. Owing to the abundance of sequence information and the limited scale of other protein-related information, we adopted the primary protocol from the literature and dedicated our analysis to a comprehensive assessment of sequence information and machine learning models. A systematic evaluation of machine learning models and a wide range of feature representation algorithms based on sequence information are presented as a comparison survey towards the prediction performance evaluation of HB-PPIs.


Assuntos
Interações Hospedeiro-Patógeno , Aprendizado de Máquina , Mapeamento de Interação de Proteínas/métodos , Algoritmos , Biologia Computacional/métodos , Humanos
19.
Brief Bioinform ; 22(3)2021 05 20.
Artigo em Inglês | MEDLINE | ID: mdl-32436937

RESUMO

X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystalize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of Matthew's correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.


Assuntos
Biologia Computacional/métodos , Cristalização/métodos , Proteínas/química , Sequência de Aminoácidos , Cristalografia por Raios X , Bases de Dados de Proteínas , Modelos Químicos
20.
Brief Bioinform ; 22(6)2021 11 05.
Artigo em Inglês | MEDLINE | ID: mdl-34226915

RESUMO

Pseudouridine is a ubiquitous RNA modification type present in eukaryotes and prokaryotes, which plays a vital role in various biological processes. Almost all kinds of RNAs are subject to this modification. However, it remains a great challenge to identify pseudouridine sites via experimental approaches, requiring expensive and time-consuming experimental research. Therefore, computational approaches that can be used to perform accurate in silico identification of pseudouridine sites from the large amount of RNA sequence data are highly desirable and can aid in the functional elucidation of this critical modification. Here, we propose a new computational approach, termed Porpoise, to accurately identify pseudouridine sites from RNA sequence data. Porpoise builds upon a comprehensive evaluation of 18 frequently used feature encoding schemes based on the selection of four types of features, including binary features, pseudo k-tuple composition, nucleotide chemical property and position-specific trinucleotide propensity based on single-strand (PSTNPss). The selected features are fed into the stacked ensemble learning framework to enable the construction of an effective stacked model. Both cross-validation tests on the benchmark dataset and independent tests show that Porpoise achieves superior predictive performance than several state-of-the-art approaches. The application of model interpretation tools demonstrates the importance of PSTNPs for the performance of the trained models. This new method is anticipated to facilitate community-wide efforts to identify putative pseudouridine sites and formulate novel testable biological hypothesis.


Assuntos
Biologia Computacional/métodos , Pseudouridina/química , RNA/química , RNA/genética , Algoritmos , Aprendizado de Máquina , Pseudouridina/genética , Reprodutibilidade dos Testes , Análise de Sequência de RNA/métodos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA