1.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34415020

ABSTRACT

Efforts to elucidate protein-DNA interactions at the molecular level rely in part on accurate predictions of DNA-binding residues in protein sequences. While there are over a dozen computational predictors of the DNA-binding residues, they are DNA-type agnostic and significantly cross-predict residues that interact with other ligands as DNA binding. We leverage a custom-designed machine learning architecture to introduce DNAgenie, a first-of-its-kind predictor of residues that interact with A-DNA, B-DNA and single-stranded DNA. DNAgenie uses a comprehensive physicochemical profile extracted from an input protein sequence and implements a two-step refinement process to provide accurate predictions and to minimize the cross-predictions. Comparative tests on an independent test dataset demonstrate that DNAgenie outperforms the current methods that we adapt to predict residue-level interactions with the three DNA types. Further analysis finds that the use of the second (refinement) step leads to a substantial reduction in the cross-predictions. Empirical tests show that DNAgenie's outputs that are converted to coarse-grained protein-level predictions compare favorably against recent tools that predict which DNA-binding proteins interact with double-stranded versus single-stranded DNAs. Moreover, predictions from the sequences of the whole human proteome reveal that the results produced by DNAgenie substantially overlap with the known DNA-binding proteins while also including promising leads for several hundred previously unknown putative DNA binders. These results suggest that DNAgenie is a valuable tool for the sequence-based characterization of protein functions. The DNAgenie webserver is available at http://biomine.cs.vcu.edu/servers/DNAgenie/.


Subjects
Base Sequence , Binding Sites , Computational Biology/methods , DNA-Binding Proteins/metabolism , DNA/chemistry , Software , Amino Acid Sequence , DNA/genetics , DNA-Binding Proteins/chemistry , Genetic Databases , Machine Learning , Molecular Models , Protein Binding , Reproducibility of Results , Structure-Activity Relationship , Web Browser
2.
Nucleic Acids Res ; 49(D1): D298-D308, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33119734

ABSTRACT

We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accessed via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can also be downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included in DescribePROT are useful for a broad range of studies, including investigations of protein function, applied projects focusing on therapeutics and diseases, and the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.


Subjects
Amino Acids/chemistry , Protein Databases , Genome , Proteins/genetics , Proteome/genetics , Software , Amino Acid Sequence , Amino Acids/metabolism , Animals , Archaea/genetics , Archaea/metabolism , Bacteria/genetics , Bacteria/metabolism , Binding Sites , Conserved Sequence , Fungi/genetics , Fungi/metabolism , Humans , Internet , Plants/genetics , Plants/metabolism , Prokaryotic Cells/metabolism , Protein Binding , Secondary Protein Structure , Proteins/chemistry , Proteins/classification , Proteins/metabolism , Proteome/chemistry , Proteome/metabolism , Protein Sequence Analysis , Viruses/genetics , Viruses/metabolism
3.
Brief Bioinform ; 21(5): 1509-1522, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31616935

ABSTRACT

Experimental annotations of intrinsic disorder are available for 0.1% of the 147 000 000 currently sequenced proteins. Over 60 sequence-based disorder predictors were developed to help bridge this gap. Current benchmarks of these methods assess predictive performance on datasets of proteins; however, predictions are often interpreted for individual proteins. We demonstrate that the protein-level predictive performance varies substantially from the dataset-level benchmarks. Thus, we perform a first-of-its-kind protein-level assessment for 13 popular disorder predictors using 6200 disorder-annotated proteins. We show that the protein-level distributions are substantially skewed toward high predictive quality while having long tails of poor predictions. Consequently, between 57% and 75% of proteins secure higher predictive performance than the currently used dataset-level assessment suggests, but as many as 30% of proteins that are located in the long tails suffer low predictive performance. These proteins typically have relatively high amounts of disorder, in contrast to the mostly structured proteins that are predicted accurately by all 13 methods. Interestingly, each predictor provides the most accurate results for some number of proteins, while the method that performs best at the dataset level is in fact the best for only about 30% of proteins. Moreover, the majority of proteins are predicted more accurately than the dataset-level performance of the most accurate tool by at least four disorder predictors. While these results suggest that disorder predictors outperform their current benchmark performance for the majority of proteins and that they complement each other, novel tools that accurately identify the hard-to-predict proteins and that make accurate predictions for these proteins are needed.


Subjects
Intrinsically Disordered Proteins/chemistry , Algorithms , Computational Biology/methods , X-Ray Crystallography , Protein Databases , Datasets as Topic , Biomolecular Nuclear Magnetic Resonance , Protein Conformation , Protein Sequence Analysis/methods
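
The distinction drawn in entry 3 between dataset-level and protein-level assessment can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the choice of AUC as the metric, the function names, and the toy labels and scores are all assumptions made for illustration. The dataset-level value pools the residues of every protein before scoring, while the protein-level values score each protein separately.

```python
from typing import List, Sequence, Tuple

# One protein = (per-residue labels, per-residue predicted propensities).
# Labels: 1 = disordered residue, 0 = structured residue (hypothetical toy data).
Protein = Tuple[Sequence[int], Sequence[float]]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) residue pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative residue")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def dataset_level_auc(proteins: List[Protein]) -> float:
    """Pool the residues of all proteins, then compute a single AUC."""
    labels = [y for ys, _ in proteins for y in ys]
    scores = [s for _, ss in proteins for s in ss]
    return auc(labels, scores)


def protein_level_aucs(proteins: List[Protein]) -> List[float]:
    """Compute one AUC per protein."""
    return [auc(ys, ss) for ys, ss in proteins]


if __name__ == "__main__":
    toy = [
        ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),  # predicted well
        ([0, 1, 1, 1], [0.6, 0.5, 0.7, 0.4]),  # mostly disordered, predicted poorly
    ]
    print("dataset-level AUC:", dataset_level_auc(toy))
    print("protein-level AUCs:", protein_level_aucs(toy))
```

In this toy case the pooled AUC sits between the two per-protein values, so a single dataset-level number hides both the well-predicted and the poorly predicted protein.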
4.
Bioinformatics ; 38(1): 115-124, 2021 12 22.
Article in English | MEDLINE | ID: mdl-34487138

ABSTRACT

MOTIVATION: Intrinsically disordered protein regions interact with proteins, nucleic acids and lipids. Regions that bind lipids are implicated in a wide spectrum of cellular functions and several human diseases. Motivated by the growing amount of experimental data for these interactions and the lack of tools that can predict them from the protein sequence, we develop DisoLipPred, the first predictor of the disordered lipid-binding residues (DLBRs). RESULTS: DisoLipPred relies on a deep bidirectional recurrent network that implements three innovative features: transfer learning, a bypass module that sidesteps predictions for putative structured residues, and expanded inputs that cover physicochemical properties associated with the protein-lipid interactions. Ablation analysis shows that these features drive predictive quality of DisoLipPred. Tests on an independent test dataset and the yeast proteome reveal that DisoLipPred generates accurate results and that none of the related existing tools can be used to indirectly identify DLBRs. We also show that DisoLipPred's predictions complement the results generated by predictors of the transmembrane regions. Altogether, we conclude that DisoLipPred provides high-quality predictions of DLBRs that complement the currently available methods. AVAILABILITY AND IMPLEMENTATION: DisoLipPred's webserver is available at http://biomine.cs.vcu.edu/servers/DisoLipPred/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Intrinsically Disordered Proteins , Humans , Computational Biology/methods , Amino Acid Sequence , Intrinsically Disordered Proteins/chemistry , Machine Learning , Lipids
5.
Cell Mol Life Sci ; 78(5): 2371-2385, 2021 Mar.
Article in English | MEDLINE | ID: mdl-32997198

ABSTRACT

Intrinsic disorder can be found in all proteomes of all kingdoms of life and in viruses, being particularly prevalent in the eukaryotes. We conduct a comprehensive analysis of the intrinsic disorder in the human proteins while mapping them into 24 compartments of the human cell. In agreement with previous studies, we show that human proteins are significantly enriched in disorder relative to a generic protein set that represents the protein universe. In fact, the fraction of proteins with long disordered regions and the average protein-level disorder content in the human proteome are about 3 times higher than in the protein universe. Furthermore, levels of intrinsic disorder in the majority of human subcellular compartments significantly exceed the average disorder content in the protein universe. Relative to the overall amount of disorder in the human proteome, proteins localized in the nucleus and cytoskeleton have significantly increased amounts of disorder, measured by both high disorder content and presence of multiple long intrinsically disordered regions. We empirically demonstrate that, on average, human proteins are assigned to 2.3 subcellular compartments, with proteins localized to few subcellular compartments being more disordered than the proteins that are localized to many compartments. Functionally, the disordered proteins localized in the most disorder-enriched subcellular compartments are primarily responsible for interactions with nucleic acids and protein partners. This is the first time that disorder is comprehensively mapped into the human cell. Our observations add a missing piece to the puzzle of functional disorder and its organization inside the cell.


Subjects
Cell Compartmentation , Eukaryotic Cells/metabolism , Intracellular Space/metabolism , Intrinsically Disordered Proteins/metabolism , Proteome/metabolism , Cell Nucleus/metabolism , Cytoskeleton/metabolism , Protein Databases/statistics & numerical data , Humans , Intrinsically Disordered Proteins/classification , Biological Models , Proteome/classification
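
Entry 5 measures disorder by protein-level disorder content and by the presence of long intrinsically disordered regions. A minimal Python sketch of both quantities follows, assuming a binary per-residue prediction string ('1' = disordered, '0' = structured). The 30-residue cutoff for a "long" region is a commonly used convention and is an assumption here, since the abstract does not state its threshold; the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def disorder_content(pred: str) -> float:
    """Fraction of residues predicted as disordered ('1') in a binary string."""
    if not pred:
        raise ValueError("empty prediction string")
    return pred.count("1") / len(pred)


def long_disordered_regions(pred: str, min_len: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 0-based with end exclusive, for every run of
    consecutive disordered residues at least min_len long.
    min_len=30 is an assumed convention, not a value taken from the abstract."""
    regions = []
    pos = 0
    for state, run in groupby(pred):
        n = len(list(run))
        if state == "1" and n >= min_len:
            regions.append((pos, pos + n))
        pos += n
    return regions


if __name__ == "__main__":
    # Hypothetical 60-residue protein: one 35-residue and one 10-residue disordered run.
    pred = "0" * 10 + "1" * 35 + "0" * 5 + "1" * 10
    print("disorder content:", disorder_content(pred))
    print("long regions:", long_disordered_regions(pred))
```

Averaging `disorder_content` over the proteins assigned to a compartment would give a compartment-level figure comparable in spirit to those discussed in the abstract.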
6.
Comput Struct Biotechnol J ; 19: 2597-2606, 2021.
Article in English | MEDLINE | ID: mdl-34025946

ABSTRACT

A recent advance in the disorder prediction field is the development of the quality assessment (QA) scores. QA scores complement the propensities produced by the disorder predictors by identifying regions where these predictions are more likely to be correct. We develop, empirically test and release a new QA tool, QUARTERplus, that addresses several key drawbacks of the current QA method, QUARTER. QUARTERplus is the first solution that utilizes QA scores and the associated input disorder predictions to produce very accurate disorder predictions with the help of a modern deep learning meta-model. The deep neural network utilizes the QA scores to identify and fix the regions where the original/input disorder predictions are poor. More importantly, the accurate QUARTERplus predictions are accompanied by easy-to-interpret residue-level QA scores that reliably quantify their residue-level predictive quality. We provide these interpretable QA scores for QUARTERplus and 10 other popular disorder predictors. Empirical tests on a large and independent (low similarity) test dataset show that QUARTERplus predictions secure AUC = 0.93 and are statistically more accurate than the results of twelve state-of-the-art disorder predictors. We also demonstrate that the new QA scores produced by QUARTERplus are highly correlated with the actual predictive quality and that they can be effectively used to identify regions of correct disorder predictions. This feature empowers the users to easily identify which parts of the predictions generated by the modern disorder predictors are more trustworthy. QUARTERplus is available as a convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTERplus/.

7.
Nat Commun ; 12(1): 4438, 2021 07 21.
Article in English | MEDLINE | ID: mdl-34290238

ABSTRACT

Identification of intrinsic disorder in proteins relies in large part on computational predictors, which demands that their accuracy should be high. Since intrinsic disorder carries out a broad range of cellular functions, it is desirable to couple the disorder and disorder function predictions. We report a computational tool, flDPnn, that provides accurate, fast and comprehensive disorder and disorder function predictions from protein sequences. The recent Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment and results on other test datasets demonstrate that flDPnn offers accurate predictions of disorder, fully disordered proteins and four common disorder functions. These predictions are substantially better than the results of the existing disorder predictors and methods that predict functions of disorder. Ablation tests reveal that the high predictive performance stems from innovative ways used in flDPnn to derive sequence profiles and encode inputs. flDPnn's webserver is available at http://biomine.cs.vcu.edu/servers/flDPnn/.


Subjects
Computational Biology/methods , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/metabolism , Machine Learning , Protein Binding , Protein Sequence Analysis
8.
J Mol Biol ; 433(21): 167229, 2021 10 15.
Article in English | MEDLINE | ID: mdl-34487791

ABSTRACT

Although RNA-binding proteins (RBPs) are known to be enriched in intrinsic disorder, no previous analysis has focused on RBPs interacting with specific RNA types. We fill this gap with a comprehensive analysis of the putative disorder in RBPs binding to six common RNA types: messenger RNA (mRNA), transfer RNA (tRNA), small nuclear RNA (snRNA), non-coding RNA (ncRNA), ribosomal RNA (rRNA), and internal ribosome RNA (irRNA). We also analyze the amount of putative intrinsic disorder in the RNA-binding domains (RBDs) and non-RNA-binding-domain regions (non-RBD regions). Consistent with previous studies, we show that in comparison with the human proteome, RBPs are significantly enriched in disorder. However, closer examination finds significant enrichment in predicted disorder for the mRNA-, rRNA- and snRNA-binding proteins, while the proteins that interact with ncRNA and irRNA are not enriched in disorder, and the tRNA-binding proteins are significantly depleted in disorder. We show a consistent pattern of significant disorder enrichment in the non-RBD regions coupled with low levels of disorder in RBDs, which suggests that disorder is relatively rarely utilized in the RNA-binding regions. Our analysis of the non-RBD regions suggests that disorder harbors posttranslational modification sites and is involved in the putative interactions with DNA. Importantly, we utilize experimental data from DisProt and independent data from Pfam to validate the above observations that rely on the disorder predictions. This study provides new insights into the distribution of disorder across proteins that bind different RNA types and the functional role of disorder in the regions where it is enriched.


Subjects
Intrinsically Disordered Proteins/chemistry , Messenger RNA/chemistry , Ribosomal RNA/chemistry , Small Nuclear RNA/chemistry , Transfer RNA/chemistry , Untranslated RNA/chemistry , RNA-Binding Proteins/chemistry , Acetylation , Binding Sites , Gene Expression , Humans , Intrinsically Disordered Proteins/genetics , Intrinsically Disordered Proteins/metabolism , Methylation , Phosphorylation , Protein Binding , Post-Translational Protein Processing , Proteome/genetics , Proteome/metabolism , Messenger RNA/genetics , Messenger RNA/metabolism , Ribosomal RNA/genetics , Ribosomal RNA/metabolism , Small Nuclear RNA/genetics , Small Nuclear RNA/metabolism , Transfer RNA/genetics , Transfer RNA/metabolism , Untranslated RNA/genetics , Untranslated RNA/metabolism , RNA-Binding Proteins/genetics , RNA-Binding Proteins/metabolism , Ubiquitination
9.
Biomolecules ; 10(12)2020 12 04.
Article in English | MEDLINE | ID: mdl-33291838

ABSTRACT

With over 60 disorder predictors, users need help navigating the predictor selection task. We review 28 surveys of disorder predictors, showing that only 11 include assessment of predictive performance. We identify and address a few drawbacks of these past surveys. To this end, we release a novel benchmark dataset with reduced similarity to the training sets of the considered predictors. We use this dataset to perform a first-of-its-kind comparative analysis that targets two large functional families of disordered proteins that interact with proteins and with nucleic acids. We show that limiting sequence similarity between the benchmark and the training datasets has a substantial impact on predictive performance. We also demonstrate that predictive quality is sensitive to the use of the well-annotated order and inclusion of the fully structured proteins in the benchmark datasets, both of which should be considered in future assessments. We identify three predictors that provide favorable results using the new benchmark set. While we find that VSL2B offers the most accurate and robust results overall, ESpritz-DisProt and SPOT-Disorder perform particularly well for disordered proteins. Moreover, we find that predictions for the disordered protein-binding proteins suffer low predictive quality compared to generic disordered proteins and the disordered nucleic acids-binding proteins. This can be explained by the high disorder content of the disordered protein-binding proteins, which makes it difficult for the current methods to accurately identify ordered regions in these proteins. This finding motivates the development of a new generation of methods that would target these difficult-to-predict disordered proteins. We also discuss resources that support users in collecting and identifying high-quality disorder predictions.


Subjects
Computational Biology , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/metabolism , Nucleic Acids/metabolism , Algorithms , Amino Acid Sequence , Protein Databases , Protein Binding , Protein Sequence Analysis
10.
Protein Sci ; 29(1): 184-200, 2020 01.
Article in English | MEDLINE | ID: mdl-31642118

ABSTRACT

The intense interest in the intrinsically disordered proteins in the life science community, together with the remarkable advancements in predictive technologies, has given rise to the development of a large number of computational predictors of intrinsic disorder from protein sequence. While the growing number of predictors is a positive trend, we have observed a considerable difference in predictive quality among predictors for individual proteins. Furthermore, variable predictor performance is often inconsistent between predictors for different proteins, and the predictor that shows the best predictive performance depends on the unique properties of each protein sequence. We propose a computational approach, DISOselect, to estimate the predictive performance of 12 selected predictors for individual proteins based on their unique sequence-derived properties. This estimation informs the users about the expected predictive quality for a selected disorder predictor and can be used to recommend methods that are likely to provide the best quality predictions. Our solution does not depend on the results of any disorder predictor; the estimations are made based solely on the protein sequence. Our solution significantly improves predictive performance, as judged with a test set of 1,000 proteins, when compared to other alternatives. We have empirically shown that by using the recommended methods the overall predictive performance for a given set of proteins can be improved by a statistically significant margin. DISOselect is freely available for non-commercial users through the webserver at http://biomine.cs.vcu.edu/servers/DISOselect/.


Subjects
Computational Biology/methods , Proteins/chemistry , Proteins/genetics , Algorithms , Amino Acid Sequence , Protein Databases , Protein Unfolding , Protein Sequence Analysis
11.
Pac Symp Biocomput ; 25: 171-182, 2020.
Article in English | MEDLINE | ID: mdl-31797595

ABSTRACT

Intrinsically disordered regions (IDRs) lack a stable structure, yet perform biological functions. The functions of IDRs include mediating interactions with other molecules, including proteins, DNA, or RNA, and entropic functions, including domain linkers. Computational predictors provide residue-level indications of function for disordered proteins, which contrasts with the need to functionally annotate the thousands of experimentally and computationally discovered IDRs. In this work, we investigate the feasibility of using residue-level prediction methods for region-level function predictions. For an initial examination of the multiple function region-level prediction problem, we constructed a dataset of (likely) single function IDRs in proteins that are dissimilar to the training datasets of the residue-level function predictors. We find that available residue-level prediction methods are only modestly useful in predicting multiple region-level functions. Classification is enhanced by simultaneous use of multiple residue-level function predictions and is further improved by inclusion of amino acid content extracted from the protein sequence. We conclude that multifunction prediction for IDRs is feasible and benefits from the results produced by current residue-level function predictors; however, it has to accommodate inaccuracy in functional annotations.


Subjects
Intrinsically Disordered Proteins , Amino Acid Sequence , Computational Biology , Computer Simulation , DNA , Humans , Intrinsically Disordered Proteins/genetics
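
Entry 11 reports that classification improves when amino acid content extracted from the protein sequence is included as an input. A minimal Python sketch of one way to compute such a composition vector follows. The function name, the restriction to the 20 standard amino acids, and the example sequence are assumptions made for illustration; the abstract does not describe the authors' feature encoding.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_content(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in the sequence.
    Non-standard symbols (e.g. 'X') are ignored in both counts and total."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical region sequence, used only to show the output shape.
    content = amino_acid_content("GSPEEKKSPG")
    print({aa: round(v, 2) for aa, v in content.items() if v > 0})
```

The resulting 20-value vector could be concatenated with region-level summaries of residue-level predictions before training a classifier, which is the general kind of combination the abstract describes.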
12.
J Mol Biol ; 432(11): 3379-3387, 2020 05 15.
Article in English | MEDLINE | ID: mdl-31870849

ABSTRACT

Computational predictions of the intrinsic disorder and its functions are instrumental to facilitate annotation for the millions of unannotated proteins. However, access to these predictors is fragmented and requires substantial effort to find them and to collect and combine their results. The DEPICTER (DisorderEd PredictIon CenTER) server provides first-of-its-kind centralized access to 10 popular disorder and disorder function predictions that cover protein and nucleic acids binding, linkers, and moonlighting regions. It automates the prediction process, runs user-selected methods on the server side, visualizes the results, and outputs all predictions in a consistent and easy-to-parse format. DEPICTER also includes two accurate consensus predictors of disorder and disordered protein binding. Empirical tests on an independent (low similarity) benchmark dataset reveal that the computational tools included in DEPICTER generate accurate predictions that are significantly better than the results secured using sequence alignment. The DEPICTER server is freely available at http://biomine.cs.vcu.edu/servers/DEPICTER/.


Subjects
Computational Biology , Protein Databases , Intrinsically Disordered Proteins/genetics , Software , Amino Acid Sequence/genetics , Intrinsically Disordered Proteins/ultrastructure , Protein Binding/genetics , Protein Sequence Analysis
13.
Prog Mol Biol Transl Sci ; 166: 341-369, 2019.
Article in English | MEDLINE | ID: mdl-31521235

ABSTRACT

Intrinsically disordered regions (IDRs) are abundant in nature, particularly among Eukaryotes. While they facilitate a wide spectrum of cellular functions including signaling, molecular assembly and recognition, translation, transcription and regulation, only several hundred IDRs are annotated functionally. This annotation gap motivates the development of fast and accurate computational methods that predict IDR functions directly from protein sequences. We introduce and describe a comprehensive collection of 25 methods that provide accurate predictions of IDRs that interact with proteins and nucleic acids, that function as flexible linkers and that moonlight multiple functions. Virtually all of these predictors can be accessed online and many were developed in the last few years. They utilize a wide range of predictive architectures and take advantage of modern machine learning algorithms. Our empirical analysis shows that predictors that are available as webservers enjoy high rates of citations, attesting to their practical value and popularity. The most cited methods include DISOPRED3, ANCHOR, alpha-MoRFpred, MoRFpred, fMoRFpred and MoRFCHiBi. We present two case studies to demonstrate that predictions produced by these computational tools are relatively easy to interpret and that they deliver valuable functional clues. However, the current computational tools cover a relatively narrow range of disorder functions. Further development efforts that would cover a broader range of functions should be pursued. We demonstrate that a sufficient number of functionally annotated IDRs that are associated with several other disorder functions is already available and can be used to design and validate novel predictors.


Subjects
Computational Biology , Intrinsically Disordered Proteins/chemistry , Protein Databases , Humans , Molecular Sequence Annotation
14.
Comput Struct Biotechnol J ; 17: 454-462, 2019.
Article in English | MEDLINE | ID: mdl-31007871

ABSTRACT

Molecular recognition features (MoRFs) are short protein-binding regions that undergo disorder-to-order transitions (induced folding) upon binding protein partners. These regions are abundant in nature and can be predicted from protein sequences based on their distinctive sequence signatures. This first-of-its-kind survey covers 14 MoRF predictors and six related methods for the prediction of short protein-binding linear motifs, disordered protein-binding regions and semi-disordered regions. We show that the development of MoRF predictors has accelerated in recent years. These predictors depend on machine learning-derived models that were generated using training datasets where MoRFs are annotated using putative disorder. Our analysis reveals that they generate accurate predictions. We identified eight methods that offer area under the ROC curve (AUC) ≥ 0.7 on experimentally validated test datasets. We show that modern MoRF predictors accurately find experimentally annotated MoRFs even though they were trained using the putative disorder annotations. They are relatively highly cited, particularly the methods available as webservers, which on average secure three times more citations than methods without this option. MoRF predictions contribute to the experimental discovery of protein-protein interactions, annotation of protein functions and computational analysis of a variety of proteomes, protein families, and pathways. We outline future development and application directions for these tools, stressing the importance of developing novel tools that would target interactions of disordered regions with other types of partners.
