1.
Bioinformatics ; 38(11): 2988-2995, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35385080

ABSTRACT

MOTIVATION: A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate an SA. However, when more than one SA is available for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein's SA. RESULTS: We created our own dataset by generating a variety of SAs for a set of 1351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. With an example, we further show that SA selection using our method can lead to improved structure prediction. AVAILABILITY AND IMPLEMENTATION: Code and the data underlying this article are available at https://github.com/ba-lab/Alignment-Score/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning; Sequence Alignment; Computational Biology/methods; Proteins/chemistry; Amino Acid Sequence
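
The abstract above uses lDDT scores of predicted distance maps as the quality signal for an alignment. The following is a minimal sketch of an lDDT-style score computed directly on two distance maps. It is a simplified, global variant (one atom per residue, one score per map) and not the authors' implementation; the function name, the 15 Å inclusion radius, and the 0.5/1/2/4 Å tolerances are assumptions taken from the commonly used lDDT definition.

```python
def distance_map_lddt(true_d, pred_d, radius=15.0, min_sep=1,
                      thresholds=(0.5, 1.0, 2.0, 4.0)):
    """lDDT-style agreement between two square distance maps (in angstroms).

    true_d and pred_d are lists of lists. Only residue pairs closer than
    `radius` in the true map are scored (assumed inclusion radius).
    """
    n = len(true_d)
    preserved = 0.0
    considered = 0
    for i in range(n):
        for j in range(i + min_sep, n):
            if true_d[i][j] >= radius:
                continue  # pair is not local in the true structure
            considered += 1
            diff = abs(true_d[i][j] - pred_d[i][j])
            # Fraction of tolerance thresholds under which the distance is preserved
            preserved += sum(diff < t for t in thresholds) / len(thresholds)
    return preserved / considered if considered else 0.0
```

A score of 1.0 means every local distance is reproduced within 0.5 Å; ranking several alignments then amounts to sorting them by the score of the distance map each one yields.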
2.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3586-3594, 2022.
Article in English | MEDLINE | ID: mdl-34559660

ABSTRACT

BACKGROUND: Much of the recent success in protein structure prediction has been a result of accurate protein contact prediction, a binary classification problem. Dozens of methods, built from various types of machine learning and deep learning algorithms, have been published over the last two decades for predicting contacts. Recently, many groups, including Google DeepMind, have demonstrated that reformulating the problem as a multi-class classification problem is a more promising direction to pursue. As an alternative approach, we recently proposed real-valued distance predictions, formulating the problem as a regression problem. The nuances of protein 3D structures make this formulation appropriate, allowing predictions to reflect inter-residue distances in nature. Despite these promises, the accurate prediction of real-valued distances remains relatively unexplored, possibly because classification is better suited to machine and deep learning algorithms. METHODS: Can regression methods be designed to predict real-valued distances as precise as binary contacts? To investigate this, we propose multiple novel methods of input label engineering, which is different from feature engineering, with the goal of optimizing the distribution of distances to cater to the loss function of the deep-learning model. Since an important utility of predicted contacts or distances is to build three-dimensional models, we also tested whether predicted distances can reconstruct more accurate models than contacts. RESULTS: Our results demonstrate, for the first time, that deep learning methods for real-valued protein distance prediction can deliver distances as precise as binary classification methods. When using an optimal distance transformation function on the standard PSICOV dataset consisting of 150 representative proteins, the precision of 'top-all' long-range contacts improves from 60.9% to 61.4% when predicting real-valued distances instead of contacts. When building three-dimensional models, we observed an average TM-score increase from 0.61 to 0.72, highlighting the advantage of predicting real-valued distances.


Subjects
Deep Learning; Computational Biology/methods; Proteins/chemistry; Algorithms; Machine Learning
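
The abstract above compares real-valued distance predictions with binary contacts by the precision of long-range contacts. A minimal sketch of that evaluation follows: predicted distances are ranked from shortest to longest, the top pairs are treated as predicted contacts, and each is checked against the true map. The function name is hypothetical, and the 8 Å cutoff and sequence separation of 24 are the conventional definitions, assumed here rather than taken from the abstract.

```python
def top_long_range_precision(pred_d, true_d, top_n, min_sep=24, cutoff=8.0):
    """Precision of the top_n shortest predicted long-range distances.

    A pair (i, j) is long-range when j - i >= min_sep and is a true
    contact when its true distance is below `cutoff` angstroms.
    """
    n = len(true_d)
    pairs = [(pred_d[i][j], i, j)
             for i in range(n) for j in range(i + min_sep, n)]
    pairs.sort()  # shortest predicted distance first
    selected = pairs[:top_n]
    if not selected:
        return 0.0
    hits = sum(true_d[i][j] < cutoff for _, i, j in selected)
    return hits / len(selected)
```

Setting `top_n` to the protein length L gives the usual top-L precision; passing the number of true long-range contacts would correspond to a 'top-all' style evaluation.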
3.
Int J Mol Sci ; 22(11)2021 May 24.
Article in English | MEDLINE | ID: mdl-34074028

ABSTRACT

Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinnings of biology. Although recent advances in experimental approaches have greatly enhanced our ability to determine protein structures experimentally, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has seen many advances due to deep learning (DL)-based approaches, as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progress in the field of protein structure prediction due to DL-based methods as observed in CASP experiments. We describe advances in various steps of the protein structure prediction pipeline, viz. protein contact map prediction, protein distogram prediction, protein real-valued distance prediction, and quality assessment/refinement. We also highlight some end-to-end DL-based approaches for protein structure prediction. Additionally, as there have been some recent DL-based advances in protein structure determination using cryo-electron microscopy (cryo-EM), we also highlight some of the important progress in that field. Finally, we provide an outlook and possible future research directions for DL-based approaches in the protein structure prediction arena.


Subjects
Computational Biology/methods; Cryoelectron Microscopy/methods; Deep Learning; Proteins/chemistry; Sequence Analysis, Protein/methods; Algorithms; Amino Acid Sequence; Databases, Protein; Models, Molecular; Neural Networks, Computer; Protein Conformation; Software
4.
BMC Bioinformatics ; 22(1): 8, 2021 Jan 06.
Article in English | MEDLINE | ID: mdl-33407077

ABSTRACT

BACKGROUND: Protein inter-residue contact and distance prediction are two key intermediate steps essential to accurate protein structure prediction. Distance prediction comes in two forms: real-valued distances and 'binned' distograms, which are a more finely grained variant of the binary contact prediction problem. The latter has been introduced as a new challenge in the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14) 2020 experiment. Despite the recent proliferation of methods for predicting distances, few methods exist for evaluating these predictions. Currently only numerical metrics, which evaluate the entire prediction at once, are used. These give no insight into the structural details of a prediction. For this reason, new methods and tools are needed. RESULTS: We have developed a web server for evaluating predicted inter-residue distances. Our server, DISTEVAL, accepts predicted contacts, distances, and a true structure as optional inputs to generate informative heatmaps, chord diagrams, and 3D models. All of these outputs facilitate visual and qualitative assessment. The server also evaluates predictions using other metrics such as mean absolute error, root mean squared error, and contact precision. CONCLUSIONS: The visualizations generated by DISTEVAL complement each other and collectively serve as a powerful tool for both quantitative and qualitative assessments of predicted contacts and distances, even in the absence of a true 3D structure.


Subjects
Computational Biology/methods; Internet; Models, Molecular; Proteins; Amino Acids/chemistry; Amino Acids/metabolism; Protein Conformation; Proteins/chemistry; Proteins/metabolism
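
The abstract above lists mean absolute error and root mean squared error among the numerical metrics for predicted distances. As a rough illustration of those two metrics (not the DISTEVAL code), the sketch below compares a predicted and a true distance map over the upper triangle; the function name and the `min_sep` parameter are assumptions.

```python
import math

def distance_errors(pred_d, true_d, min_sep=1):
    """Return (MAE, RMSE) in angstroms over pairs with j - i >= min_sep."""
    n = len(true_d)
    diffs = [pred_d[i][j] - true_d[i][j]
             for i in range(n) for j in range(i + min_sep, n)]
    if not diffs:
        raise ValueError("no residue pairs to evaluate")
    mae = sum(abs(x) for x in diffs) / len(diffs)
    rmse = math.sqrt(sum(x * x for x in diffs) / len(diffs))
    return mae, rmse
```

Both values summarize the whole map in one number, which is exactly the limitation the abstract points out: they say nothing about where in the structure the errors occur.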
5.
Sci Rep ; 10(1): 13374, 2020 08 07.
Article in English | MEDLINE | ID: mdl-32770096

ABSTRACT

As deep learning algorithms drive the progress in protein structure prediction, a lot remains to be studied at this merging superhighway of deep learning and protein structure prediction. Recent findings show that inter-residue distance prediction, a more granular version of the well-known contact prediction problem, is a key to predicting accurate models. However, deep learning methods that predict these distances are still in the early stages of their development. To advance these methods and develop other novel methods, a need exists for a small and representative dataset packaged for faster development and testing. In this work, we introduce protein distance net (PDNET), a framework that consists of one such representative dataset along with the scripts for training and testing deep learning methods. The framework also includes all the scripts that were used to curate the dataset, and generate the input features and distance maps. Deep learning models can also be trained and tested in a web browser using free platforms such as Google Colab. We discuss how PDNET can be used to predict contacts, distance intervals, and real-valued distances.
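
The abstract above names three prediction targets: contacts, distance intervals, and real-valued distances. The sketch below shows how one inter-residue distance could be turned into each of the three label types. The 8 Å contact cutoff is the conventional definition; the function name and the interval edges are hypothetical and are not taken from PDNET.

```python
def distance_labels(d, contact_cutoff=8.0, bin_edges=(4.0, 8.0, 12.0, 16.0)):
    """Derive three training labels from one distance d (in angstroms).

    - contact:  1 if d is below the cutoff, else 0 (binary classification)
    - interval: index of the distance interval d falls in (multi-class)
    - distance: the raw value itself (regression)
    The bin edges are illustrative assumptions.
    """
    contact = 1 if d < contact_cutoff else 0
    interval = sum(d >= edge for edge in bin_edges)
    return {"contact": contact, "interval": interval, "distance": d}
```

Applying the function to every cell of a distance map yields the three kinds of target maps from the same underlying structure.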

6.
AIDS ; 34(5): 737-748, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31895148

ABSTRACT

OBJECTIVE: To develop a predictive model of neurocognitive trajectories in children with perinatal HIV (pHIV). DESIGN: Machine learning analysis of baseline and longitudinal predictors derived from clinical measures utilized in pediatric HIV. METHODS: Two hundred and eighty-five children (ages 2-14 years at baseline; mean age = 6.4 years) with pHIV in Southeast Asia underwent neurocognitive assessment at study enrollment and twice annually thereafter for an average of 5.4 years. Neurocognitive slopes were modeled to establish two subgroups [above (n = 145) and below average (n = 140) trajectories]. Gradient-boosted multivariate regressions (GBM) with five-fold cross validation were conducted to examine baseline (pre-ART) and longitudinal predictive features derived from demographic, HIV disease, immune, mental health, and physical health indices (i.e. complete blood count [CBC]). RESULTS: The baseline GBM established a classifier of neurocognitive group designation with an average AUC of 79% built from HIV disease severity and immune markers. GBM analysis of longitudinal predictors with and without interactions improved the average AUC to 87 and 90%, respectively. Mental health problems and hematocrit levels also emerged as salient features in the longitudinal models, with novel interactions between mental health problems and both CD4 cell count and hematocrit levels. Average AUCs derived from each GBM model were higher than results obtained using logistic regression. CONCLUSION: Our findings support the feasibility of machine learning to identify children with pHIV at risk for suboptimal neurocognitive development. Results also suggest that interactions between HIV disease and mental health problems are early antecedents to neurocognitive difficulties in later childhood among youth with pHIV.


Subjects
Cognition/drug effects; HIV Infections/drug therapy; HIV Infections/psychology; Infectious Disease Transmission, Vertical; Machine Learning; Psychomotor Performance/drug effects; Algorithms; CD4 Lymphocyte Count; Child; Child, Preschool; Executive Function/drug effects; Female; HIV Infections/complications; Humans; Male; Mental Health; Parturition; Pregnancy
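
The results in the abstract above are reported as AUC values averaged over cross-validation folds. For readers unfamiliar with the metric, the sketch below computes the AUC of one fold as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half). It illustrates the metric only, not the study's GBM pipeline; the function name is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An average AUC such as the 79% baseline figure would then be the mean of this value over the five held-out folds.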
7.
Bioinformatics ; 36(2): 470-477, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31359036

ABSTRACT

MOTIVATION: Exciting new opportunities have arisen to solve the protein contact prediction problem from the progress in neural networks and the availability of a large number of homologous sequences through high-throughput sequencing. In this work, we study how deep convolutional neural networks (ConvNets) may be best designed and developed to solve this long-standing problem. RESULTS: With publicly available datasets, we designed and trained various ConvNet architectures. We tested several recent deep learning techniques including wide residual networks, dropouts and dilated convolutions. We studied the improvements in the precision of medium-range and long-range contacts, and compared the performance of our best architectures with the ones used in existing state-of-the-art methods. The proposed ConvNet architectures predict contacts with significantly more precision than the architectures used in several state-of-the-art methods. When trained using the DeepCov dataset consisting of 3456 proteins and tested on PSICOV dataset of 150 proteins, our architectures achieve up to 15% higher precision when L/2 long-range contacts are evaluated. Similarly, when trained using the DNCON2 dataset consisting of 1426 proteins and tested on 84 protein domains in the CASP12 dataset, our single network achieves 4.8% higher precision than the ensembled DNCON2 method when top L long-range contacts are evaluated. AVAILABILITY AND IMPLEMENTATION: DEEPCON is available at https://github.com/badriadhikari/DEEPCON/.


Subjects
Computational Biology; Neural Networks, Computer; Proteins
8.
Proteins ; 88(6): 775-787, 2020 06.
Article in English | MEDLINE | ID: mdl-31860156

ABSTRACT

Many proteins are composed of several domains that pack together into a complex tertiary structure. Multidomain proteins can be challenging for protein structure modeling, particularly those for which templates can be found for individual domains but not for the entire sequence. In such cases, homology modeling can generate high quality models of the domains but not for the orientations between domains. Small-angle X-ray scattering (SAXS) reports the structural properties of entire proteins and has the potential for guiding homology modeling of multidomain proteins. In this article, we describe a novel multidomain protein assembly modeling method, SAXSDom that integrates experimental knowledge from SAXS with probabilistic Input-Output Hidden Markov model to assemble the structures of individual domains together. Four SAXS-based scoring functions were developed and tested, and the method was evaluated on multidomain proteins from two public datasets. Incorporation of SAXS information improved the accuracy of domain assembly for 40 out of 46 critical assessment of protein structure prediction multidomain protein targets and 45 out of 73 multidomain protein targets from the ab initio domain assembly dataset. The results demonstrate that SAXS data can provide useful information to improve the accuracy of domain-domain assembly. The source code and tool packages are available at https://github.com/jianlin-cheng/SAXSDom.


Subjects
Bacterial Proteins/chemistry; Caspases/chemistry; Escherichia coli Proteins/chemistry; Membrane Proteins/chemistry; Software; Bacterial Proteins/genetics; Bacterial Proteins/metabolism; Binding Sites; Caspases/genetics; Caspases/metabolism; Crystallography, X-Ray; Escherichia coli/chemistry; Escherichia coli Proteins/genetics; Escherichia coli Proteins/metabolism; Humans; Markov Chains; Membrane Proteins/genetics; Membrane Proteins/metabolism; Models, Molecular; Monte Carlo Method; Protein Binding; Protein Conformation, alpha-Helical; Protein Conformation, beta-Strand; Protein Interaction Domains and Motifs; Protein Structure, Tertiary; Rhodobacter capsulatus/chemistry; Scattering, Small Angle; Structural Homology, Protein; Thermodynamics; X-Ray Diffraction
9.
Bioinformatics ; 36(4): 1091-1098, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31504181

ABSTRACT

MOTIVATION: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated. RESULTS: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns comprised of multiple correlated contacts for more accurate contact prediction than the local coevolution-based methods, leading to a substantial increase of precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed deeper sequence alignments and the integration of domain-based contact prediction with the full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that the domain-based contact prediction based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures was a simple, fast approach to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction. AVAILABILITY AND IMPLEMENTATION: https://github.com/multicom-toolbox/DNCON2/. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology; Deep Learning; Algorithms; Proteins; Sequence Alignment
10.
Virol J ; 16(1): 7, 2019 01 11.
Article in English | MEDLINE | ID: mdl-30634979

ABSTRACT

BACKGROUND: Tospoviruses (genus Tospovirus, family Peribunyaviridae, order Bunyavirales) cause significant losses to a wide range of agronomic and horticultural crops worldwide. Identification and characterization of specific sequences and motifs that are critical for virus infection and pathogenicity could provide useful insights and targets for engineering virus resistance that is potentially both broad spectrum and durable. Tomato spotted wilt virus (TSWV), the most prolific member of the group, was used to better understand the structure-function relationships of the nucleocapsid gene (N), and the silencing suppressor gene (NSs), coded by the TSWV small RNA. METHODS: Using a global collection of orthotospoviral sequences, we identified several amino acids that were conserved across the genus and determined the potential location of these conserved amino acid motifs in these proteins. We used state-of-the-art 3D modeling algorithms, MULTICOM-CLUSTER, MULTICOM-CONSTRUCT, MULTICOM-NOVEL, I-TASSER, ROSETTA and CONFOLD to predict the secondary and tertiary structures of the N and the NSs proteins. RESULTS: We identified nine amino acid residues in the N protein among 31 known tospoviral species, and ten amino acid residues in NSs protein among 27 tospoviral species that were conserved across the genus. For the N protein, all three algorithms gave nearly identical tertiary models. While the conserved residues were distributed throughout the protein on a linear scale, at the tertiary level, three residues were consistently located in the coil in all the models. For NSs protein models, there was no agreement among the three algorithms. However, with respect to the localization of the conserved motifs, G18 was consistently located in coil, while H115 was localized in the coil in three models. CONCLUSIONS: This is the first report of predicting the 3D structure of any tospoviral NSs protein and revealed a consistent location for two of the ten conserved residues. The modelers used gave accurate prediction for N protein allowing the localization of the conserved residues. Results form the basis for further work on the structure-function relationships of tospoviral proteins and could be useful in developing novel virus control strategies targeting the conserved residues.


Subjects
Molecular Conformation; Nucleocapsid Proteins/chemistry; Nucleoproteins/chemistry; Tospovirus/genetics; Amino Acid Motifs; Amino Acid Sequence; Conserved Sequence; Gene Silencing; Nucleocapsid Proteins/genetics; Nucleoproteins/genetics; RNA, Viral; Tospovirus/chemistry
11.
Sci Rep ; 8(1): 9939, 2018 07 02.
Article in English | MEDLINE | ID: mdl-29967418

ABSTRACT

Every two years, groups worldwide participate in the Critical Assessment of Protein Structure Prediction (CASP) experiment to blindly test the strengths and weaknesses of their computational methods. CASP has significantly advanced the field, but many hurdles still remain, which may require new ideas and collaborations. In 2012, a web-based effort called WeFold was initiated to promote collaboration within the CASP community and attract researchers from other fields to contribute new ideas to CASP. Members of the WeFold coopetition (cooperation and competition) participated in CASP as individual teams, but also shared components of their methods to create hybrid pipelines and actively contributed to this effort. We assert that the scale and diversity of integrative prediction pipelines could not have been achieved by any individual lab or even by any collaboration among a few partners. The models contributed by the participating groups and generated by the pipelines are publicly available at the WeFold website, providing a wealth of data that remains to be tapped. Here, we analyze the results of the 2014 and 2016 pipelines, showing improvements according to the CASP assessment as well as areas that require further adjustments and research.


Subjects
Caspase 12/metabolism; Caspases/metabolism; Computational Biology/methods; Models, Molecular; Software; Caspase 12/chemistry; Caspases/chemistry; Humans; Protein Conformation
13.
BMC Bioinformatics ; 19(1): 22, 2018 01 25.
Article in English | MEDLINE | ID: mdl-29370750

ABSTRACT

BACKGROUND: Contact-guided protein structure prediction methods are becoming more and more successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted contacts need to be developed. RESULTS: We develop an improved contact-driven protein modelling method, CONFOLD2, and study how it may be effectively used for ab initio protein structure prediction with predicted contacts as input. It builds models using various subsets of input contacts to explore the fold space under the guidance of a soft square energy function, and then clusters the models to obtain the top five models. CONFOLD2 obtains an average reconstruction accuracy of 0.57 TM-score for the 150 proteins in the PSICOV contact prediction dataset. When benchmarked on the CASP11 contacts predicted using CONSIP2 and CASP12 contacts predicted using Raptor-X, CONFOLD2 achieves a mean TM-score of 0.41 on both datasets. CONCLUSION: CONFOLD2 can quickly generate the top five structural models for a protein sequence when its secondary structure and contact predictions are at hand. The source code of CONFOLD2 is publicly available at https://github.com/multicom-toolbox/CONFOLD2/.


Subjects
Proteins/chemistry; User-Computer Interface; Algorithms; Databases, Protein; Internet; Protein Conformation; Protein Folding; Proteins/metabolism
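
Reconstruction accuracy in the abstract above is reported as TM-score. The sketch below evaluates the standard TM-score formula for one fixed superposition, given the per-residue deviations between model and native structure. It is a simplification: the real TM-score is the maximum of this quantity over all superpositions, which this sketch does not search. The function names are hypothetical, and the 0.5 Å floor on d0 for short chains is an assumption.

```python
def tm_d0(target_len):
    """Length-dependent distance scale d0 (angstroms) of the TM-score formula."""
    if target_len <= 15:
        return 0.5  # assumed floor for very short chains
    return max(0.5, 1.24 * (target_len - 15) ** (1.0 / 3.0) - 1.8)

def tm_score_fixed(deviations, target_len):
    """TM-score term for a single, already chosen superposition.

    `deviations` holds the distance (angstroms) between each aligned
    residue pair; unaligned residues are simply left out and add zero.
    """
    d0 = tm_d0(target_len)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in deviations) / target_len
```

Because d0 grows with chain length, the score is comparable across proteins of different sizes, which is why averages such as 0.57 over 150 proteins are meaningful.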
14.
Bioinformatics ; 34(9): 1466-1472, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29228185

ABSTRACT

Motivation: Significant improvements in the prediction of protein residue-residue contacts have been observed in recent years. These contacts, predicted using a variety of coevolution-based and machine learning methods, are the key contributors to the recent progress in ab initio protein structure prediction, as demonstrated in the recent CASP experiments. Continuing the development of new methods to reliably predict contact maps is essential to further improve ab initio structure prediction. Results: In this paper we discuss DNCON2, an improved protein contact map predictor based on two-level deep convolutional neural networks. It consists of six convolutional neural networks: the first five predict contacts at 6, 7.5, 8, 8.5 and 10 Å distance thresholds, and the last one uses these five predictions as additional features to predict final contact maps. On the free-modeling datasets in CASP10, 11 and 12 experiments, DNCON2 achieves mean precisions of 35, 50 and 53.4%, respectively, higher than 30.6% by MetaPSICOV on CASP10 dataset, 34% by MetaPSICOV on CASP11 dataset and 46.3% by Raptor-X on CASP12 dataset, when top L/5 long-range contacts are evaluated. We attribute the improved performance of DNCON2 to the inclusion of short- and medium-range contacts into training, two-level approach to prediction, use of the state-of-the-art optimization and activation functions, and a novel deep learning architecture that allows each filter in a convolutional layer to access all the input features of a protein of arbitrary length. Availability and implementation: The web server of DNCON2 is at http://sysbio.rnet.missouri.edu/dncon2/ where training and testing datasets as well as the predictions for CASP10, 11 and 12 free-modeling datasets can also be downloaded. Its source code is available at https://github.com/multicom-toolbox/DNCON2/. Contact: chengji@missouri.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Machine Learning; Neural Networks, Computer; Protein Conformation; Caspases/chemistry; Caspases/metabolism; Computational Biology/methods; Sequence Analysis, Protein
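
The abstract above describes five first-level networks that each predict contacts at a different distance threshold. As an illustration of what the corresponding target maps could look like, the sketch below derives one binary contact map per threshold from a single true distance map. The thresholds come from the abstract; the function name and the way labels are built are assumptions, not the DNCON2 code.

```python
THRESHOLDS = (6.0, 7.5, 8.0, 8.5, 10.0)  # angstroms, as listed in the abstract

def multi_threshold_contacts(dist_map, thresholds=THRESHOLDS):
    """Return {threshold: binary contact map} for one distance map.

    A cell is 1 when the distance is below the threshold, else 0.
    """
    return {t: [[1 if d < t else 0 for d in row] for row in dist_map]
            for t in thresholds}
```

Larger thresholds yield denser maps, so the five maps give the second-level network views of the same structure at increasing granularity.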
15.
Bioinformatics ; 34(8): 1295-1303, 2018 04 15.
Article in English | MEDLINE | ID: mdl-29228193

ABSTRACT

Motivation: Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods have been developed to classify protein sequences into a small number of folds due to methodological limitations, and these are not generally useful in practice. Results: We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of sequence-structure relationship. Different from traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding an average classification accuracy of 75.3%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 73.0%. We compare our method with a top profile-profile alignment method, HHSearch, on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 12.63-26.32% higher than HHSearch on template-free modeling targets and 3.39-17.09% higher on hard template-based modeling targets for top 1, 5 and 10 predicted folds. The hidden features extracted from sequence by our method are robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking. Availability and implementation: The DeepSF server is publicly available at: http://iris.rnet.missouri.edu/DeepSF/. Contact: chengji@missouri.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer; Protein Folding; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Software; Animals; Computational Biology/methods; Humans; Models, Molecular
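
Fold recognition accuracy in the abstract above is reported for the top 1, 5 and 10 predicted folds. A minimal sketch of that top-k accuracy follows; the function name and the input layout (one ranked list of fold labels per protein) are assumptions for illustration.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of proteins whose true fold is among the first k predictions.

    ranked_predictions[i] is a list of fold labels for protein i,
    ordered from most to least confident.
    """
    if not true_labels:
        raise ValueError("no proteins to evaluate")
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)
```

Top-1 is ordinary classification accuracy; larger k credits the classifier whenever the correct fold is shortlisted.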
16.
Proteins ; 86 Suppl 1: 84-96, 2018 03.
Article in English | MEDLINE | ID: mdl-29047157

ABSTRACT

In this study, we report the evaluation of the residue-residue contacts predicted by our three different methods in the CASP12 experiment, focusing on studying the impact of multiple sequence alignment, residue coevolution, and machine learning on contact prediction. The first method (MULTICOM-NOVEL) uses only traditional features (sequence profile, secondary structure, and solvent accessibility) with deep learning to predict contacts and serves as a baseline. The second method (MULTICOM-CONSTRUCT) uses our new alignment algorithm to generate deep multiple sequence alignment to derive coevolution-based features, which are integrated by a neural network method to predict contacts. The third method (MULTICOM-CLUSTER) is a consensus combination of the predictions of the first two methods. We evaluated our methods on 94 CASP12 domains. On a subset of 38 free-modeling domains, our methods achieved an average precision of up to 41.7% for top L/5 long-range contact predictions. The comparison of the three methods shows that the quality and effective depth of multiple sequence alignments, coevolution-based features, and machine learning integration of coevolution-based features and traditional features drive the quality of predicted protein contacts. On the full CASP12 dataset, the coevolution-based features alone can improve the average precision from 28.4% to 41.6%, and the machine learning integration of all the features further raises the precision to 56.3%, when top L/5 predicted long-range contacts are evaluated. The correlation between the precision of contact prediction and the logarithm of the number of effective sequences in alignments is 0.66.


Subjects
Computational Biology/methods; Machine Learning; Models, Molecular; Protein Conformation; Proteins/chemistry; Algorithms; Crystallography, X-Ray; Databases, Protein; Datasets as Topic; Humans; Protein Folding; Sequence Alignment; Sequence Analysis, Protein
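
The abstract above relates contact precision to the number of effective sequences in an alignment. One common way to compute that number is to down-weight each sequence by how many near-duplicates it has, as sketched below. The exact definition used in the study is not given in the abstract, so the 80% identity threshold, the identity measure, and the function names are all assumptions.

```python
def identity(a, b):
    """Fraction of aligned columns in which two equal-length rows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def effective_sequences(msa, threshold=0.8):
    """Number of effective sequences: sum over rows of 1 / cluster size.

    The cluster of a row is every row (itself included) whose identity
    to it is at least `threshold`.
    """
    total = 0.0
    for s in msa:
        neighbours = sum(identity(s, t) >= threshold for t in msa)
        total += 1.0 / neighbours
    return total
```

With this definition, two identical rows count as one effective sequence, so a deep but redundant alignment scores lower than a diverse one of the same size.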
17.
BMC Bioinformatics ; 18(1): 417, 2017 Sep 18.
Article in English | MEDLINE | ID: mdl-28923002

ABSTRACT

BACKGROUND: Deep learning is one of the most powerful machine learning methods and has achieved state-of-the-art performance in many domains. Since deep learning was introduced to the field of bioinformatics in 2012, it has achieved success in a number of areas such as protein residue-residue contact prediction, secondary structure prediction, and fold recognition. In this work, we developed deep learning methods to improve the prediction of torsion (dihedral) angles of proteins. RESULTS: We design four different deep learning architectures to predict protein torsion angles. The architectures include a deep neural network (DNN), a deep restricted Boltzmann machine (DRBM), a deep recurrent neural network (DRNN) and a deep recurrent restricted Boltzmann machine (DReRBM), since protein torsion angle prediction is a sequence-related problem. In addition to existing protein features, two new features (predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments) are used as input to each of the four deep learning architectures to predict phi and psi angles of protein backbone. The mean absolute error (MAE) of phi and psi angles predicted by DRNN, DReRBM, DRBM and DNN is about 20-21° and 29-30° on an independent dataset. The MAE of phi angle is comparable to the existing methods, but the MAE of psi angle is 29°, 2° lower than the existing methods. On the latest CASP12 targets, our methods also achieved performance better than or comparable to a state-of-the-art method. CONCLUSIONS: Our experiment demonstrates that deep learning is a valuable method for predicting protein torsion angles. The deep recurrent network architecture performs slightly better than deep feed-forward architecture, and the predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments are useful features for improving prediction accuracy.


Subjects
Machine Learning , Proteins/chemistry , Molecular Structure , Computer Neural Networks , Secondary Protein Structure
18.
BMC Bioinformatics ; 18(1): 380, 2017 Aug 29.
Article in English | MEDLINE | ID: mdl-28851269

ABSTRACT

BACKGROUND: Residue-residue contacts are key features for accurate de novo protein structure prediction. To make optimal use of predicted contacts in folding proteins accurately, it is important to study the challenges of reconstructing protein structures using true contacts. Because the contact-guided protein modeling approach is valuable for predicting the folds of proteins that do not have structural templates, reconstruction studies need to focus on hard-to-predict protein structures. RESULTS: Using a dataset of 496 structural domains released in recent CASP experiments and a dataset of 150 representative protein structures, we discuss three techniques to improve reconstruction accuracy using true contacts: adding secondary structures, increasing contact distance thresholds, and adding non-contacts. We find that reconstruction using secondary structures and contacts can deliver higher accuracy than using full contact maps. Similarly, we demonstrate that non-contacts can improve reconstruction accuracy not only when the non-contacts used are true but also when they are predicted. On the dataset of 150 proteins, we find that simply treating low-ranked predicted contacts as non-contacts and adding them as additional restraints can increase reconstruction accuracy by 5% when the reconstructed models are evaluated using TM-score. CONCLUSIONS: Our findings suggest that secondary structures are invaluable companions of contacts for accurate reconstruction. Confirming some earlier findings, we also find that larger distance thresholds are useful for folding many protein structures that cannot be folded using the standard definition of contacts. Our findings also suggest that, for more accurate reconstruction using predicted contacts, it is useful to predict contacts at higher distance thresholds (beyond 8 Å) and to predict non-contacts.
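The notions of contacts, distance thresholds, and non-contacts discussed above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes one representative atom coordinate per residue (for example Cβ), an 8 Å default threshold, and a minimum sequence separation of 6, the last of which is an assumption not stated in the abstract. The function name is hypothetical.

```python
import math


def contact_pairs(coords, threshold=8.0, min_separation=6):
    """Split residue pairs into contacts and non-contacts by distance.

    coords: list of (x, y, z) tuples, one per residue (assumed Cβ-like atoms).
    threshold: distance cutoff in Å; pairs closer than this are contacts.
    min_separation: pairs closer in sequence than this are ignored (assumption).
    """
    contacts, non_contacts = [], []
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.append((i, j))
            else:
                non_contacts.append((i, j))
    return contacts, non_contacts
```

Raising `threshold` (for example to 10.0) moves pairs from the non-contact list into the contact list, which is the kind of relaxed definition the abstract refers to.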


Subjects
Molecular Models , Proteins/chemistry , Caspases/chemistry , Caspases/metabolism , Secondary Protein Structure , Proteins/metabolism
19.
Bioinformatics ; 33(4): 586-588, 2017 02 15.
Article in English | MEDLINE | ID: mdl-28035027

ABSTRACT

Motivation: Protein model quality assessment (QA) plays a very important role in protein structure prediction. QA methods fall into two groups: single-model methods and consensus methods. Consensus QA methods may fail when a large portion of the models in the pool are of low quality. Results: In this paper, we develop a novel single-model quality assessment method, QAcon, which uses structural features, physicochemical properties, and residue contact predictions. We apply residue-residue contact information predicted by two protein contact prediction methods, PSICOV and DNcon, to generate a new score as a feature for quality assessment. This novel feature and 11 other features are used as input to train a two-layer neural network on CASP9 datasets to predict the quality of a single protein model. We blindly benchmarked QAcon on the CASP11 dataset as the MULTICOM-CLUSTER server. Based on the evaluation, our method ranks among the top single-model QA methods. The good performance of the features based on contact prediction illustrates the value of using contact information in protein quality assessment. Availability and Implementation: The web server and the source code of QAcon are freely available at: http://cactus.rnet.missouri.edu/QAcon. Contact: chengji@missouri.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Machine Learning , Molecular Models , Proteins/chemistry , Animals , Humans , Protein Conformation , Proteins/metabolism , Quality Control
20.
Methods Mol Biol ; 1484: 115-126, 2017.
Article in English | MEDLINE | ID: mdl-27787823

ABSTRACT

Recent successes of contact-guided protein structure prediction methods have revived interest in solving the long-standing problem of ab initio protein structure prediction. With homology modeling failing for many protein sequences that do not have templates, contact-guided structure prediction has shown promise, and contact prediction has consequently attracted considerable interest. Although a few dozen contact prediction tools are already available as web servers and downloadable programs, not enough research has been done on using existing measures such as precision and recall to evaluate predicted contacts with the goal of building three-dimensional models. Moreover, when no native structure is available for a set of predicted contacts, the only analysis we can perform is a simple contact map visualization of the predicted contacts. A wider and more rigorous assessment of predicted contacts is needed in order to build tertiary structure models. This chapter provides instructions and protocols for using tools and applying techniques to assess predicted contacts for building three-dimensional models.
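The precision and recall measures mentioned above can be illustrated with a short sketch. It is only an assumption-laden example, not a protocol from the chapter: predicted contacts are taken to be `(i, j, confidence)` tuples, native contacts are `(i, j)` pairs, pairs are treated as unordered, and the optional `top_k` cutoff and the function name are hypothetical.

```python
def contact_precision_recall(predicted, native, top_k=None):
    """Precision and recall of predicted contacts against native contacts.

    predicted: list of (i, j, confidence) tuples.
    native: iterable of (i, j) residue pairs from a known structure.
    top_k: if given, only the top_k highest-confidence predictions are scored.
    """
    ranked = sorted(predicted, key=lambda c: c[2], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    # Treat (i, j) and (j, i) as the same pair.
    native_set = {tuple(sorted(pair)) for pair in native}
    selected = {tuple(sorted((i, j))) for i, j, _ in ranked}
    hits = len(selected & native_set)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(native_set) if native_set else 0.0
    return precision, recall
```

Scoring only the highest-confidence predictions typically trades recall for precision, which is why both numbers are reported together.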


Subjects
Computational Biology/methods , Proteins/chemistry , Software , Algorithms , Protein Databases , Computer Neural Networks , Protein Conformation , Protein Folding , Proteins/genetics , Protein Sequence Analysis
...