Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 15 de 15
Filtrar
1.
Bioinformatics ; 39(3)2023 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-36916746

RESUMO

MOTIVATION: Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired. RESULTS: Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein. AVAILABILITY AND IMPLEMENTATION: The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.


Assuntos
Proteínas , Software , Sequência de Aminoácidos , Proteínas/química
2.
BMC Bioinformatics ; 22(1): 439, 2021 Sep 15.
Artigo em Inglês | MEDLINE | ID: mdl-34525939

RESUMO

BACKGROUND: Accurate prediction of protein tertiary structures is highly desired as the knowledge of protein structures provides invaluable insights into protein functions. We have designed two approaches to protein structure prediction, including a template-based modeling approach (called ProALIGN) and an ab initio prediction approach (called ProFOLD). Briefly speaking, ProALIGN aligns a target protein with templates through exploiting the patterns of context-specific alignment motifs and then builds the final structure with reference to the homologous templates. In contrast, ProFOLD uses an end-to-end neural network to estimate inter-residue distances of target proteins and builds structures that satisfy these distance constraints. These two approaches emphasize different characteristics of target proteins: ProALIGN exploits structure information of homologous templates of target proteins while ProFOLD exploits the co-evolutionary information carried by homologous protein sequences. Recent progress has shown that the combination of template-based modeling and ab initio approaches is promising. RESULTS: In the study, we present FALCON2, a web server that integrates ProALIGN and ProFOLD to provide high-quality protein structure prediction service. For a target protein, FALCON2 executes ProALIGN and ProFOLD simultaneously to predict possible structures and selects the most likely one as the final prediction result. We evaluated FALCON2 on widely-used benchmarks, including 104 CASP13 (the 13th Critical Assessment of protein Structure Prediction) targets and 91 CASP14 targets. In-depth examination suggests that when high-quality templates are available, ProALIGN is superior to ProFOLD and in other cases, ProFOLD shows better performance. By integrating these two approaches with different emphasis, FALCON2 server outperforms the two individual approaches and also achieves state-of-the-art performance compared with existing approaches. CONCLUSIONS: By integrating template-based modeling and ab initio approaches, FALCON2 provides an easy-to-use and high-quality protein structure prediction service for the community and we expect it to enable insights into a deep understanding of protein functions.


Assuntos
Redes Neurais de Computação , Proteínas , Sequência de Aminoácidos , Computadores , Conformação Proteica , Software
3.
BMC Genomics ; 21(Suppl 11): 878, 2020 Dec 29.
Artigo em Inglês | MEDLINE | ID: mdl-33372607

RESUMO

BACKGROUND: Accurate prediction of protein structure is fundamentally important to understand biological function of proteins. Template-based modeling, including protein threading and homology modeling, is a popular method for protein tertiary structure prediction. However, accurate template-query alignment and template selection are still very challenging, especially for the proteins with only distant homologs available. RESULTS: We propose a new template-based modelling method called ThreaderAI to improve protein tertiary structure prediction. ThreaderAI formulates the task of aligning query sequence with template as the classical pixel classification problem in computer vision and naturally applies deep residual neural network in prediction. ThreaderAI first employs deep learning to predict residue-residue aligning probability matrix by integrating sequence profile, predicted sequential structural features, and predicted residue-residue contacts, and then builds template-query alignment by applying a dynamic programming algorithm on the probability matrix. We evaluated our methods both in generating accurate template-query alignment and protein threading. Experimental results show that ThreaderAI outperforms currently popular template-based modelling methods HHpred, CNFpred, and the latest contact-assisted method CEthreader, especially on the proteins that do not have close homologs with known structures. In particular, in terms of alignment accuracy measured with TM-score, ThreaderAI outperforms HHpred, CNFpred, and CEthreader by 56, 13, and 11%, respectively, on template-query pairs at the similarity of fold level from SCOPe data. And on CASP13's TBM-hard data, ThreaderAI outperforms HHpred, CNFpred, and CEthreader by 16, 9 and 8% in terms of TM-score, respectively. CONCLUSIONS: These results demonstrate that with the help of deep learning, ThreaderAI can significantly improve the accuracy of template-based structure prediction, especially for distant-homology proteins.


Assuntos
Aprendizado Profundo , Análise de Sequência de Proteína , Algoritmos , Bases de Dados de Proteínas , Conformação Proteica , Proteínas , Software
4.
BMC Bioinformatics ; 20(Suppl 3): 135, 2019 Mar 29.
Artigo em Inglês | MEDLINE | ID: mdl-30925867

RESUMO

BACKGROUND: The ab initio approaches to protein structure prediction usually employ the Monte Carlo technique to search the structural conformation that has the lowest energy. However, the widely-used energy functions are usually ineffective for conformation search. How to construct an effective energy function remains a challenging task. RESULTS: Here, we present a framework to construct effective energy functions for protein structure prediction. Unlike existing energy functions only requiring the native structure to be the lowest one, we attempt to maximize the attraction-basin where the native structure lies in the energy landscape. The underlying rationale is that each energy function determines a specific energy landscape together with a native attraction-basin, and the larger the attraction-basin is, the more likely for the Monte Carlo search procedure to find the native structure. Following this rationale, we constructed effective energy functions as follows: i) To explore the native attraction-basin determined by a certain energy function, we performed reverse Monte Carlo sampling starting from the native structure, identifying the structural conformations on the edge of attraction-basin. ii) To broaden the native attraction-basin, we smoothened the edge points of attraction-basin through tuning weights of energy terms, thus acquiring an improved energy function. Our framework alternates the broadening attraction-basin and reverse sampling steps (thus called BARS) until the native attraction-basin is sufficiently large. We present extensive experimental results to show that using the BARS framework, the constructed energy functions could greatly facilitate protein structure prediction in improving the quality of predicted structures and speeding up conformation search. CONCLUSION: Using the BARS framework, we constructed effective energy functions for protein structure prediction, which could improve the quality of predicted structures and speed up conformation search as well.


Assuntos
Biologia Computacional/métodos , Método de Monte Carlo , Proteínas/química , Algoritmos , Bases de Dados de Proteínas , Conformação Proteica , Termodinâmica
5.
BMC Bioinformatics ; 20(1): 537, 2019 Oct 29.
Artigo em Inglês | MEDLINE | ID: mdl-31664895

RESUMO

BACKGROUND: Accurate prediction of inter-residue contacts of a protein is important to calculating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective in inferring inter-residue contacts. The Markov random field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate; in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. RESULTS: In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite-likelihood, i.e., the product of conditional probability of all residue pairs. Composite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction accuracy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present a successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. CONCLUSIONS: Composite likelihood maximization algorithm can efficiently estimate the parameters of Markov Random Fields and can improve the prediction accuracy of protein inter-residue contacts.


Assuntos
Aprendizado Profundo , Proteínas/química , Algoritmos , Probabilidade
6.
BMC Bioinformatics ; 20(1): 616, 2019 Nov 29.
Artigo em Inglês | MEDLINE | ID: mdl-31783729

RESUMO

Following publication of the original article [1], the author explained that there are several errors in the original article.

7.
Bioinformatics ; 33(23): 3749-3757, 2017 Dec 01.
Artigo em Inglês | MEDLINE | ID: mdl-28961795

RESUMO

MOTIVATION: Accurate recognition of protein fold types is a key step for template-based prediction of protein structures. The existing approaches to fold recognition mainly exploit the features derived from alignments of query protein against templates. These approaches have been shown to be successful for fold recognition at family level, but usually failed at superfamily/fold levels. To overcome this limitation, one of the key points is to explore more structurally informative features of proteins. Although residue-residue contacts carry abundant structural information, how to thoroughly exploit these information for fold recognition still remains a challenge. RESULTS: In this study, we present an approach (called DeepFR) to improve fold recognition at superfamily/fold levels. The basic idea of our approach is to extract fold-specific features from predicted residue-residue contacts of proteins using deep convolutional neural network (DCNN) technique. Based on these fold-specific features, we calculated similarity between query protein and templates, and then assigned query protein with fold type of the most similar template. DCNN has showed excellent performance in image feature extraction and image recognition; the rational underlying the application of DCNN for fold recognition is that contact likelihood maps are essentially analogy to images, as they both display compositional hierarchy. Experimental results on the LINDAHL dataset suggest that even using the extracted fold-specific features alone, our approach achieved success rate comparable to the state-of-the-art approaches. When further combining these features with traditional alignment-related features, the success rate of our approach increased to 92.3%, 82.5% and 78.8% at family, superfamily and fold levels, respectively, which is about 18% higher than the state-of-the-art approach at fold level, 6% higher at superfamily level and 1% higher at family level. An independent assessment on SCOP_TEST dataset showed consistent performance improvement, indicating robustness of our approach. Furthermore, bi-clustering results of the extracted features are compatible with fold hierarchy of proteins, implying that these features are fold-specific. Together, these results suggest that the features extracted from predicted contacts are orthogonal to alignment-related features, and the combination of them could greatly facilitate fold recognition at superfamily/fold levels and template-based prediction of protein structures. AVAILABILITY AND IMPLEMENTATION: Source code of DeepFR is freely available through https://github.com/zhujianwei31415/deepfr, and a web server is available through http://protein.ict.ac.cn/deepfr. CONTACT: zheng@itp.ac.cn or dbu@ict.ac.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Dobramento de Proteína , Algoritmos , Redes Neurais de Computação , Proteínas/química , Software
8.
BMC Bioinformatics ; 18(Suppl 3): 70, 2017 Mar 14.
Artigo em Inglês | MEDLINE | ID: mdl-28361691

RESUMO

BACKGROUND: Residues in a protein might be buried inside or exposed to the solvent surrounding the protein. The buried residues usually form hydrophobic cores to maintain the structural integrity of proteins while the exposed residues are tightly related to protein functions. Thus, the accurate prediction of solvent accessibility of residues will greatly facilitate our understanding of both structure and functionalities of proteins. Most of the state-of-the-art prediction approaches consider the burial state of each residue independently, thus neglecting the correlations among residues. RESULTS: In this study, we present a high-order conditional random field model that considers burial states of all residues in a protein simultaneously. Our approach exploits not only the correlation among adjacent residues but also the correlation among long-range residues. Experimental results showed that by exploiting the correlation among residues, our approach outperformed the state-of-the-art approaches in prediction accuracy. In-depth case studies also showed that by using the high-order statistical model, the errors committed by the bidirectional recurrent neural network and chain conditional random field models were successfully corrected. CONCLUSIONS: Our methods enable the accurate prediction of residue burial states, which should greatly facilitate protein structure prediction and evaluation.


Assuntos
Modelos Teóricos , Proteínas/química , Bases de Dados Factuais , Interações Hidrofóbicas e Hidrofílicas , Conformação Proteica , Reprodutibilidade dos Testes , Solventes/química
9.
Bioinformatics ; 32(3): 462-4, 2016 Feb 01.
Artigo em Inglês | MEDLINE | ID: mdl-26454278

RESUMO

SUMMARY: The protein structure prediction approaches can be categorized into template-based modeling (including homology modeling and threading) and free modeling. However, the existing threading tools perform poorly on remote homologous proteins. Thus, improving fold recognition for remote homologous proteins remains a challenge. Besides, the proteome-wide structure prediction poses another challenge of increasing prediction throughput. In this study, we presented FALCON@home as a protein structure prediction server focusing on remote homologue identification. The design of FALCON@home is based on the observation that a structural template, especially for remote homologous proteins, consists of conserved regions interweaved with highly variable regions. The highly variable regions lead to vague alignments in threading approaches. Thus, FALCON@home first extracts conserved regions from each template and then aligns a query protein with conserved regions only rather than the full-length template directly. This helps avoid the vague alignments rooted in highly variable regions, improving remote homologue identification. We implemented FALCON@home using the Berkeley Open Infrastructure of Network Computing (BOINC) volunteer computing protocol. With computation power donated from over 20,000 volunteer CPUs, FALCON@home shows a throughput as high as processing of over 1000 proteins per day. In the Critical Assessment of protein Structure Prediction (CASP11), the FALCON@home-based prediction was ranked the 12th in the template-based modeling category. As an application, the structures of 880 mouse mitochondria proteins were predicted, which revealed the significant correlation between protein half-lives and protein structural factors. AVAILABILITY AND IMPLEMENTATION: FALCON@home is freely available at http://protein.ict.ac.cn/FALCON/. CONTACT: shuaicli@cityu.edu.hk, dbu@ict.ac.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Conformação Proteica , Proteínas/química , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Software , Animais , Biologia Computacional/métodos , Bases de Dados de Proteínas , Ensaios de Triagem em Larga Escala , Camundongos
10.
Biochem Biophys Res Commun ; 472(1): 217-22, 2016 Mar 25.
Artigo em Inglês | MEDLINE | ID: mdl-26920058

RESUMO

Strategies for correlation analysis in protein contact prediction often encounter two challenges, namely, the indirect coupling among residues, and the background correlations mainly caused by phylogenetic biases. While various studies have been conducted on how to disentangle indirect coupling, the removal of background correlations still remains unresolved. Here, we present an approach for removing background correlations via low-rank and sparse decomposition (LRS) of a residue correlation matrix. The correlation matrix can be constructed using either local inference strategies (e.g., mutual information, or MI) or global inference strategies (e.g., direct coupling analysis, or DCA). In our approach, a correlation matrix was decomposed into two components, i.e., a low-rank component representing background correlations, and a sparse component representing true correlations. Finally the residue contacts were inferred from the sparse component of correlation matrix. We trained our LRS-based method on the PSICOV dataset, and tested it on both GREMLIN and CASP11 datasets. Our experimental results suggested that LRS significantly improves the contact prediction precision. For example, when equipped with the LRS technique, the prediction precision of MI and mfDCA increased from 0.25 to 0.67 and from 0.58 to 0.70, respectively (Top L/10 predicted contacts, sequence separation: 5 AA, dataset: GREMLIN). In addition, our LRS technique also consistently outperforms the popular denoising technique APC (average product correction), on both local (MI_LRS: 0.67 vs MI_APC: 0.34) and global measures (mfDCA_LRS: 0.70 vs mfDCA_APC: 0.67). Interestingly, we found out that when equipped with our LRS technique, local inference strategies performed in a comparable manner to that of global inference strategies, implying that the application of LRS technique narrowed down the performance gap between local and global inference strategies. Overall, our LRS technique greatly facilitates protein contact prediction by removing background correlations. An implementation of the approach called COLORS (improving COntact prediction using LOw-Rank and Sparse matrix decomposition) is available from http://protein.ict.ac.cn/COLORS/.


Assuntos
Domínios e Motivos de Interação entre Proteínas , Mapeamento de Interação de Proteínas/métodos , Algoritmos , Simulação por Computador , Bases de Dados de Proteínas , Evolução Molecular , Modelos Moleculares , Modelos Estatísticos , Filogenia , Análise de Componente Principal , Conformação Proteica , Dobramento de Proteína , Mapeamento de Interação de Proteínas/estatística & dados numéricos , Mapas de Interação de Proteínas , Análise de Sequência de Proteína
11.
Genomics Proteomics Bioinformatics ; 21(5): 913-925, 2023 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-37001856

RESUMO

Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.


Assuntos
Algoritmos , Proteínas , Conformação Proteica , Proteínas/química , Redes Neurais de Computação , Dobramento de Proteína , Biologia Computacional/métodos
12.
PLoS One ; 17(4): e0266598, 2022.
Artigo em Inglês | MEDLINE | ID: mdl-35413070

RESUMO

Although social media has highly facilitated people's daily communication and dissemination of information, it has unfortunately been an ideal hotbed for the breeding and dissemination of Internet rumors. Therefore, automatically monitoring rumor dissemination in the early stage is of great practical significance. However, the existing detection methods fail to take full advantage of the semantics of the microblog information propagation graph. To address this shortcoming, this study models the information transmission network of a microblog as a heterogeneous graph with a variety of semantic information and then constructs a Microblog-HAN, which is a graph-based rumor detection model, to capture and aggregate the semantic information using attention layers. Specifically, after the initial textual and visual features of posts are extracted, the node-level attention mechanism combines neighbors of the microblog nodes to generate three groups of node embeddings with specific semantics. Moreover, semantic-level attention fuses different semantics to obtain the final node embedding of the microblog, which is then used as a classifier's input. Finally, the classification results of whether the microblog is a rumor or not are obtained. The experimental results on two real-world microblog rumor datasets, Weibo2016 and Weibo2021, demonstrate that the proposed Microblog-HAN can detect microblog rumors with an accuracy of over 92%, demonstrating its superiority over the most existing methods in identifying rumors from the view of the whole information transmission graph.


Assuntos
Mídias Sociais , Humanos , Semântica
13.
Nat Mach Intell ; 4(11): 1017-1028, 2022 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-37484202

RESUMO

Accurate prediction of damaging missense variants is critically important for interpreting a genome sequence. Although many methods have been developed, their performance has been limited. Recent advances in machine learning and the availability of large-scale population genomic sequencing data provide new opportunities to considerably improve computational predictions. Here we describe the graphical missense variant pathogenicity predictor (gMVP), a new method based on graph attention neural networks. Its main component is a graph with nodes that capture predictive features of amino acids and edges weighted by co-evolution strength, enabling effective pooling of information from the local protein context and functionally correlated distal positions. Evaluation of deep mutational scan data shows that gMVP outperforms other published methods in identifying damaging variants in TP53, PTEN, BRCA1 and MSH2. Furthermore, it achieves the best separation of de novo missense variants in neuro developmental disorder cases from those in controls. Finally, the model supports transfer learning to optimize gain- and loss-of-function predictions in sodium and calcium channels. In summary, we demonstrate that gMVP can improve interpretation of missense variants in clinical testing and genetic studies.

14.
Nat Commun ; 12(1): 510, 2021 01 21.
Artigo em Inglês | MEDLINE | ID: mdl-33479230

RESUMO

Accurate pathogenicity prediction of missense variants is critically important in genetic studies and clinical diagnosis. Previously published prediction methods have facilitated the interpretation of missense variants but have limited performance. Here, we describe MVP (Missense Variant Pathogenicity prediction), a new prediction method that uses deep residual network to leverage large training data sets and many correlated predictors. We train the model separately in genes that are intolerant of loss of function variants and the ones that are tolerant in order to take account of potentially different genetic effect size and mode of action. We compile cancer mutation hotspots and de novo variants from developmental disorders for benchmarking. Overall, MVP achieves better performance in prioritizing pathogenic missense variants than previous methods, especially in genes tolerant of loss of function variants. Finally, using MVP, we estimate that de novo coding variants contribute to 7.8% of isolated congenital heart disease, nearly doubling previous estimates.


Assuntos
Biologia Computacional/métodos , Aprendizado Profundo , Predisposição Genética para Doença/genética , Mutação de Sentido Incorreto , Neoplasias/genética , Algoritmos , Transtorno do Espectro Autista/diagnóstico , Transtorno do Espectro Autista/genética , Cardiopatias Congênitas/diagnóstico , Cardiopatias Congênitas/genética , Humanos , Neoplasias/diagnóstico , Reprodutibilidade dos Testes , Sensibilidade e Especificidade
15.
Sci Rep ; 6: 31484, 2016 08 16.
Artigo em Inglês | MEDLINE | ID: mdl-27526868

RESUMO

In this study, we present representative human contact networks among Chinese college students. Unlike schools in the US, human contacts within Chinese colleges are extremely clustered, partly due to the highly organized lifestyle of Chinese college students. Simulations of influenza spreading across real contact networks are in good accordance with real influenza records; however, epidemic simulations across idealized scale-free or small-world networks show considerable overestimation of disease prevalence, thus challenging the widely-applied idealized human contact models in epidemiology. Furthermore, the special contact pattern within Chinese colleges results in disease spreading patterns distinct from those of the US schools. Remarkably, class cancelation, though simple, shows a mitigating power equal to quarantine/vaccination applied on ~25% of college students, which quantitatively explains its success in Chinese colleges during the SARS period. Our findings greatly facilitate reliable prediction of epidemic prevalence, and thus should help establishing effective strategies for respiratory infectious diseases control.


Assuntos
Controle de Doenças Transmissíveis/métodos , Transmissão de Doença Infecciosa , Infecções Respiratórias/transmissão , China , Humanos , Infecções Respiratórias/prevenção & controle , Estudantes
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA