Results 1 - 20 of 66
1.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37930896

ABSTRACT

MOTIVATION: N-linked glycosylation is a frequently occurring post-translational protein modification that serves critical functions in protein folding, stability, trafficking, and recognition. It is involved in multiple biological processes, and alterations to it can result in various diseases. Therefore, identifying N-linked glycosylation sites is imperative for comprehending the mechanisms and systems underlying glycosylation. Because of the inherent experimental complexities, machine learning and deep learning have become indispensable tools for predicting these sites. RESULTS: In this context, a new approach called EMNGly has been proposed. EMNGly uses a pretrained protein language model (Evolutionary Scale Modeling) and a pretrained protein structure model (Inverse Folding Model) for feature extraction and a support vector machine for classification. Ten-fold cross-validation and independent tests show that this approach outperforms existing techniques. It achieves a Matthews correlation coefficient, sensitivity, specificity, and accuracy of 0.8282, 0.9343, 0.8934, and 0.9143, respectively, on a benchmark independent test set.
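
For readers unfamiliar with the four reported metrics, the following minimal sketch shows how they are conventionally derived from a binary confusion matrix. The function name and the example counts are illustrative assumptions; they are not taken from the EMNGly paper or its code.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive sites predicted correctly/incorrectly.
    tn/fp: negative sites predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = confusion_metrics(tp=90, fp=20, tn=80, fn=10)
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes may be imbalanced.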


Subjects
Post-Translational Protein Processing , Proteins , Glycosylation , Proteins/chemistry , Machine Learning , Support Vector Machine , Computational Biology/methods
2.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36916746

ABSTRACT

MOTIVATION: Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired. RESULTS: Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein. AVAILABILITY AND IMPLEMENTATION: The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.


Subjects
Proteins , Software , Amino Acid Sequence , Proteins/chemistry
3.
Anal Chem ; 95(2): 811-819, 2023 01 17.
Article in English | MEDLINE | ID: mdl-36547394

ABSTRACT

Accurate identification of glycan structures is highly desirable as they are intimately linked to their different functions. However, glycan samples generally exist as mixtures with multiple isomeric structures, making assignment of individual glycan components very challenging, even with the aid of multistage mass spectrometry (MSn). Here, we present an approach, GIPS-mix, for assignment of isomeric glycans within a mixture using an intelligent group-opting strategy. Our approach enumerates all possible combinations (groupings) of candidate glycans and opts in the best-matched glycan group(s) based on the similarity between the simulated spectra of each glycan group and the acquired experimental spectra of the mixture. In the case that a single group could not be elected, a tie break is performed by additional MSn scanning using intelligently selected precursors. With 11 standard mixtures and 6 human milk oligosaccharide fractions, we demonstrate the application of GIPS-mix in assignment of individual glycans in mixtures with high accuracy and efficiency.


Subjects
Oligosaccharides , Polysaccharides , Humans , Polysaccharides/chemistry , Oligosaccharides/analysis , Isomerism , Human Milk/chemistry
4.
Bioinformatics ; 38(4): 990-996, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34849579

ABSTRACT

MOTIVATION: Accurate prediction of protein structure relies heavily on exploiting multiple sequence alignment (MSA) for residue mutations and correlations as this information specifies protein tertiary structure. The widely used prediction approaches usually transform MSA into intermediate models, such as a position-specific scoring matrix or a profile hidden Markov model. These intermediate models, however, cannot fully represent residue mutations and correlations carried by MSA; hence, an effective way to directly exploit MSAs is highly desirable. RESULTS: Here, we report a novel sequence set network (called Seq-SetNet) to directly and effectively exploit MSA for protein structure prediction. Seq-SetNet uses an 'encoding and aggregation' strategy that consists of two key elements: (i) an encoding module that takes a component homologue in MSA as input, and encodes residue mutations and correlations into context-specific features for each residue; and (ii) an aggregation module to aggregate the features extracted from all component homologues, which are further transformed into structural properties for residues of the query protein. As Seq-SetNet encodes each homologue protein individually, it could consider both insertions and deletions, as well as long-distance correlations among residues, thus representing more information than the intermediate models. Moreover, the encoding module automatically learns effective features and thus avoids manual feature engineering. Using symmetric aggregation functions, Seq-SetNet processes the homologue proteins as a sequence set, making its prediction results invariant to the order of these proteins. On popular benchmark sets, we demonstrated the successful application of Seq-SetNet to predict secondary structure and torsion angles of residues with improved accuracy and efficiency. AVAILABILITY AND IMPLEMENTATION: The code and datasets are available through https://github.com/fusong-ju/Seq-SetNet.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
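
The order-invariance property described above can be illustrated with a toy sketch. The per-sequence encoder here is a hypothetical stand-in (a fixed character hash), not the learned encoding module of Seq-SetNet, and the rows are assumed to be pre-aligned and of equal length; only the symmetric aggregation step reflects the idea in the abstract.

```python
def encode(homologue):
    """Toy per-sequence encoder (hypothetical): one number per aligned column."""
    return [float(ord(residue) % 7) for residue in homologue]


def aggregate(encoded_rows, mode="max"):
    """Symmetric aggregation over homologues, column by column.

    max and mean do not depend on the order of their arguments, so the
    aggregated features do not depend on the order of the MSA rows.
    """
    columns = list(zip(*encoded_rows))
    if mode == "max":
        return [max(column) for column in columns]
    if mode == "mean":
        return [sum(column) / len(column) for column in columns]
    raise ValueError("mode must be 'max' or 'mean'")


def msa_features(msa, mode="max"):
    """Encode each homologue independently, then aggregate as a set."""
    return aggregate([encode(row) for row in msa], mode)


# Shuffling the alignment rows leaves the aggregated features unchanged.
toy_msa = ["ACDE", "ACDF", "GCDE"]
same = msa_features(toy_msa) == msa_features(toy_msa[::-1])
```

An order-sensitive aggregator, such as concatenating the encoded rows, would not have this property, which is the design point the abstract makes about treating the homologues as a set.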


Subjects
Proteins , Software , Sequence Alignment , Proteins/genetics , Proteins/chemistry , Protein Secondary Structure , Position-Specific Scoring Matrices , Algorithms
5.
BMC Bioinformatics ; 22(1): 439, 2021 Sep 15.
Article in English | MEDLINE | ID: mdl-34525939

ABSTRACT

BACKGROUND: Accurate prediction of protein tertiary structures is highly desired as the knowledge of protein structures provides invaluable insights into protein functions. We have designed two approaches to protein structure prediction, including a template-based modeling approach (called ProALIGN) and an ab initio prediction approach (called ProFOLD). Briefly speaking, ProALIGN aligns a target protein with templates through exploiting the patterns of context-specific alignment motifs and then builds the final structure with reference to the homologous templates. In contrast, ProFOLD uses an end-to-end neural network to estimate inter-residue distances of target proteins and builds structures that satisfy these distance constraints. These two approaches emphasize different characteristics of target proteins: ProALIGN exploits structure information of homologous templates of target proteins while ProFOLD exploits the co-evolutionary information carried by homologous protein sequences. Recent progress has shown that the combination of template-based modeling and ab initio approaches is promising. RESULTS: In the study, we present FALCON2, a web server that integrates ProALIGN and ProFOLD to provide high-quality protein structure prediction service. For a target protein, FALCON2 executes ProALIGN and ProFOLD simultaneously to predict possible structures and selects the most likely one as the final prediction result. We evaluated FALCON2 on widely-used benchmarks, including 104 CASP13 (the 13th Critical Assessment of protein Structure Prediction) targets and 91 CASP14 targets. In-depth examination suggests that when high-quality templates are available, ProALIGN is superior to ProFOLD and in other cases, ProFOLD shows better performance. By integrating these two approaches with different emphasis, FALCON2 server outperforms the two individual approaches and also achieves state-of-the-art performance compared with existing approaches. 
CONCLUSIONS: By integrating template-based modeling and ab initio approaches, FALCON2 provides an easy-to-use and high-quality protein structure prediction service for the community and we expect it to enable insights into a deep understanding of protein functions.


Subjects
Computer Neural Networks , Proteins , Amino Acid Sequence , Computers , Protein Conformation , Software
6.
BMC Bioinformatics ; 22(1): 533, 2021 Oct 30.
Article in English | MEDLINE | ID: mdl-34717539

ABSTRACT

BACKGROUND: Optical maps record locations of specific enzyme recognition sites within long genome fragments. This long-distance information enables aligning genome assembly contigs onto optical maps and ordering contigs into scaffolds. The generated scaffolds, however, often contain a large number of gaps. To fill these gaps, a feasible way is to search the genome assembly graph for the best-matching contig paths that connect boundary contigs of gaps. The combination of searching and evaluation procedures might be "searching followed by evaluation", which is infeasible for long gaps, or "searching by evaluation", which heavily relies on heuristics and thus usually yields unreliable contig paths. RESULTS: We here report an accurate and efficient approach to filling gaps of genome scaffolds with the aid of optical maps. Using simulated data from 12 species and real data from 3 species, we demonstrate the successful application of our approach in gap filling with improved accuracy and completeness of genome scaffolds. CONCLUSION: Our approach applies a sequential Bayesian updating technique to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, our approach achieves higher accuracy than the existing "searching by evaluation" strategy that relies on heuristics. Furthermore, unlike the "searching followed by evaluation" strategy enumerating all possible paths, our approach prunes the unlikely sub-paths and extends the highly-probable ones only, thus significantly increasing searching efficiency.


Subjects
Algorithms , Genome , Bayes Theorem , Contig Mapping , Restriction Mapping , DNA Sequence Analysis
7.
BMC Bioinformatics ; 21(1): 503, 2020 Nov 05.
Article in English | MEDLINE | ID: mdl-33153432

ABSTRACT

BACKGROUND: The formation of contacts among protein secondary structure elements (SSEs) is an important step in protein folding as it determines the topology of protein tertiary structure; hence, inferring inter-SSE contacts is crucial to protein structure prediction. One of the existing strategies infers inter-SSE contacts directly from the predicted probabilities of inter-residue contacts without any preprocessing, and thus suffers from the excessive noise in the predicted inter-residue contacts. Another strategy defines SSEs based on protein secondary structure prediction first, and then judges whether each candidate SSE pair could form a contact or not. However, it is difficult to accurately determine the boundaries of SSEs due to the errors in secondary structure prediction. The incorrectly deduced SSEs hinder subsequent prediction of the contacts among them. RESULTS: We here report an accurate approach to infer the inter-SSE contacts (called ISSEC) using the deep object detection technique. The design of ISSEC is based on the observation that, in the inter-residue contact map, the contacting SSEs usually form rectangular regions with characteristic patterns. Therefore, ISSEC infers inter-SSE contacts by detecting such rectangular regions. Unlike the existing approach that directly uses the predicted probabilities of inter-residue contact, ISSEC applies the deep convolution technique to extract high-level features from the inter-residue contacts. More importantly, ISSEC does not rely on pre-defined SSEs. Instead, ISSEC enumerates multiple candidate rectangular regions in the predicted inter-residue contact map, and for each region, ISSEC calculates a confidence score to measure whether it has characteristic patterns or not. ISSEC employs a greedy strategy to select non-overlapping regions with high confidence scores, and finally infers inter-SSE contacts according to these regions.
CONCLUSIONS: Comprehensive experimental results suggested that ISSEC outperformed the state-of-the-art approaches in predicting inter-SSE contacts. We further demonstrated the successful applications of ISSEC to improve prediction of both inter-residue contacts and tertiary structure as well.
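
The final selection step described above (greedily keeping non-overlapping regions with high confidence) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Region` type, the inclusive index convention, and the `min_score` threshold are hypothetical, and the confidence scores are taken as given rather than produced by a detection network as in ISSEC.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Candidate rectangle in a contact map (inclusive residue indices)."""
    row_start: int
    row_end: int
    col_start: int
    col_end: int
    score: float  # confidence that the region shows an inter-SSE contact


def overlaps(a: Region, b: Region) -> bool:
    """True if the two rectangles share at least one contact-map cell."""
    rows = a.row_start <= b.row_end and b.row_start <= a.row_end
    cols = a.col_start <= b.col_end and b.col_start <= a.col_end
    return rows and cols


def greedy_select(candidates: List[Region], min_score: float = 0.5) -> List[Region]:
    """Visit candidates by decreasing score; keep those that clear the
    threshold and do not overlap any region already kept."""
    selected: List[Region] = []
    for region in sorted(candidates, key=lambda r: r.score, reverse=True):
        if region.score < min_score:
            break  # every remaining candidate scores lower still
        if not any(overlaps(region, kept) for kept in selected):
            selected.append(region)
    return selected
```

A greedy pass of this kind is simple and fast but not guaranteed to maximize the total score of the selected set; the abstract does not say whether ISSEC adds further refinements.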


Subjects
Algorithms , Proteins/chemistry , Protein Databases , Membrane Proteins/chemistry , Beta-Strand Protein Conformation , Protein Secondary Structure
8.
Bioinformatics ; 35(14): i284-i294, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510699

ABSTRACT

MOTIVATION: Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory due to at least two reasons: (i) unlike genes, isoform-level functional annotations are scarce. (ii) The information of isoform functions is concealed in various types of data including isoform sequences, co-expression relationship among isoforms, etc. RESULTS: In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationship. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE could effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristics curve of 0.840 and area under the precision-recall curve of 0.581 over 4184 GO functional categories, which are significantly higher than the state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences. 
AVAILABILITY AND IMPLEMENTATION: https://github.com/haochenucr/DIFFUSE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning , Computer Neural Networks , Algorithms , Alternative Splicing
9.
Bioinformatics ; 35(17): 2991-2997, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30689704

ABSTRACT

MOTIVATION: Glycan identification has long been hampered by complicated branching patterns and various isomeric structures of glycans. Multistage mass spectrometry (MSn) is a promising glycan identification technique as it generates multiple-level fragments of a glycan, which can be explored to deduce the branching pattern of the glycan and further distinguish it from other candidates with identical mass. However, automatic glycan identification remains a challenge since it mainly relies on expertise to guide an MSn instrument to generate spectra. RESULTS: Here, we propose a novel method, named bestFSA, based on a best-first search algorithm to guide the process of spectrum generation in glycan identification using MSn. BestFSA is able to select the most appropriate peaks for the next round of experiments and complete the identification using as few experimental rounds as possible. Our analysis of seven representative glycans shows that bestFSA correctly and efficiently distinguishes actual glycans, suggesting that bestFSA could be used in practical glycan identification. The combination of the MSn technology with bestFSA should greatly facilitate the automatic identification of glycan branching patterns, with significantly improved identification sensitivity, and reduce the time and cost of MSn experiments. AVAILABILITY AND IMPLEMENTATION: http://glycan.ict.ac.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
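
As background, a generic best-first search can be sketched as below. This is not the bestFSA implementation: the state representation, the `expand` function, and the `score` function are placeholders a caller would supply (in the glycan setting they would presumably relate to candidate peaks and their expected informativeness, which the abstract does not specify).

```python
import heapq
from itertools import count


def best_first_search(start, is_goal, expand, score):
    """Generic best-first search over hashable states.

    Always expands the frontier state with the lowest score first.
    Returns (goal_state, expansions), or (None, expansions) if the
    reachable states are exhausted without meeting the goal.
    """
    tie_breaker = count()  # keeps heap entries comparable when scores tie
    frontier = [(score(start), next(tie_breaker), start)]
    visited = {start}
    expansions = 0
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state, expansions
        expansions += 1
        for successor in expand(state):
            if successor not in visited:
                visited.add(successor)
                heapq.heappush(
                    frontier, (score(successor), next(tie_breaker), successor)
                )
    return None, expansions
```

The number of expansions plays the role of "experimental rounds" in this analogy: a more informative scoring function reaches the goal with fewer of them.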


Subjects
Algorithms , Polysaccharides , Mass Spectrometry
10.
BMC Bioinformatics ; 20(Suppl 3): 135, 2019 Mar 29.
Article in English | MEDLINE | ID: mdl-30925867

ABSTRACT

BACKGROUND: The ab initio approaches to protein structure prediction usually employ the Monte Carlo technique to search the structural conformation that has the lowest energy. However, the widely-used energy functions are usually ineffective for conformation search. How to construct an effective energy function remains a challenging task. RESULTS: Here, we present a framework to construct effective energy functions for protein structure prediction. Unlike existing energy functions only requiring the native structure to be the lowest one, we attempt to maximize the attraction-basin where the native structure lies in the energy landscape. The underlying rationale is that each energy function determines a specific energy landscape together with a native attraction-basin, and the larger the attraction-basin is, the more likely for the Monte Carlo search procedure to find the native structure. Following this rationale, we constructed effective energy functions as follows: i) To explore the native attraction-basin determined by a certain energy function, we performed reverse Monte Carlo sampling starting from the native structure, identifying the structural conformations on the edge of attraction-basin. ii) To broaden the native attraction-basin, we smoothened the edge points of attraction-basin through tuning weights of energy terms, thus acquiring an improved energy function. Our framework alternates the broadening attraction-basin and reverse sampling steps (thus called BARS) until the native attraction-basin is sufficiently large. We present extensive experimental results to show that using the BARS framework, the constructed energy functions could greatly facilitate protein structure prediction in improving the quality of predicted structures and speeding up conformation search. 
CONCLUSION: Using the BARS framework, we constructed effective energy functions for protein structure prediction, which could improve the quality of predicted structures and speed up conformation search as well.


Subjects
Computational Biology/methods , Monte Carlo Method , Proteins/chemistry , Algorithms , Protein Databases , Protein Conformation , Thermodynamics
11.
BMC Bioinformatics ; 20(1): 616, 2019 Nov 29.
Article in English | MEDLINE | ID: mdl-31783729

ABSTRACT

Following publication of the original article [1], the author explained that there are several errors in the original article.

12.
BMC Bioinformatics ; 20(1): 537, 2019 Oct 29.
Article in English | MEDLINE | ID: mdl-31664895

ABSTRACT

BACKGROUND: Accurate prediction of inter-residue contacts of a protein is important to calculating its tertiary structure. Analysis of co-evolutionary events among residues has been proved effective in inferring inter-residue contacts. The Markov random field (MRF) technique, although being widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate; in contrast, approximations to the actual likelihood, say pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. RESULTS: In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA using pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite-likelihood, i.e., the product of conditional probability of all residue pairs. Composite likelihood has been theoretically proved as a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including PSICOV dataset and CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy. ii) When equipped with deep learning technique for refinement, the prediction accuracy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedure. We further present a successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. CONCLUSIONS: Composite likelihood maximization algorithm can efficiently estimate the parameters of Markov Random Fields and can improve the prediction accuracy of protein inter-residue contacts.
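
The two approximations contrasted above can be written side by side. The notation is an assumption introduced here for illustration: $x = (x_1, \dots, x_L)$ is one aligned sequence of length $L$, $\theta$ denotes the MRF parameters, and $x_{\setminus S}$ denotes all positions outside the set $S$. The conditioning set in the pairwise terms is also an assumption; the abstract states only that the composite likelihood is a product over all residue pairs.

```latex
% Pseudo-likelihood (as in plmDCA): one conditional term per residue
\mathrm{PL}(\theta) = \prod_{i=1}^{L} P\left(x_i \mid x_{\setminus \{i\}};\, \theta\right)

% Composite likelihood (as described for clmDCA): one conditional term per residue pair
\mathrm{CL}(\theta) = \prod_{1 \le i < j \le L} P\left(x_i, x_j \mid x_{\setminus \{i,j\}};\, \theta\right)
```

In both cases each factor requires normalizing over only one or two positions rather than over all $L$ positions jointly, which is why both avoid the cost of the full partition function.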


Subjects
Deep Learning , Proteins/chemistry , Algorithms , Probability
13.
Bioinformatics ; 34(13): i263-i273, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949980

ABSTRACT

Motivation: Template-based modeling, including homology modeling and protein threading, is a popular method for protein 3D structure prediction. However, alignment generation and template selection for protein sequences without close templates remain very challenging. Results: We present a new method called DeepThreader to improve protein threading, including both alignment generation and template selection, by making use of deep learning (DL) and residue co-variation information. Our method first employs DL to predict inter-residue distance distribution from residue co-variation and sequential information (e.g. sequence profile and predicted secondary structure), and then builds sequence-template alignment by integrating predicted distance information and sequential features through an ADMM algorithm. Experimental results suggest that predicted inter-residue distance is helpful to both protein alignment and template selection especially for protein sequences without very close templates, and that our method outperforms currently popular homology modeling method HHpred and threading method CNFpred by a large margin and greatly outperforms the latest contact-assisted protein threading method EigenTHREADER. Availability and implementation: http://raptorx.uchicago.edu/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning , Molecular Models , Proteins/metabolism , Sequence Alignment/methods , Protein Sequence Analysis/methods , Protein Secondary Structure , Proteins/chemistry , Software
14.
Anal Chem ; 90(24): 14412-14422, 2018 12 18.
Article in English | MEDLINE | ID: mdl-30444352

ABSTRACT

Glycans play important roles in a variety of biological processes. Their activities are closely related to the fine details of their structures. Unlike the simple linear chains of proteins, branching is a unique feature of glycan structures, making their identification extremely challenging. Multistage mass spectrometry (MSn) has become the primary method for glycan structural identification. The major difficulty for MSn is the selection of fragment ions as precursors for the next stage of scanning. Widely used strategies are either manual selection by experienced experts, which requires considerable expertise and time, or simply selecting the most intense peaks by which the product-ion spectrum generated may not be structurally informative and therefore fail to make the assignment. We here report a glycan "intelligent precursor selection" strategy (GIPS) to guide MSn experiments. Our approach consists of two key elements, an empirical model to calculate candidate glycan's probability and a statistical model to calculate fragment ion's distinguishing power in order to select the structurally most informative peak as the precursor for next-stage scanning. Using 15 glycan standards, including three pairs with isomeric sequences and eight variously fucosylated oligosaccharides on linear or branched hexasaccharide backbones isolated from a human milk oligosaccharide fraction by HPLC, we demonstrate its successful application to branching pattern analysis with improved efficiency and sensitivity and also the potential for automated operation.


Subjects
Polysaccharides/chemistry , Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry/methods , Animals , Automation , Humans , Milk , Oligosaccharides/analysis
15.
Bioinformatics ; 33(23): 3749-3757, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-28961795

ABSTRACT

MOTIVATION: Accurate recognition of protein fold types is a key step for template-based prediction of protein structures. The existing approaches to fold recognition mainly exploit the features derived from alignments of query protein against templates. These approaches have been shown to be successful for fold recognition at family level, but usually fail at superfamily/fold levels. To overcome this limitation, one of the key points is to explore more structurally informative features of proteins. Although residue-residue contacts carry abundant structural information, how to thoroughly exploit this information for fold recognition still remains a challenge. RESULTS: In this study, we present an approach (called DeepFR) to improve fold recognition at superfamily/fold levels. The basic idea of our approach is to extract fold-specific features from predicted residue-residue contacts of proteins using the deep convolutional neural network (DCNN) technique. Based on these fold-specific features, we calculated the similarity between the query protein and templates, and then assigned the query protein the fold type of the most similar template. DCNN has shown excellent performance in image feature extraction and image recognition; the rationale underlying the application of DCNN for fold recognition is that contact likelihood maps are essentially analogous to images, as they both display compositional hierarchy. Experimental results on the LINDAHL dataset suggest that even using the extracted fold-specific features alone, our approach achieved a success rate comparable to the state-of-the-art approaches. When further combining these features with traditional alignment-related features, the success rate of our approach increased to 92.3%, 82.5% and 78.8% at family, superfamily and fold levels, respectively, which is about 18% higher than the state-of-the-art approach at fold level, 6% higher at superfamily level and 1% higher at family level.
An independent assessment on SCOP_TEST dataset showed consistent performance improvement, indicating robustness of our approach. Furthermore, bi-clustering results of the extracted features are compatible with fold hierarchy of proteins, implying that these features are fold-specific. Together, these results suggest that the features extracted from predicted contacts are orthogonal to alignment-related features, and the combination of them could greatly facilitate fold recognition at superfamily/fold levels and template-based prediction of protein structures. AVAILABILITY AND IMPLEMENTATION: Source code of DeepFR is freely available through https://github.com/zhujianwei31415/deepfr, and a web server is available through http://protein.ict.ac.cn/deepfr. CONTACT: zheng@itp.ac.cn or dbu@ict.ac.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
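
The assignment step described above (give the query the fold type of its most similar template) amounts to a nearest-neighbour lookup in feature space. The sketch below assumes cosine similarity and plain Python lists as feature vectors; the abstract does not state which similarity measure DeepFR uses, and the function names and fold labels are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_fold(query_features, templates):
    """Return the fold label of the template most similar to the query.

    templates: non-empty list of (fold_label, feature_vector) pairs.
    """
    best_label, _ = max(
        templates,
        key=lambda item: cosine_similarity(query_features, item[1]),
    )
    return best_label


# Hypothetical three-dimensional features for two templates.
toy_templates = [("fold_A", [1.0, 0.0, 0.0]), ("fold_B", [0.0, 1.0, 1.0])]
predicted = assign_fold([0.1, 0.9, 0.8], toy_templates)
```

In practice the feature vectors would come from the trained network, and the quality of the assignment depends entirely on how fold-specific those features are.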


Subjects
Protein Folding , Algorithms , Computer Neural Networks , Proteins/chemistry , Software
16.
BMC Bioinformatics ; 18(Suppl 3): 70, 2017 Mar 14.
Article in English | MEDLINE | ID: mdl-28361691

ABSTRACT

BACKGROUND: Residues in a protein might be buried inside or exposed to the solvent surrounding the protein. The buried residues usually form hydrophobic cores to maintain the structural integrity of proteins while the exposed residues are tightly related to protein functions. Thus, the accurate prediction of solvent accessibility of residues will greatly facilitate our understanding of both structure and functionalities of proteins. Most of the state-of-the-art prediction approaches consider the burial state of each residue independently, thus neglecting the correlations among residues. RESULTS: In this study, we present a high-order conditional random field model that considers burial states of all residues in a protein simultaneously. Our approach exploits not only the correlation among adjacent residues but also the correlation among long-range residues. Experimental results showed that by exploiting the correlation among residues, our approach outperformed the state-of-the-art approaches in prediction accuracy. In-depth case studies also showed that by using the high-order statistical model, the errors committed by the bidirectional recurrent neural network and chain conditional random field models were successfully corrected. CONCLUSIONS: Our methods enable the accurate prediction of residue burial states, which should greatly facilitate protein structure prediction and evaluation.


Subjects
Theoretical Models , Proteins/chemistry , Factual Databases , Hydrophobic and Hydrophilic Interactions , Protein Conformation , Reproducibility of Results , Solvents/chemistry
17.
Bioinformatics ; 32(3): 462-4, 2016 Feb 01.
Article in English | MEDLINE | ID: mdl-26454278

ABSTRACT

SUMMARY: The protein structure prediction approaches can be categorized into template-based modeling (including homology modeling and threading) and free modeling. However, the existing threading tools perform poorly on remote homologous proteins. Thus, improving fold recognition for remote homologous proteins remains a challenge. Besides, the proteome-wide structure prediction poses another challenge of increasing prediction throughput. In this study, we presented FALCON@home as a protein structure prediction server focusing on remote homologue identification. The design of FALCON@home is based on the observation that a structural template, especially for remote homologous proteins, consists of conserved regions interweaved with highly variable regions. The highly variable regions lead to vague alignments in threading approaches. Thus, FALCON@home first extracts conserved regions from each template and then aligns a query protein with conserved regions only rather than the full-length template directly. This helps avoid the vague alignments rooted in highly variable regions, improving remote homologue identification. We implemented FALCON@home using the Berkeley Open Infrastructure of Network Computing (BOINC) volunteer computing protocol. With computation power donated from over 20,000 volunteer CPUs, FALCON@home shows a throughput as high as processing of over 1000 proteins per day. In the Critical Assessment of protein Structure Prediction (CASP11), the FALCON@home-based prediction was ranked the 12th in the template-based modeling category. As an application, the structures of 880 mouse mitochondria proteins were predicted, which revealed the significant correlation between protein half-lives and protein structural factors. AVAILABILITY AND IMPLEMENTATION: FALCON@home is freely available at http://protein.ict.ac.cn/FALCON/. CONTACT: shuaicli@cityu.edu.hk, dbu@ict.ac.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Protein Conformation , Proteins/chemistry , Sequence Alignment/methods , Protein Sequence Analysis/methods , Software , Animals , Computational Biology/methods , Protein Databases , High-Throughput Screening Assays , Mice
18.
Biochem Biophys Res Commun ; 472(1): 217-22, 2016 Mar 25.
Article in English | MEDLINE | ID: mdl-26920058

ABSTRACT

Strategies for correlation analysis in protein contact prediction often encounter two challenges: indirect coupling among residues, and background correlations caused mainly by phylogenetic biases. While various studies have addressed how to disentangle indirect coupling, the removal of background correlations remains unresolved. Here, we present an approach for removing background correlations via low-rank and sparse decomposition (LRS) of a residue correlation matrix. The correlation matrix can be constructed using either local inference strategies (e.g., mutual information, or MI) or global inference strategies (e.g., direct coupling analysis, or DCA). In our approach, the correlation matrix is decomposed into two components: a low-rank component representing background correlations, and a sparse component representing true correlations. Residue contacts are then inferred from the sparse component of the correlation matrix. We trained our LRS-based method on the PSICOV dataset and tested it on both the GREMLIN and CASP11 datasets. Our experimental results suggest that LRS significantly improves contact prediction precision. For example, when equipped with the LRS technique, the prediction precision of MI and mfDCA increased from 0.25 to 0.67 and from 0.58 to 0.70, respectively (top L/10 predicted contacts, sequence separation: 5 AA, dataset: GREMLIN). In addition, our LRS technique consistently outperforms the popular denoising technique APC (average product correction) on both local (MI_LRS: 0.67 vs MI_APC: 0.34) and global measures (mfDCA_LRS: 0.70 vs mfDCA_APC: 0.67). Interestingly, we found that when equipped with our LRS technique, local inference strategies performed comparably to global inference strategies, implying that the LRS technique narrows the performance gap between the two. Overall, our LRS technique greatly facilitates protein contact prediction by removing background correlations. An implementation of the approach called COLORS (improving COntact prediction using LOw-Rank and Sparse matrix decomposition) is available from http://protein.ict.ac.cn/COLORS/.
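
The decomposition idea in this abstract can be illustrated with a minimal sketch. It is not the COLORS implementation: the rank-1 background model, the threshold `lam`, the alternating update scheme, and all function names are assumptions chosen for illustration. The `apc` function follows the standard average product correction formula, with means taken over all matrix entries (diagonal included) for simplicity.

```python
def apc(matrix):
    """Average product correction: M_ij - row_mean_i * col_mean_j / overall_mean."""
    n = len(matrix)
    row_mean = [sum(row) / n for row in matrix]
    col_mean = [sum(matrix[i][j] for i in range(n)) / n for j in range(n)]
    overall = sum(row_mean) / n
    if overall == 0.0:
        # Nothing to correct against; return an unchanged copy.
        return [row[:] for row in matrix]
    return [[matrix[i][j] - row_mean[i] * col_mean[j] / overall
             for j in range(n)] for i in range(n)]


def rank_one(matrix, steps=100):
    """Rank-1 approximation of a symmetric matrix via power iteration.

    Assumption: the all-ones start vector is not orthogonal to the
    dominant eigenvector, which holds for the toy data used here.
    """
    n = len(matrix)
    v = [1.0] * n
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return [[0.0] * n for _ in range(n)]
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eig = sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))
    return [[eig * v[i] * v[j] for j in range(n)] for i in range(n)]


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def low_rank_sparse(matrix, lam=0.5, rounds=30):
    """Alternate between a rank-1 background fit and a sparse residual."""
    n = len(matrix)
    sparse = [[0.0] * n for _ in range(n)]
    low = [[0.0] * n for _ in range(n)]
    for _ in range(rounds):
        residual = [[matrix[i][j] - sparse[i][j] for j in range(n)]
                    for i in range(n)]
        low = rank_one(residual)
        sparse = [[soft_threshold(matrix[i][j] - low[i][j], lam)
                   for j in range(n)] for i in range(n)]
    return low, sparse


# Toy correlation matrix: uniform background plus two symmetric "contacts".
n = 6
toy = [[1.0] * n for _ in range(n)]
for i, j in [(0, 4), (1, 5)]:
    toy[i][j] += 2.0
    toy[j][i] += 2.0
background, contacts = low_rank_sparse(toy)
predicted = sorted((i, j) for i in range(n) for j in range(n)
                   if contacts[i][j] != 0.0)
print(predicted)
```

In this toy case the uniform background ends up in the low-rank component and only the injected pairs survive in the sparse component. A realistic background would need a higher rank and a tuned threshold.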


Subjects
Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Algorithms , Computer Simulation , Protein Databases , Molecular Evolution , Molecular Models , Statistical Models , Phylogeny , Principal Component Analysis , Protein Conformation , Protein Folding , Protein Interaction Mapping/statistics & numerical data , Protein Interaction Maps , Protein Sequence Analysis
19.
BMC Bioinformatics ; 16: 110, 2015 Apr 02.
Article in English | MEDLINE | ID: mdl-25887925

ABSTRACT

BACKGROUND: Tandem mass spectrometry (MS/MS) is a key technique for peptide identification. MS/MS-based peptide identification approaches can be categorized into two families: de novo and database search. Both types of approach can benefit from accurate prediction of the theoretical spectrum. A theoretical spectrum consists of the m/z and intensity of possibly occurring ions, which are estimated by simulating the spectrum generation process. Extensive research has been conducted on theoretical spectrum prediction; however, existing prediction methods suffer from low prediction accuracy due to oversimplifications in the spectrum simulation process. RESULTS: In this study, we present an open-source software package called OpenMS-Simulator that predicts the theoretical spectrum for a given peptide sequence. Based on the mobile-proton hypothesis for peptide fragmentation, OpenMS-Simulator trains a closed-form model for the intensity ratio of adjacent y ions, from which the whole theoretical spectrum can be constructed. On a collection of representative spectra datasets with annotated peptide sequences, experimental results suggest that OpenMS-Simulator can predict theoretical spectra with considerable accuracy. The study also presents an application of OpenMS-Simulator: the similarity between theoretical spectra and query spectra can be used to re-rank the peptide sequences reported by SEQUEST/X!Tandem. CONCLUSIONS: OpenMS-Simulator implements a novel model to predict the theoretical spectrum for a given peptide sequence. Compared with existing theoretical spectrum prediction tools such as MassAnalyzer and MSSimulator, our method not only simplifies the computation process but also improves the prediction accuracy. Currently, OpenMS-Simulator supports the prediction of CID and HCD spectra for doubly charged peptides. Extending it to cover more fragmentation models and to support multiply charged peptides remains future work.
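
How a theoretical spectrum can be assembled from adjacent y-ion intensity ratios may be easier to see in a short sketch. It is not OpenMS-Simulator code: the function names are hypothetical, and the ratios are supplied by hand as placeholders, since the trained closed-form model is not given in the abstract. The residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def y_ion_mz(peptide, charge=1):
    """Return m/z values for y1 .. y(n-1), the C-terminal fragments."""
    if len(peptide) < 2:
        raise ValueError("need at least two residues to form a y ion series")
    mzs = []
    mass = WATER  # every y ion carries one water on top of its residues
    for residue in reversed(peptide[1:]):
        mass += MONO[residue]
        mzs.append((mass + charge * PROTON) / charge)
    return mzs


def intensities_from_ratios(ratios):
    """Chain adjacent ratios I(y[k+1]) / I(y[k]) into relative intensities.

    The first ion is given intensity 1, then everything is rescaled so
    that the strongest peak equals 1.
    """
    values = [1.0]
    for ratio in ratios:
        values.append(values[-1] * ratio)
    top = max(values)
    return [v / top for v in values]


def theoretical_spectrum(peptide, ratios, charge=1):
    """Pair each y-ion m/z with its relative intensity."""
    mzs = y_ion_mz(peptide, charge)
    if len(ratios) != len(mzs) - 1:
        raise ValueError("need exactly one ratio per adjacent y-ion pair")
    return list(zip(mzs, intensities_from_ratios(ratios)))


# Placeholder ratios for illustration only; a real model would predict them.
spectrum = theoretical_spectrum("GASK", ratios=[2.0, 0.5])
for mz, intensity in spectrum:
    print(f"{mz:.4f}\t{intensity:.2f}")
```

Only the ratios need to be predicted. The absolute scale is fixed afterwards by normalization, which is why a model of adjacent-ion ratios is enough to build the whole y-ion series.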


Subjects
Factual Databases , Theoretical Models , Peptide Fragments/analysis , Proteins/analysis , Software , Tandem Mass Spectrometry/methods , Algorithms , Computer Simulation , Humans
20.
BMC Bioinformatics ; 16 Suppl 18: S15, 2015.
Article in English | MEDLINE | ID: mdl-26681607

ABSTRACT

BACKGROUND: In recent years, high-throughput, non-invasive Raman spectrometry has matured into an effective approach for identifying individual cells by species, even in complex, mixed populations. Raman profiling is an appealing optical microscopic method to achieve this. To fully utilize Raman profiling for single-cell analysis, an extensive understanding of Raman spectra is necessary to answer questions such as which filtering methodologies are effective for pre-processing Raman spectra, which strains can be distinguished by Raman spectra, and which features serve best as Raman-based biomarkers for single cells. RESULTS: In this work, we propose an approach called rDisc that discretizes the original Raman spectrum into only a few (usually fewer than 20) representative peaks (Raman shifts). The approach has advantages in removing noise and condensing the original spectrum. In particular, effective signal processing procedures were designed to eliminate noise, using wavelet transform denoising, baseline correction, and signal normalization. In the discretizing process, representative peaks were selected to significantly decrease the Raman data size. More importantly, the selected peaks are chosen to be suitable as key biological markers for differentiating species and other cellular features. Additionally, the classification performance of discretized spectra was found to be comparable to that of full spectra with more than 1000 Raman shifts. Overall, the discretized spectrum needs about 5% of the storage space of a full spectrum, and processing is considerably faster. This makes rDisc clearly superior to other methods for single-cell classification.
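
The discretization step described in this abstract can be sketched as follows. This is not the rDisc implementation: the function names and the peak-selection rule (strict local maxima ranked by height) are assumptions, minimum subtraction stands in for the baseline correction, and the wavelet denoising step is omitted entirely.

```python
def normalize(intensities):
    """Crude baseline removal (subtract the minimum), then scale to max 1."""
    floor = min(intensities)
    shifted = [x - floor for x in intensities]
    top = max(shifted)
    if top == 0.0:
        return shifted  # flat spectrum: nothing to scale
    return [x / top for x in shifted]


def local_maxima(values):
    """Indices of points strictly higher than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]


def discretize(shifts, intensities, k=20):
    """Reduce a spectrum to at most k (Raman shift, intensity) peaks."""
    if len(shifts) != len(intensities):
        raise ValueError("shifts and intensities must have equal length")
    scaled = normalize(intensities)
    peaks = local_maxima(scaled)
    # Keep the k tallest peaks, then report them in order of Raman shift.
    tallest = sorted(peaks, key=lambda i: scaled[i], reverse=True)[:k]
    return [(shifts[i], scaled[i]) for i in sorted(tallest)]


# Toy spectrum: Raman shifts in cm^-1 with arbitrary intensity units.
shifts = [600, 610, 620, 630, 640, 650, 660]
intensities = [1.0, 5.0, 2.0, 3.0, 9.0, 4.0, 1.0]
print(discretize(shifts, intensities, k=2))
print(discretize(shifts, intensities, k=1))
```

Storing only the selected (shift, intensity) pairs is what reduces the data size. With noisy measurements, a denoising pass before peak picking would be needed to avoid selecting spurious maxima.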


Subjects
Bacteria/chemistry , Raman Spectrum Analysis , Factual Databases , Discriminant Analysis , Phenotype , Principal Component Analysis , Computer-Assisted Signal Processing , Single-Cell Analysis