ABSTRACT
MOTIVATION: Accurate prediction of protein structure relies heavily on exploiting multiple sequence alignment (MSA) for residue mutations and correlations, as this information specifies protein tertiary structure. Widely used prediction approaches usually transform the MSA into intermediate models, such as a position-specific scoring matrix or a profile hidden Markov model. These intermediate models, however, cannot fully represent the residue mutations and correlations carried by the MSA; hence, an effective way to exploit MSAs directly is highly desirable. RESULTS: Here, we report a novel sequence set network (called Seq-SetNet) to directly and effectively exploit MSA for protein structure prediction. Seq-SetNet uses an 'encoding and aggregation' strategy that consists of two key elements: (i) an encoding module that takes a component homologue in the MSA as input and encodes residue mutations and correlations into context-specific features for each residue; and (ii) an aggregation module that aggregates the features extracted from all component homologues, which are further transformed into structural properties for residues of the query protein. As Seq-SetNet encodes each homologous protein individually, it can consider both insertions and deletions, as well as long-distance correlations among residues, thus representing more information than the intermediate models. Moreover, the encoding module automatically learns effective features and thus avoids manual feature engineering. Using symmetric aggregation functions, Seq-SetNet processes the homologous proteins as a sequence set, making its prediction results invariant to the order of these proteins. On popular benchmark sets, we demonstrated the successful application of Seq-SetNet to predict secondary structure and torsion angles of residues with improved accuracy and efficiency. AVAILABILITY AND IMPLEMENTATION: The code and datasets are available through https://github.com/fusong-ju/Seq-SetNet.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
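The order invariance that follows from symmetric aggregation can be illustrated with a toy sketch. The `encode` function below is a hypothetical, hand-written stand-in (three flags per residue) and is not the learned encoding module of Seq-SetNet; the choice of max and mean pooling is likewise an assumption. Only the overall shape, per-homologue encoding followed by a symmetric reduction over the set, mirrors the strategy described in the abstract.

```python
from typing import List

def encode(query: str, homologue: str) -> List[List[float]]:
    """Toy per-residue encoder (assumption): [match, mismatch, gap] flags."""
    feats = []
    for q, h in zip(query, homologue):
        if h == "-":
            feats.append([0.0, 0.0, 1.0])
        elif h == q:
            feats.append([1.0, 0.0, 0.0])
        else:
            feats.append([0.0, 1.0, 0.0])
    return feats

def aggregate(encoded: List[List[List[float]]]) -> List[List[float]]:
    """Symmetric aggregation: per-position max and mean over all homologues."""
    n_seq = len(encoded)
    length = len(encoded[0])
    dim = len(encoded[0][0])
    out = []
    for pos in range(length):
        maxes = [max(e[pos][d] for e in encoded) for d in range(dim)]
        means = [sum(e[pos][d] for e in encoded) / n_seq for d in range(dim)]
        out.append(maxes + means)
    return out

query = "MKVL"
msa = ["MKVL", "MR-L", "MKIL"]  # aligned homologues, same length as the query

features = aggregate([encode(query, h) for h in msa])
shuffled = aggregate([encode(query, h) for h in reversed(msa)])
print(features == shuffled)  # the set-level features ignore homologue order
```

Because max and mean do not depend on the order of their arguments, any permutation of the homologues yields the same per-residue features.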
Subjects
Proteins; Software; Sequence Alignment; Proteins/genetics; Proteins/chemistry; Protein Structure, Secondary; Position-Specific Scoring Matrices; Algorithms
ABSTRACT
BACKGROUND: Accurate prediction of protein tertiary structures is highly desired, as knowledge of protein structures provides invaluable insights into protein functions. We have designed two approaches to protein structure prediction: a template-based modeling approach (called ProALIGN) and an ab initio prediction approach (called ProFOLD). Briefly, ProALIGN aligns a target protein with templates by exploiting the patterns of context-specific alignment motifs and then builds the final structure with reference to the homologous templates. In contrast, ProFOLD uses an end-to-end neural network to estimate inter-residue distances of target proteins and builds structures that satisfy these distance constraints. These two approaches emphasize different characteristics of target proteins: ProALIGN exploits the structure information of homologous templates of target proteins, while ProFOLD exploits the co-evolutionary information carried by homologous protein sequences. Recent progress has shown that combining template-based modeling and ab initio approaches is promising. RESULTS: In this study, we present FALCON2, a web server that integrates ProALIGN and ProFOLD to provide a high-quality protein structure prediction service. For a target protein, FALCON2 executes ProALIGN and ProFOLD simultaneously to predict possible structures and selects the most likely one as the final prediction result. We evaluated FALCON2 on widely used benchmarks, including 104 CASP13 (the 13th Critical Assessment of protein Structure Prediction) targets and 91 CASP14 targets. In-depth examination suggests that when high-quality templates are available, ProALIGN is superior to ProFOLD, and in other cases ProFOLD shows better performance. By integrating these two approaches with different emphases, the FALCON2 server outperforms the two individual approaches and also achieves state-of-the-art performance compared with existing approaches.
CONCLUSIONS: By integrating template-based modeling and ab initio approaches, FALCON2 provides an easy-to-use and high-quality protein structure prediction service for the community, and we expect it to enable a deeper understanding of protein functions.
Subjects
Neural Networks, Computer; Proteins; Amino Acid Sequence; Computers; Protein Conformation; Software
ABSTRACT
BACKGROUND: Optical maps record the locations of specific enzyme recognition sites within long genome fragments. This long-distance information enables aligning genome assembly contigs onto optical maps and ordering contigs into scaffolds. The generated scaffolds, however, often contain a large number of gaps. A feasible way to fill these gaps is to search the genome assembly graph for the best-matching contig paths that connect the boundary contigs of each gap. Searching and evaluation can be combined either as "searching followed by evaluation", which is infeasible for long gaps, or as "searching by evaluation", which relies heavily on heuristics and thus usually yields unreliable contig paths. RESULTS: We here report an accurate and efficient approach to filling gaps in genome scaffolds with the aid of optical maps. Using simulated data from 12 species and real data from 3 species, we demonstrate the successful application of our approach to gap filling, with improved accuracy and completeness of genome scaffolds. CONCLUSION: Our approach applies a sequential Bayesian updating technique to measure the similarity between optical maps and candidate contig paths. Using this similarity to guide path searching, our approach achieves higher accuracy than the existing "searching by evaluation" strategy that relies on heuristics. Furthermore, unlike the "searching followed by evaluation" strategy, which enumerates all possible paths, our approach prunes the unlikely sub-paths and extends only the highly probable ones, thus significantly increasing search efficiency.
Subjects
Algorithms; Genome; Bayes Theorem; Contig Mapping; Restriction Mapping; Sequence Analysis, DNA
ABSTRACT
BACKGROUND: The formation of contacts among protein secondary structure elements (SSEs) is an important step in protein folding, as it determines the topology of protein tertiary structure; hence, inferring inter-SSE contacts is crucial to protein structure prediction. One existing strategy infers inter-SSE contacts directly from the predicted probabilities of inter-residue contacts without any preprocessing, and thus suffers from the excessive noise in the predicted inter-residue contacts. Another strategy first defines SSEs based on protein secondary structure prediction, and then judges whether each candidate SSE pair could form a contact. However, it is difficult to accurately determine the boundaries of SSEs because of errors in secondary structure prediction, and incorrectly deduced SSEs hinder the subsequent prediction of the contacts among them. RESULTS: We here report an accurate approach to inferring inter-SSE contacts (called ISSEC) using the deep object detection technique. The design of ISSEC is based on the observation that, in the inter-residue contact map, contacting SSEs usually form rectangular regions with characteristic patterns. Therefore, ISSEC infers inter-SSE contacts by detecting such rectangular regions. Unlike the existing approach that directly uses the predicted probabilities of inter-residue contacts, ISSEC applies the deep convolution technique to extract high-level features from the inter-residue contacts. More importantly, ISSEC does not rely on pre-defined SSEs. Instead, ISSEC enumerates multiple candidate rectangular regions in the predicted inter-residue contact map and, for each region, calculates a confidence score that measures whether it has the characteristic patterns. ISSEC then employs a greedy strategy to select non-overlapping regions with high confidence scores, and finally infers inter-SSE contacts according to these regions.
CONCLUSIONS: Comprehensive experimental results suggested that ISSEC outperformed the state-of-the-art approaches in predicting inter-SSE contacts. We further demonstrated the successful application of ISSEC to improving the prediction of both inter-residue contacts and tertiary structure.
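The final selection step, greedily keeping non-overlapping regions with high confidence, can be sketched as follows. This is a minimal illustration under stated assumptions: the box representation, the `min_score` threshold, and the example scores are hypothetical and are not taken from ISSEC, whose confidence scores come from a trained detection network.

```python
from typing import List, Tuple

# A candidate region in the contact map (assumed representation):
# (row_start, row_end, col_start, col_end), all bounds inclusive.
Box = Tuple[int, int, int, int]

def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share at least one cell."""
    rows = a[0] <= b[1] and b[0] <= a[1]
    cols = a[2] <= b[3] and b[2] <= a[3]
    return rows and cols

def greedy_select(candidates: List[Tuple[Box, float]],
                  min_score: float = 0.5) -> List[Box]:
    """Visit regions by decreasing score; keep each one that does not
    overlap a region kept earlier. min_score is an assumed cutoff."""
    chosen: List[Box] = []
    for box, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < min_score:
            break  # remaining candidates score even lower
        if all(not overlaps(box, kept) for kept in chosen):
            chosen.append(box)
    return chosen

candidates = [
    ((0, 4, 10, 14), 0.9),
    ((2, 6, 12, 16), 0.8),    # overlaps the first region
    ((20, 24, 30, 34), 0.7),
    ((40, 44, 50, 54), 0.2),  # below the assumed cutoff
]
selected = greedy_select(candidates)
print(selected)
```

Each selected rectangle would then be read as one inferred inter-SSE contact, with its row and column ranges giving the residue spans of the two elements.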
Subjects
Algorithms; Proteins/chemistry; Databases, Protein; Membrane Proteins/chemistry; Protein Conformation, beta-Strand; Protein Structure, Secondary
ABSTRACT
BACKGROUND: Accurate prediction of the inter-residue contacts of a protein is important for calculating its tertiary structure. Analysis of co-evolutionary events among residues has proven effective in inferring inter-residue contacts. The Markov random field (MRF) technique, although widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of the MRF is accurate but time-consuming to calculate; in contrast, approximations to the actual likelihood, such as pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. RESULTS: In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA, which uses pseudo-likelihood, i.e., the product of the conditional probabilities of individual residues, our approach uses composite likelihood, i.e., the product of the conditional probabilities of all residue pairs. Composite likelihood has been theoretically proven to be a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including the PSICOV dataset and the CASP-11 dataset, to show that: (i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy; and (ii) when equipped with a deep learning technique for refinement, the prediction accuracy of clmDCA improves significantly further, suggesting the suitability of clmDCA for subsequent refinement procedures. We further present a successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. CONCLUSIONS: The composite likelihood maximization algorithm can efficiently estimate the parameters of Markov random fields and can improve the prediction accuracy of protein inter-residue contacts.
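The two approximations contrasted in this abstract can be written out explicitly. The notation here is an assumption introduced for illustration: $x = (x_1, \dots, x_L)$ is one aligned sequence of length $L$, $\theta$ denotes the MRF parameters, and $x_{\setminus S}$ denotes all positions except those in $S$.

```latex
\begin{align}
\mathcal{L}_{\mathrm{pseudo}}(\theta)
  &= \prod_{i=1}^{L} P\bigl(x_i \mid x_{\setminus \{i\}};\, \theta\bigr), \\
\mathcal{L}_{\mathrm{composite}}(\theta)
  &= \prod_{1 \le i < j \le L} P\bigl(x_i, x_j \mid x_{\setminus \{i,j\}};\, \theta\bigr).
\end{align}
```

With $q$ possible states per position, normalizing each factor requires a sum over $q$ states for the pseudo-likelihood and over $q^2$ joint states for the pairwise composite likelihood, whereas the partition function of the full likelihood sums over $q^L$ sequences. This is why both approximations remain tractable while the exact likelihood does not.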
Subjects
Deep Learning; Proteins/chemistry; Algorithms; Probability
ABSTRACT
Following publication of the original article [1], the author explained that there are several errors in the original article.
ABSTRACT
Pseudoknots are key structural motifs of RNA, and pseudoknotted RNAs play important roles in a variety of biological processes. Here, we present KnotFold, an accurate approach to the prediction of RNA secondary structure including pseudoknots. The key elements of KnotFold include a learned potential function and a minimum-cost flow algorithm to find the secondary structure with the lowest potential. KnotFold learns the potential from RNAs with known structures using an attention-based neural network, thus avoiding the inaccuracy of hand-crafted energy functions. The specially designed minimum-cost flow algorithm used by KnotFold considers all possible combinations of base pairs and selects the optimal combination from them. The algorithm breaks the restriction to nested base pairs required by the widely used dynamic programming algorithms, thus enabling the identification of pseudoknots. Using 1,009 pseudoknotted RNAs as representatives, we demonstrate the successful application of KnotFold in predicting RNA secondary structures including pseudoknots with accuracy higher than that of the state-of-the-art approaches. We anticipate that KnotFold, with its superior accuracy, will greatly facilitate the understanding of RNA structures and functions.
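The "nested base pairs" restriction mentioned above has a simple combinatorial meaning: two base pairs (i, j) and (k, l) cross when i < k < j < l, and a structure containing any crossing pair is pseudoknotted. The short check below illustrates that definition only; it is not part of KnotFold, and the function name is hypothetical.

```python
from typing import Iterable, Tuple

def is_pseudoknotted(pairs: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalize each pair to (5' index, 3' index) and sort by the 5' index.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False

hairpin = [(0, 9), (1, 8), (2, 7)]     # fully nested stem
side_by_side = [(0, 5), (6, 9)]        # two separate stems, no crossing
knot = [(0, 5), (3, 8)]                # pair (3, 8) crosses pair (0, 5)

print(is_pseudoknotted(hairpin), is_pseudoknotted(side_by_side), is_pseudoknotted(knot))
```

Dynamic programming methods that assume nested structures can only produce outputs like the first two examples, which is the limitation the abstract refers to.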
Subjects
Algorithms; RNA; RNA/genetics; Nucleic Acid Conformation; Base Pairing; Neural Networks, Computer
ABSTRACT
Protein functions are tightly related to the fine details of their 3D structures. To understand protein structures, computational prediction approaches are highly needed. Recently, protein structure prediction has achieved considerable progress, mainly due to the increased accuracy of inter-residue distance estimation and the application of deep learning techniques. Most distance-based ab initio prediction approaches adopt a two-step paradigm: constructing a potential function based on the estimated inter-residue distances, and then building a 3D structure that minimizes the potential function. These approaches have proven very promising; however, they still suffer from several limitations, especially the inaccuracies incurred by the handcrafted potential function. Here, we present SASA-Net, a deep learning-based approach that directly learns protein 3D structure from the estimated inter-residue distances. Unlike existing approaches that simply represent protein structures as coordinates of atoms, SASA-Net represents protein structures using the poses of residues, i.e., the coordinate system of each individual residue in which all backbone atoms of this residue are fixed. The key element of SASA-Net is a spatial-aware self-attention mechanism, which is able to adjust a residue's pose according to all other residues' features and the estimated distances between residues. By iteratively applying the spatial-aware self-attention mechanism, SASA-Net continuously improves the structure and finally acquires a structure with high accuracy. Using the CATH35 proteins as representatives, we demonstrate that SASA-Net is able to accurately and efficiently build structures from the estimated inter-residue distances. The high accuracy and efficiency of SASA-Net enable an end-to-end neural network model for protein structure prediction through combining SASA-Net with a neural network for inter-residue distance prediction.
Source code of SASA-Net is available at https://github.com/gongtiansu/SASA-Net/.
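The notion of a residue pose, a local coordinate system in which the residue's backbone atoms are fixed, can be made concrete with a small sketch. The construction below (origin at the C-alpha atom, axes from Gram-Schmidt orthogonalization of the CA-to-C and CA-to-N directions) is one common convention and is an assumption here; the abstract does not specify which convention SASA-Net uses, and the function names are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]
Frame = Tuple[Vec, Tuple[Vec, Vec, Vec]]  # (origin, three orthonormal axes)

def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a: Vec) -> Vec:
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def residue_frame(n_atom: Vec, ca_atom: Vec, c_atom: Vec) -> Frame:
    """Build a right-handed orthonormal frame centred on the CA atom."""
    e1 = normalize(sub(c_atom, ca_atom))          # first axis along CA -> C
    u = sub(n_atom, ca_atom)
    proj = dot(u, e1)
    # Remove the e1 component of CA -> N, then normalize (Gram-Schmidt).
    e2 = normalize((u[0] - proj * e1[0], u[1] - proj * e1[1], u[2] - proj * e1[2]))
    e3 = cross(e1, e2)                            # completes the frame
    return ca_atom, (e1, e2, e3)

def to_local(point: Vec, frame: Frame) -> Vec:
    """Express a global point in the residue's local coordinates."""
    origin, axes = frame
    d = sub(point, origin)
    return (dot(d, axes[0]), dot(d, axes[1]), dot(d, axes[2]))

# Illustrative (not experimental) coordinates, and the same residue translated.
frame_a = residue_frame((-0.5, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0))
frame_b = residue_frame((0.5, 3.4, 3.0), (1.0, 2.0, 3.0), (2.5, 2.0, 3.0))
local_n_a = to_local((-0.5, 1.4, 0.0), frame_a)
local_n_b = to_local((0.5, 3.4, 3.0), frame_b)
print(local_n_a, local_n_b)
```

In the local frame the backbone atoms have the same coordinates wherever the residue sits in space, so moving or rotating a residue amounts to updating its frame rather than its individual atoms.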
Subjects
Algorithms; Computational Biology; Computational Biology/methods; Proteins/chemistry; Neural Networks, Computer; Software
ABSTRACT
Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.
Subjects
Algorithms; Proteins; Protein Conformation; Proteins/chemistry; Neural Networks, Computer; Protein Folding; Computational Biology/methods
ABSTRACT
Linking cis-regulatory sequences to target genes has been a long-standing challenge. In this study, we introduce CREaTor, an attention-based deep neural network designed to model cis-regulatory patterns for genomic elements up to 2 Mb from target genes. Coupled with a training strategy that predicts gene expression from flanking candidate cis-regulatory elements (cCREs), CREaTor can model cell type-specific cis-regulatory patterns in new cell types without prior knowledge of cCRE-gene interactions or additional training. The zero-shot modeling capability, combined with the use of only RNA-seq and ChIP-seq data, allows for the ready generalization of CREaTor to a broad range of cell types.
Subjects
Neural Networks, Computer; Regulatory Sequences, Nucleic Acid
ABSTRACT
Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach, ProALIGN, that can predict much more accurate sequence-template alignments. Just as protein sequences consist of sequence motifs, protein alignments are composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific, as their characteristic patterns are tightly related to the sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces the inaccuracies of the handcrafted scoring functions widely used by current threading approaches. For a query protein and a template, we apply the neural network to directly infer the likelihoods of all possible residue pairs in their entirety, which can effectively account for the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent data sets with a total of 6688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than the existing methods, including HHpred, CNFpred, CEthreader, and DeepThreader.
These results clearly demonstrate the effectiveness of exploiting the context-specific alignment motifs by deep learning for protein threading.
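The binary-matrix representation of an alignment described above can be illustrated with a small conversion routine. This sketch only shows the representation; the gapped-string input format and the function name are assumptions, and the network that predicts such matrices is not reproduced here.

```python
from typing import List

def alignment_to_matrix(aligned_query: str, aligned_template: str) -> List[List[int]]:
    """Convert a gapped pairwise alignment into a binary matrix M where
    M[i][j] == 1 iff query residue i is aligned to template residue j."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned strings must have equal length")
    q_len = sum(1 for ch in aligned_query if ch != "-")
    t_len = sum(1 for ch in aligned_template if ch != "-")
    matrix = [[0] * t_len for _ in range(q_len)]
    qi = ti = 0  # indices into the ungapped query and template
    for q, t in zip(aligned_query, aligned_template):
        if q != "-" and t != "-":
            matrix[qi][ti] = 1
        if q != "-":
            qi += 1
        if t != "-":
            ti += 1
    return matrix

# Query ACDE vs template AGDE, with C and G left unaligned.
matrix = alignment_to_matrix("AC-DE", "A-GDE")
for row in matrix:
    print(row)
```

Each row and each column contains at most one 1, and the 1-entries move monotonically down and to the right, which is what distinguishes a valid alignment matrix from an arbitrary binary matrix.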
Subjects
Deep Learning; Proteins/chemistry; Sequence Alignment/statistics & numerical data; Algorithms; Amino Acid Motifs; Amino Acid Sequence; Computational Biology; Models, Molecular; Neural Networks, Computer; Protein Conformation; Proteins/genetics; Sequence Analysis, Protein/statistics & numerical data; Software
ABSTRACT
Residue co-evolution has become the primary principle for estimating inter-residue distances of a protein, which are crucially important for predicting protein structure. Most existing approaches adopt an indirect strategy, i.e., inferring residue co-evolution based on some hand-crafted features, say, a covariance matrix, calculated from multiple sequence alignment (MSA) of target protein. This indirect strategy, however, cannot fully exploit the information carried by MSA. Here, we report an end-to-end deep neural network, CopulaNet, to estimate residue co-evolution directly from MSA. The key elements of CopulaNet include: (i) an encoder to model context-specific mutation for each residue; (ii) an aggregator to model residue co-evolution, and thereafter estimate inter-residue distances. Using CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrate that CopulaNet can predict protein structure with improved accuracy and efficiency. This study represents a step toward improved end-to-end prediction of inter-residue distances and protein tertiary structures.
Subjects
Machine Learning; Proteins/chemistry; Sequence Alignment; Caspases/chemistry; Computational Biology; Humans; Models, Molecular; Mutation; Neural Networks, Computer; Protein Structure, Tertiary; Proteins/genetics
ABSTRACT
MOTIVATION: Glycans are large molecules with specific tree structures. Glycans play important roles in a great variety of biological processes. These roles are primarily determined by the fine details of their structures, making glycan structural identification highly desirable. Mass spectrometry (MS) has become the major technology for the elucidation of glycan structures. Most de novo approaches to glycan structural identification from mass spectra fall into three categories: enumeration followed by filtering, heuristic approaches, and dynamic programming-based approaches. The first suffers from low efficiency, while the latter two may miss the actual glycan structures. Thus, how to reliably and efficiently identify glycan structures from mass spectra remains challenging. RESULTS: In this study, we propose an efficient and reliable approach to glycan structure identification using a tree merging strategy. Briefly, for each MS peak, our approach first calculated the monosaccharide composition of its corresponding fragment ion, and then built a constraint that forces these monosaccharides to be directly connected in the underlying glycan tree structure. According to these connection constraints, we next merged the constituent monosaccharides of the glycan into a complete structure step by step. During this process, the intermediate structures were represented as subtrees, which were merged iteratively until a complete tree structure was generated. Finally, the generated complete structures were ranked according to their compatibility with the input mass spectra. Unlike the traditional enumeration followed by filtering strategy, our approach performed deisomorphism to remove isomorphic subtrees and ruled out invalid structures that violate the connection constraints at each tree merging step, thus significantly increasing efficiency.
In addition, all complete structures satisfying the connection constraints were enumerated without missing any structure. Over a test set of 10 N-glycan standards, our approach accomplished structural identification in minutes and ranked the manually validated structure among the three highest-scoring candidates. We further successfully applied our approach to the profiling and subsequent structure assignment of glycans released from glycoprotein mAb, which was in perfect agreement with previous studies and CE analysis.
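The deisomorphism step, removing subtrees that are the same structure written in a different child order, can be sketched with a canonical string encoding of rooted labelled trees. This is a generic technique shown under stated assumptions: trees are `(label, children)` tuples, linkage positions are ignored so that sibling order carries no meaning, and the function names are hypothetical. The abstract does not describe how the original method implements this step.

```python
from typing import List, Tuple

# A rooted tree as (monosaccharide label, list of child subtrees).
# Assumption: linkage positions are ignored, so sibling order is irrelevant.
Tree = Tuple[str, list]

def canonical(tree: Tree) -> str:
    """Canonical string: label followed by the sorted canonical forms of
    the children. Isomorphic trees produce identical strings."""
    label, children = tree
    return label + "(" + "".join(sorted(canonical(c) for c in children)) + ")"

def deduplicate(trees: List[Tree]) -> List[Tree]:
    """Keep the first representative of each isomorphism class."""
    seen = set()
    unique = []
    for tree in trees:
        key = canonical(tree)
        if key not in seen:
            seen.add(key)
            unique.append(tree)
    return unique

t1 = ("HexNAc", [("Hex", []), ("Fuc", [])])
t2 = ("HexNAc", [("Fuc", []), ("Hex", [])])   # same as t1, children swapped
t3 = ("HexNAc", [("Hex", [("Fuc", [])])])     # different topology

unique = deduplicate([t1, t2, t3])
print(canonical(t1), len(unique))
```

Applying such a filter after every merging step keeps the set of intermediate subtrees from growing with redundant copies, which is the efficiency gain the abstract attributes to deisomorphism.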