Results 1 - 18 of 18
1.
Nat Rev Mol Cell Biol ; 23(1): 40-55, 2022 01.
Article in English | MEDLINE | ID: mdl-34518686

ABSTRACT

The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.


Subject(s)
Biology , Machine Learning , Animals , Deep Learning , Humans , Neural Networks, Computer
2.
Nucleic Acids Res ; 52(W1): W287-W293, 2024 Jul 05.
Article in English | MEDLINE | ID: mdl-38747351

ABSTRACT

The PSIPRED Workbench is a long-established and popular bioinformatics web service offering a wide range of machine learning-based analyses for characterizing protein structure and function. In this paper we provide an update of the recent additions and developments to the webserver, with a focus on new deep learning-based methods. We briefly discuss some trends in server usage since the publication of AlphaFold2 and we give an overview of some upcoming developments for the service. The PSIPRED Workbench is available at http://bioinf.cs.ucl.ac.uk/psipred.


Subject(s)
Deep Learning , Proteins , Software , Proteins/chemistry , Proteins/genetics , Internet , Protein Conformation , Computational Biology/methods , Sequence Analysis, Protein/methods
3.
Proc Natl Acad Sci U S A ; 119(4)2022 01 25.
Article in English | MEDLINE | ID: mdl-35074909

ABSTRACT

Deep learning-based prediction of protein structure usually begins by constructing a multiple sequence alignment (MSA) containing homologs of the target protein. The most successful approaches combine large feature sets derived from MSAs, and considerable computational effort is spent deriving these input features. We present a method that greatly reduces the amount of preprocessing required for a target MSA, while producing main chain coordinates as a direct output of a deep neural network. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run, and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy as compared to our original DMPfold method, while requiring less than a second to produce a typical model. This level of accuracy and speed allows very large-scale three-dimensional modeling of proteins on minimal hardware, and we demonstrate this by producing models for over 1.3 million uncharacterized regions of proteins extracted from the BFD sequence clusters. After constructing an initial set of approximate models, we select a confident subset of over 30,000 models for further refinement and analysis, revealing putative novel protein folds. We also provide updated models for over 5,000 Pfam families studied in the original DMPfold paper.


Subject(s)
Models, Molecular , Protein Conformation , Software , Algorithms , Caspases/chemistry , Computational Biology , Databases, Protein , Deep Learning , High-Throughput Screening Assays , Proteins/chemistry
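
The abstract of entry 3 describes starting from a one-hot encoding of the sequences in an MSA. As a rough illustration only (not the authors' implementation), the sketch below encodes a toy alignment using the Python standard library. The 21-symbol alphabet, the handling of unknown symbols, and the names `ALPHABET`, `INDEX`, `one_hot_msa` and `toy_msa` are assumptions made for this example.

```python
# Assumed alphabet: 20 standard amino acids plus a gap symbol (21 channels).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_msa(msa):
    """Encode aligned sequences as nested lists shaped [n_seqs][n_cols][21]."""
    width = len(msa[0])
    if any(len(seq) != width for seq in msa):
        raise ValueError("all sequences in an alignment must have equal length")
    encoded = []
    for seq in msa:
        rows = []
        for aa in seq.upper():
            vec = [0] * len(ALPHABET)
            # Assumption: unknown symbols (e.g. X, B, Z) share the gap channel.
            vec[INDEX.get(aa, INDEX["-"])] = 1
            rows.append(vec)
        encoded.append(rows)
    return encoded


# Toy alignment, invented for illustration.
toy_msa = ["MKV-L", "MRVAL", "MKIAL"]
encoded = one_hot_msa(toy_msa)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

Each alignment position becomes a vector with exactly one non-zero entry, so the encoded alignment carries no hand-derived features; anything further would have to be learned by the network that consumes it.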
4.
Proteins ; 87(12): 1092-1099, 2019 12.
Article in English | MEDLINE | ID: mdl-31298436

ABSTRACT

In this article, we describe our efforts in contact prediction in the CASP13 experiment. We employed a new deep learning-based contact prediction tool, DeepMetaPSICOV (or DMP for short), together with new methods and data sources for alignment generation. DMP evolved from MetaPSICOV and DeepCov and combines the input feature sets used by these methods as input to a deep, fully convolutional residual neural network. We also improved our method for multiple sequence alignment generation and included metagenomic sequences in the search. We discuss successes and failures of our approach and identify areas where further improvements may be possible. DMP is freely available at: https://github.com/psipred/DeepMetaPSICOV.


Subject(s)
Computational Biology , Protein Conformation , Proteins/ultrastructure , Algorithms , Amino Acid Sequence/genetics , Deep Learning , Machine Learning , Metagenome/genetics , Neural Networks, Computer , Proteins/chemistry , Proteins/genetics , Sequence Analysis, Protein
5.
Proteins ; 87(12): 1179-1189, 2019 12.
Article in English | MEDLINE | ID: mdl-31589782

ABSTRACT

Although many structural bioinformatics tools have been using neural network models for a long time, deep neural network (DNN) models have attracted considerable interest in recent years. Methods employing DNNs have had a significant impact in recent CASP experiments, notably in CASP12 and especially CASP13. In this article, we offer a brief introduction to some of the key principles and properties of DNN models and discuss why they are naturally suited to certain problems in structural bioinformatics. We also briefly discuss methodological improvements that have enabled these successes. Using the contact prediction task as an example, we also speculate why DNN models are able to produce reasonably accurate predictions even in the absence of many homologues for a given target sequence, a result that can at first glance appear surprising given the lack of input information. We end on some thoughts about how and why these types of models can be so effective, as well as a discussion on potential pitfalls.


Subject(s)
Computational Biology , Deep Learning , Protein Conformation , Models, Molecular , Neural Networks, Computer , Proteins/chemistry , Proteins/genetics , Proteins/ultrastructure , Structural Homology, Protein
6.
Bioinformatics ; 34(19): 3308-3315, 2018 10 01.
Article in English | MEDLINE | ID: mdl-29718112

ABSTRACT

Motivation: In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Results: Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. Availability and implementation: DeepCov is freely available at https://github.com/psipred/DeepCov. 
Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Neural Networks, Computer , Protein Interaction Domains and Motifs , Proteins/chemistry , Amino Acid Sequence , Computational Biology , Sequence Alignment
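
The abstract of entry 6 refers to amino-acid pair frequency and covariance data calculated directly from raw alignments. A minimal sketch of the textbook definition, cov(a, b) = f_ij(a, b) - f_i(a) f_j(b), is given below for one pair of columns. It is not taken from the DeepCov code: it omits sequence weighting and pseudocounts, and the function names and toy alignment are assumptions made for this example.

```python
from collections import Counter


def column_frequencies(msa, i):
    """Relative frequency of each symbol in alignment column i."""
    n = len(msa)
    counts = Counter(seq[i] for seq in msa)
    return {a: c / n for a, c in counts.items()}


def pair_frequencies(msa, i, j):
    """Relative frequency of each symbol pair observed in columns i and j."""
    n = len(msa)
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {ab: c / n for ab, c in counts.items()}


def pair_covariance(msa, i, j):
    """cov[(a, b)] = f_ij(a, b) - f_i(a) * f_j(b) over symbols seen in each column."""
    fi = column_frequencies(msa, i)
    fj = column_frequencies(msa, j)
    fij = pair_frequencies(msa, i, j)
    return {(a, b): fij.get((a, b), 0.0) - fi[a] * fj[b] for a in fi for b in fj}


# Toy alignment, invented for illustration: columns 0 and 1 co-vary, column 2 is conserved.
toy_msa = ["AKL", "AKL", "GEL", "GEL"]
cov_01 = pair_covariance(toy_msa, 0, 1)
cov_02 = pair_covariance(toy_msa, 0, 2)
print(cov_01[("A", "K")], cov_02[("A", "L")])
```

In the toy alignment the co-varying columns give non-zero covariance, while pairing with the fully conserved column gives zero, which is the kind of local pairwise signal the abstract says a convolutional network can operate on.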
7.
Proteins ; 84(4): 411-26, 2016 Apr.
Article in English | MEDLINE | ID: mdl-26799916

ABSTRACT

Energy functions, fragment libraries, and search methods constitute three key components of fragment-assembly methods for protein structure prediction, which are all crucial for their ability to generate high-accuracy predictions. All of these components are tightly coupled; efficient searching becomes more important as the quality of fragment libraries decreases. Given these relationships, there is currently a poor understanding of the strengths and weaknesses of the sampling approaches currently used in fragment-assembly techniques. Here, we determine how the performance of search techniques can be assessed in a meaningful manner, given the above problems. We describe a set of techniques that aim to reduce the impact of the energy function, and assess exploration in view of the search space defined by a given fragment library. We illustrate our approach using Rosetta and EdaFold, and show how certain features of these methods encourage or limit conformational exploration. We demonstrate that individual trajectories of Rosetta are susceptible to local minima in the energy landscape, and that this can be linked to non-uniform sampling across the protein chain. We show that EdaFold's novel approach can help balance broad exploration with locating good low-energy conformations. This occurs through two mechanisms which cannot be readily differentiated using standard performance measures: exclusion of false minima, followed by an increasingly focused search in low-energy regions of conformational space. Measures such as ours can be helpful in characterizing new fragment-based methods in terms of the quality of conformational exploration realized.


Subject(s)
Algorithms , Gene Library , Peptide Fragments/chemistry , Computer Simulation , Models, Molecular , Peptide Fragments/genetics , Protein Conformation , Protein Folding , Thermodynamics
8.
Evol Comput ; 24(4): 577-607, 2016.
Article in English | MEDLINE | ID: mdl-26908350

ABSTRACT

Computational approaches to de novo protein tertiary structure prediction, including those based on the preeminent "fragment-assembly" technique, have failed to scale up fully to larger proteins (on the order of 100 residues and above). A number of limiting factors are thought to contribute to the scaling problem over and above the simple combinatorial explosion, but the key ones relate to the lack of exploration of properly diverse protein folds, and to an acute form of "deception" in the energy function, whereby low-energy conformations do not reliably equate with native structures. In this article, solutions to both of these problems are investigated through a multistage memetic algorithm incorporating the successful Rosetta method as a local search routine. We found that specialised genetic operators significantly add to structural diversity and that this translates well to reaching low energies. The use of a generalised stochastic ranking procedure for selection enables the memetic algorithm to handle and traverse deep energy wells that can be considered deceptive, which further adds to the ability of the algorithm to obtain a much-improved diversity of folds. The results should translate to a tangible improvement in the performance of protein structure prediction algorithms in blind experiments such as CASP, and potentially to a further step towards the more challenging problem of predicting the three-dimensional shape of large proteins.


Subject(s)
Algorithms , Proteins/chemistry , Computational Biology , Evolution, Molecular , Molecular Dynamics Simulation , Peptide Fragments/chemistry , Peptide Fragments/genetics , Protein Conformation , Protein Structure, Secondary , Protein Structure, Tertiary , Proteins/genetics , Stochastic Processes
9.
Phys Chem Chem Phys ; 16(6): 2256-9, 2014 Feb 14.
Article in English | MEDLINE | ID: mdl-24394921

ABSTRACT

A combination of the temperature- and pressure-dependencies of the kinetic isotope effect on the proton coupled electron transfer during ascorbate oxidation by ferricyanide suggests that this reference reaction may exploit vibrationally assisted quantum tunnelling of the transferred proton.


Subject(s)
Ascorbic Acid/chemistry , Ferricyanides/chemistry , Protons , Electron Transport , Kinetics , Oxidation-Reduction , Pressure , Temperature
10.
J Comput Chem ; 34(21): 1850-61, 2013 Aug 05.
Article in English | MEDLINE | ID: mdl-23720381

ABSTRACT

We propose a generic method to model polarization in the context of high-rank multipolar electrostatics. This method involves the machine learning technique kriging, here used to capture the response of an atomic multipole moment of a given atom to a change in the positions of the atoms surrounding this atom. The atoms are malleable boxes with sharp boundaries; they do not overlap and exhaust space. The method is applied to histidine where it is able to predict atomic multipole moments (up to hexadecapole) for unseen configurations, after training on 600 geometries distorted using normal modes of each of its 24 local energy minima at B3LYP/apc-1 level. The quality of the predictions is assessed by calculating the Coulomb energy between an atom for which the moments have been predicted and the surrounding atoms (having exact moments). Only interactions between atoms separated by three or more bonds ("1, 4 and higher" interactions) are included in this energy error. This energy is compared with that of a central atom with exact multipole moments interacting with the same environment. The resulting energy discrepancies are summed for 328 atom-atom interactions, for each of the 29 atoms of histidine being a central atom in turn. For 80% of the 539 test configurations (outside the training set), this summed energy deviates by less than 1 kcal mol⁻¹.


Subject(s)
Histidine/chemistry , Models, Chemical , Peptides/chemistry , Molecular Conformation , Static Electricity
11.
Curr Opin Struct Biol ; 81: 102627, 2023 08.
Article in English | MEDLINE | ID: mdl-37320955

ABSTRACT

Recent breakthroughs in protein structure prediction have increasingly relied on the use of deep neural networks. These recent methods are notable in that they produce 3-D atomic coordinates as a direct output of the networks, a feature which presents many advantages. Although most techniques of this type make use of multiple sequence alignments as their primary input, a new wave of methods have attempted to use just single sequences as the input. We discuss the make-up and operating principles of these models, and highlight new developments in these areas, as well as areas for future development.


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Neural Networks, Computer , Sequence Alignment
12.
Nat Commun ; 14(1): 8445, 2023 Dec 19.
Article in English | MEDLINE | ID: mdl-38114456

ABSTRACT

The AlphaFold Protein Structure Database, containing predictions for over 200 million proteins, has been met with enthusiasm over its potential in enriching structural biological research and beyond. Currently, access to the database is precluded by an urgent need for tools that allow the efficient traversal, discovery, and documentation of its contents. Identifying domain regions in the database is a non-trivial endeavour and doing so will aid our understanding of protein structure and function, while facilitating drug discovery and comparative genomics. Here, we describe a deep learning method for domain segmentation called Merizo, which learns to cluster residues into domains in a bottom-up manner. Merizo is trained on CATH domains and fine-tuned on AlphaFold2 models via self-distillation, enabling it to be applied to both experimental and AlphaFold2 models. As proof of concept, we apply Merizo to the human proteome, identifying 40,818 putative domains that can be matched to CATH representative domains.


Subject(s)
Genomics , Proteins , Humans , Protein Domains , Protein Structure, Tertiary , Proteins/genetics , Proteins/chemistry , Databases, Protein
13.
Nat Commun ; 10(1): 3977, 2019 09 04.
Article in English | MEDLINE | ID: mdl-31484923

ABSTRACT

The inapplicability of amino acid covariation methods to small protein families has limited their use for structural annotation of whole genomes. Recently, deep learning has shown promise in allowing accurate residue-residue contact prediction even for shallow sequence alignments. Here we introduce DMPfold, which uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces more accurate models than two popular methods for a test set of CASP12 domains, and works just as well for transmembrane proteins. Applied to all Pfam domains without known structures, confident models for 25% of these so-called dark families were produced in under a week on a small 200 core cluster. DMPfold provides models for 16% of human proteome UniProt entries without structures, generates accurate models with fewer than 100 sequences in some cases, and is freely available.


Subject(s)
Computational Biology/methods , Deep Learning , Models, Molecular , Protein Conformation , Proteome/chemistry , Proteomics/methods , Algorithms , Animals , Binding Sites/genetics , Humans , Proteome/genetics , Proteome/metabolism , Reproducibility of Results
14.
Biomolecules ; 9(10)2019 10 15.
Article in English | MEDLINE | ID: mdl-31618996

ABSTRACT

Our previous work with fragment-assembly methods has demonstrated specific deficiencies in conformational sampling behaviour that, when addressed through improved sampling algorithms, can lead to more reliable prediction of tertiary protein structure when good fragments are available, and when score values can be relied upon to guide the search to the native basin. In this paper, we present preliminary investigations into two important questions arising from more difficult prediction problems. First, we investigated the extent to which native-like conformational states are generated during multiple runs of our search protocols. We determined that, in cases of difficult prediction, native-like decoys are rarely or never generated. Second, we developed a scheme for decoy retention that balances the objectives of retaining low-scoring structures and retaining conformationally diverse structures sampled during the course of the search. Our method succeeds at retaining more diverse sets of structures, and, for a few targets, more native-like solutions are retained as compared to our original, energy-based retention scheme. However, in general, we found that the rate at which native-like structural states are generated has a much stronger effect on eventual distributions of predictive accuracy in the decoy sets, as compared to the specific decoy retention strategy used. We found that our protocols show differences in their ability to access native-like states for some targets, and this may explain some of the differences in predictive performance seen between these methods. There appears to be an interaction between fragment sets and move operators, which influences the accessibility of native-like structures for given targets. Our results point to clear directions for further improvements in fragment-based methods, which are likely to enable higher accuracy predictions.


Subject(s)
Proteins/chemistry , Algorithms , Protein Conformation , Thermodynamics
15.
Sci Rep ; 9(1): 7083, 2019 05 08.
Article in English | MEDLINE | ID: mdl-31068650

ABSTRACT

RAS genotyping is mandatory to predict resistance to anti-EGFR monoclonal antibody (mAb) therapy, and BRAF genotyping is a relevant prognostic marker in patients with metastatic colorectal cancer. Although the role of hotspot mutations is well defined, the impact of uncommon mutations is still unknown. In this study, we aimed to discuss the potential utility of detecting uncommon RAS and BRAF mutation profiles with next-generation sequencing. A total of 779 FFPE samples from patients with metastatic colorectal cancer with valid NGS results were screened and 22 uncommon mutational profiles of KRAS, NRAS and BRAF genes were selected. In silico prediction of mutation impact was then assessed by two predictive scores and structural protein modelling. Three samples carried a single KRAS non-hotspot mutation, one a single NRAS non-hotspot mutation, four a single BRAF non-hotspot mutation and fourteen carried several mutations. This in silico study shows that some non-hotspot RAS mutations seem to behave like hotspot mutations and warrant further examination to assess whether they confer resistance to anti-EGFR mAb therapy in patients bearing them. For the BRAF gene, non-V600E mutations may characterise a novel subtype of mCRC with better prognosis, potentially implying a modification of therapeutic strategy.


Subject(s)
Colorectal Neoplasms/genetics , Diagnostic Tests, Routine/methods , Genotype , Genotyping Techniques/methods , High-Throughput Nucleotide Sequencing/methods , Mutation , Antineoplastic Agents, Immunological/pharmacology , Antineoplastic Agents, Immunological/therapeutic use , Colorectal Neoplasms/drug therapy , Computer Simulation , Drug Resistance, Neoplasm/genetics , ErbB Receptors/antagonists & inhibitors , GTP Phosphohydrolases/genetics , Humans , Membrane Proteins/genetics , Neoplasm Metastasis/genetics , Polymorphism, Single Nucleotide , Proto-Oncogene Proteins B-raf/genetics , Proto-Oncogene Proteins p21(ras)/genetics , Retrospective Studies
16.
Sci Rep ; 8(1): 13694, 2018 09 12.
Article in English | MEDLINE | ID: mdl-30209258

ABSTRACT

Difficulty in sampling large and complex conformational spaces remains a key limitation in fragment-based de novo prediction of protein structure. Our previous work has shown that even for small-to-medium-sized proteins, some current methods inadequately sample alternative structures. We have developed two new conformational sampling techniques, one employing a bilevel optimisation framework and the other employing iterated local search. We combine strategies of forced structural perturbation (where some fragment insertions are accepted regardless of their impact on scores) and greedy local optimisation, allowing greater exploration of the available conformational space. Comparisons against the Rosetta Abinitio method indicate that our protocols more frequently generate native-like predictions for many targets, even following the low-resolution phase, using a given set of fragment libraries. By contrasting results across two different fragment sets, we show that our methods are able to better take advantage of high-quality fragments. These improvements can also translate into more reliable identification of near-native structures in a simple clustering-based model selection procedure. We show that when fragment libraries are sufficiently well-constructed, improved breadth of exploration within runs improves prediction accuracy. Our results also suggest that in benchmarking scenarios, a total exclusion of fragments drawn from homologous templates can make performance differences between methods appear less pronounced.


Subject(s)
Peptide Fragments/chemistry , Proteins/chemistry , Benchmarking/methods , Cluster Analysis , Computer Simulation , Heuristics , Models, Molecular , Protein Conformation
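
The abstract of entry 16 combines forced perturbation (some moves accepted regardless of score) with greedy local optimisation inside an iterated local search. The toy sketch below shows only that general control flow. The integer "fragment choice" state, the `score` function, and all parameter values are hypothetical stand-ins, not the authors' protocol or Rosetta's scoring.

```python
import random


def score(state):
    # Hypothetical energy-like score (lower is better); a stand-in for a real scoring function.
    return sum((x - 3) ** 2 + (4 if x % 2 else 0) for x in state)


def greedy_local_search(state, rng, n_choices, steps=200):
    """Propose single-position changes and keep only those that do not worsen the score."""
    state = list(state)
    best = score(state)
    for _ in range(steps):
        pos = rng.randrange(len(state))
        old = state[pos]
        state[pos] = rng.randrange(n_choices)
        new = score(state)
        if new <= best:
            best = new
        else:
            state[pos] = old  # reject the move and restore the previous choice
    return state, best


def iterated_local_search(n_positions=8, n_choices=10, rounds=20, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(n_choices) for _ in range(n_positions)]
    current, current_score = greedy_local_search(current, rng, n_choices)
    best, best_score = list(current), current_score
    for _ in range(rounds):
        # Forced perturbation: two positions are changed and accepted regardless of score.
        perturbed = list(current)
        for pos in rng.sample(range(n_positions), 2):
            perturbed[pos] = rng.randrange(n_choices)
        # Greedy local optimisation restarts from the perturbed state.
        current, current_score = greedy_local_search(perturbed, rng, n_choices)
        if current_score < best_score:
            best, best_score = list(current), current_score
    return best, best_score


best, best_score = iterated_local_search()
print(best, best_score)
```

The search always continues from the perturbed and re-optimised state while the best state seen so far is tracked separately, so exploration is never blocked by the current best score.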
17.
Virus Evol ; 3(2): vex019, 2017 Jul.
Article in English | MEDLINE | ID: mdl-28852572

ABSTRACT

Despite the use of combination antiretroviral drugs for the treatment of HIV-1 infection, the emergence of drug resistance remains a problem. Resistance may be conferred either by a single mutation or a concerted set of mutations. The involvement of multiple mutations can arise due to interactions between sites in the amino acid sequence as a consequence of the need to maintain protein structure. To better understand the nature of such epistatic interactions, we reconstructed the ancestral sequences of HIV-1's Pol protein, and traced the evolutionary trajectories leading to mutations associated with drug resistance. Using contemporary and ancestral sequences we modelled the effects of mutations (i.e. amino acid replacements) on protein structure to understand the functional effects of residue changes. Although the majority of resistance-associated sequences tend to destabilise the protein structure, we find there is a general tendency for protein stability to decrease across HIV-1's evolutionary history. That a similar pattern is observed in the non-drug resistance lineages indicates that non-resistance mutations, for example those associated with escape from the immune response, also impact protein stability. Maintenance of optimal protein structure therefore represents a major constraining factor to the evolution of HIV-1.

18.
Spectrochim Acta A Mol Biomol Spectrosc ; 136 Pt A: 32-41, 2015 Feb 05.
Article in English | MEDLINE | ID: mdl-24274986

ABSTRACT

As intermolecular interactions such as the hydrogen bond are electrostatic in origin, rigorous treatment of this term within force field methodologies should be mandatory. We present a method capable of accurately reproducing such interactions for seven van der Waals complexes. It uses atomic multipole moments up to the hexadecapole moment, mapped to the positions of the nuclear coordinates by the machine learning method kriging. Models were built at three levels of theory: HF/6-31G**, B3LYP/aug-cc-pVDZ and M06-2X/aug-cc-pVDZ. The quality of the kriging models was measured by their ability to predict the electrostatic interaction energy between atoms in external test examples for which the true energies are known. At all levels of theory, >90% of test cases for small van der Waals complexes were predicted within 1 kJ mol⁻¹, decreasing to 60-70% of test cases for larger base pair complexes. Models built on moments obtained at B3LYP and M06-2X level generally outperformed those at HF level. For all systems the individual interactions were predicted with a mean unsigned error of less than 1 kJ mol⁻¹.


Subject(s)
Hydrogen Bonding , Models, Chemical , Models, Molecular , Static Electricity , Ammonia/chemistry , Artificial Intelligence , Water/chemistry