Machine learning for functional protein design.

Notin, Pascal; Rollins, Nathan; Gal, Yarin; Sander, Chris; Marks, Debora.

Nat Biotechnol ; 42(2): 216-228, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38361074

ABSTRACT

Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.

Subject(s)

Machine Learning, Proteins, Biotechnology, Amino Acid Sequence, Antibodies

ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers.

Notin, Pascal; Marks, Debora S; Weitzman, Ruben; Gal, Yarin.

bioRxiv ; 2023 Dec 07.

Article in English | MEDLINE | ID: mdl-38106034

ABSTRACT

Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact that most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.

ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction.

Notin, Pascal; Kollasch, Aaron W; Ritter, Daniel; van Niekerk, Lood; Paul, Steffanie; Spinner, Hansen; Rollins, Nathan; Shaw, Ada; Weitzman, Ruben; Frazer, Jonathan; Dias, Mafalda; Franceschi, Dinko; Orenbuch, Rose; Gal, Yarin; Marks, Debora S.

bioRxiv ; 2023 Dec 08.

Article in English | MEDLINE | ID: mdl-38106144

ABSTRACT

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) within a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures and model predictions, and develop a user-friendly website that facilitates data access and analysis.

Learning from prepandemic data to forecast viral escape.

Thadani, Nicole N; Gurev, Sarah; Notin, Pascal; Youssef, Noor; Rollins, Nathan J; Ritter, Daniel; Sander, Chris; Gal, Yarin; Marks, Debora S.

Nature ; 622(7984): 818-825, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37821700

ABSTRACT

Effective pandemic preparedness relies on anticipating viral mutations that are able to evade host immune responses to facilitate vaccine and therapeutic design. However, current strategies for viral evolution prediction are not available early in a pandemic: experimental approaches require host polyclonal antibodies to test against [1-16], and existing computational methods draw heavily from current strain prevalence to make reliable predictions of variants of concern [17-19]. To address this, we developed EVEscape, a generalizable modular framework that combines fitness predictions from a deep learning model of historical sequences with biophysical and structural information. EVEscape quantifies the viral escape potential of mutations at scale and has the advantage of being applicable before surveillance sequencing, experimental scans or three-dimensional structures of antibody complexes are available. We demonstrate that EVEscape, trained on sequences available before 2020, is as accurate as high-throughput experimental scans at anticipating pandemic variation for SARS-CoV-2 and is generalizable to other viruses including influenza, HIV and understudied viruses with pandemic potential such as Lassa and Nipah. We provide continually revised escape scores for all current strains of SARS-CoV-2 and predict probable further mutations to forecast emerging strains as a tool for continuing vaccine development (evescape.org).

Subject(s)

Evolution, Molecular; Forecasting; Immune Evasion; Mutation; Pandemics; Viruses; Humans; Drug Design; HIV Infections; Immune Evasion/genetics; Immune Evasion/immunology; Influenza, Human; Lassa virus; Nipah Virus; SARS-CoV-2/genetics; SARS-CoV-2/immunology; Viral Vaccines/immunology; Viruses/genetics; Viruses/immunology

Mixtures of large-scale dynamic functional brain network modes.

Gohil, Chetan; Roberts, Evan; Timms, Ryan; Skates, Alex; Higgins, Cameron; Quinn, Andrew; Pervaiz, Usama; van Amersfoort, Joost; Notin, Pascal; Gal, Yarin; Adaszewski, Stanislaw; Woolrich, Mark.

Neuroimage ; 263: 119595, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36041643

ABSTRACT

Accurate temporal modelling of functional brain networks is essential in the quest for understanding how such networks facilitate cognition. Researchers are beginning to adopt time-varying analyses for electrophysiological data that capture highly dynamic processes on the order of milliseconds. Typically, these approaches, such as clustering of functional connectivity profiles and Hidden Markov Modelling (HMM), assume mutual exclusivity of networks over time. Whilst a powerful constraint, this assumption may be compromising the ability of these approaches to describe the data effectively. Here, we propose a new generative model for functional connectivity as a time-varying linear mixture of spatially distributed statistical "modes". The temporal evolution of this mixture is governed by a recurrent neural network, which enables the model to generate data with a rich temporal structure. We use a Bayesian framework known as amortised variational inference to learn model parameters from observed data. We call the approach DyNeMo (for Dynamic Network Modes), and show using simulations it outperforms the HMM when the assumption of mutual exclusivity is violated. In resting-state MEG, DyNeMo reveals a mixture of modes that activate on fast time scales of 100-150 ms, which is similar to state lifetimes found using an HMM. In task MEG data, DyNeMo finds modes with plausible, task-dependent evoked responses without any knowledge of the task timings. Overall, DyNeMo provides decompositions that are an approximate remapping of the HMM's while showing improvements in overall explanatory power. However, the magnitude of the improvements suggests that the HMM's assumption of mutual exclusivity can be reasonable in practice. Nonetheless, DyNeMo provides a flexible framework for implementing and assessing future modelling developments.

Subject(s)

Magnetic Resonance Imaging, Nerve Net, Humans, Bayes Theorem, Nerve Net/diagnostic imaging, Nerve Net/physiology, Brain/diagnostic imaging, Brain/physiology, Cognition
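
The DyNeMo abstract above describes functional connectivity as a time-varying linear mixture of "modes". A minimal sketch of that one idea follows. It is not the authors' implementation: it assumes each mode is a covariance matrix and that the mixing coefficients come from a softmax over per-time-point logits, and it uses random logits as a stand-in for the recurrent neural network. The names `softmax`, `mixture_covariances`, `mode_covs` and `logits` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mixture_covariances(logits, mode_covs):
    """Mix J mode covariances into one covariance per time point.

    logits:    (T, J) unnormalised mixing scores (assumed; random here,
               produced by a recurrent network in the described model)
    mode_covs: (J, N, N) one covariance matrix per mode (assumed form)
    returns:   alpha (T, J) and covariances (T, N, N)
    """
    alpha = softmax(logits, axis=1)  # non-negative, sums to 1 per time point
    covs = np.einsum("tj,jnm->tnm", alpha, mode_covs)
    return alpha, covs


rng = np.random.default_rng(0)
T, J, N = 5, 3, 4  # time points, modes, channels (toy sizes)

# Build symmetric positive-definite toy mode covariances.
A = rng.normal(size=(J, N, N))
mode_covs = A @ A.transpose(0, 2, 1) + 0.1 * np.eye(N)

logits = rng.normal(size=(T, J))
alpha, covs = mixture_covariances(logits, mode_covs)
```

Because the coefficients are non-negative and sum to one, several modes can be active at the same time point, which is the contrast the abstract draws with the mutually exclusive states of an HMM.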

Publisher Correction: Disease variant prediction with deep generative models of evolutionary data.

Frazer, Jonathan; Notin, Pascal; Dias, Mafalda; Gomez, Aidan; Min, Joseph K; Brock, Kelly; Gal, Yarin; Marks, Debora S.

Nature ; 601(7892): E7, 2022 Jan.

Article in English | MEDLINE | ID: mdl-34921310

Disease variant prediction with deep generative models of evolutionary data.

Frazer, Jonathan; Notin, Pascal; Dias, Mafalda; Gomez, Aidan; Min, Joseph K; Brock, Kelly; Gal, Yarin; Marks, Debora S.

Nature ; 599(7883): 91-95, 2021 Nov.

Article in English | MEDLINE | ID: mdl-34707284

ABSTRACT

Quantifying the pathogenicity of protein variants in human disease-related genes would have a marked effect on clinical decisions, yet the overwhelming majority (over 98%) of these variants still have unknown consequences [1-3]. In principle, computational methods could support the large-scale interpretation of genetic variants. However, state-of-the-art methods [4-10] have relied on training machine learning models on known disease labels. As these labels are sparse, biased and of variable quality, the resulting models have been considered insufficiently reliable [11]. Here we propose an approach that leverages deep generative models to predict variant pathogenicity without relying on labels. By modelling the distribution of sequence variation across organisms, we implicitly capture constraints on the protein sequences that maintain fitness. Our model EVE (evolutionary model of variant effect) not only outperforms computational approaches that rely on labelled data but also performs on par with, if not better than, predictions from high-throughput experiments, which are increasingly used as evidence for variant classification [12-16]. We predict the pathogenicity of more than 36 million variants across 3,219 disease genes and provide evidence for the classification of more than 256,000 variants of unknown significance. Our work suggests that models of evolutionary information can provide valuable independent evidence for variant interpretation that will be widely useful in research and clinical settings.

Subject(s)

Disease/genetics; Evolution, Molecular; Genetic Fitness/genetics; Genetic Variation; Proteins/genetics; Selection, Genetic; Unsupervised Machine Learning; Bayes Theorem; Biological Assay; Genetic Predisposition to Disease/genetics; Humans; Models, Molecular; Phenotype; Proteins/metabolism
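
The EVE abstract above rests on scoring variants from the distribution of sequence variation across organisms, with no disease labels. The sketch below illustrates only that label-free principle, using a deliberately simpler stand-in: a site-independent frequency model in place of the deep generative model the abstract describes. It is not the authors' method. The toy alignment, the pseudocount and the function names `site_log_probs` and `variant_score` are hypothetical, and the toy sequences are assumed to be gap-free and of equal length.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def site_log_probs(alignment, pseudocount=1.0):
    """Per-position log-probabilities of each amino acid in an alignment."""
    total = len(alignment) + pseudocount * len(ALPHABET)
    tables = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        tables.append(
            {aa: math.log((counts[aa] + pseudocount) / total) for aa in ALPHABET}
        )
    return tables


def variant_score(tables, wild_type, variant):
    """Log-probability of the variant minus that of the wild type.

    More negative values mean the substitution is rarer in the alignment,
    which this toy model reads as more likely to be damaging.
    """
    return sum(
        tables[i][v] - tables[i][w]
        for i, (w, v) in enumerate(zip(wild_type, variant))
        if w != v
    )


# Hypothetical toy alignment: position 0 is fully conserved, position 2 varies.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKVL"]
tables = site_log_probs(toy_alignment)

conserved_hit = variant_score(tables, "MKVL", "WKVL")  # mutates conserved site
tolerated_hit = variant_score(tables, "MKVL", "MKIL")  # substitution seen in alignment
```

In this toy case the substitution at the conserved position scores lower than the one already observed in the alignment. A site-independent model ignores dependencies between positions, which a deep generative model can in principle capture.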
