Results 1 - 20 of 33
1.
bioRxiv ; 2024 Mar 17.
Article in English | MEDLINE | ID: mdl-38559182

ABSTRACT

Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.

2.
Nat Chem Eng ; 1(1): 97-107, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38468718

ABSTRACT

Protein engineering has nearly limitless applications across chemistry, energy and medicine, but creating new proteins with improved or novel functions remains slow, labor-intensive and inefficient. Here we present the Self-driving Autonomous Machines for Protein Landscape Exploration (SAMPLE) platform for fully autonomous protein engineering. SAMPLE is driven by an intelligent agent that learns protein sequence-function relationships, designs new proteins and sends designs to a fully automated robotic system that experimentally tests the designed proteins and provides feedback to improve the agent's understanding of the system. We deploy four SAMPLE agents with the goal of engineering glycoside hydrolase enzymes with enhanced thermal tolerance. Despite showing individual differences in their search behavior, all four agents quickly converge on thermostable enzymes. Self-driving laboratories automate and accelerate the scientific discovery process and hold great potential for the fields of protein engineering and synthetic biology.

3.
bioRxiv ; 2023 Nov 09.
Article in English | MEDLINE | ID: mdl-37987009

ABSTRACT

Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape.

4.
PLoS Comput Biol ; 19(3): e1010956, 2023 03.
Article in English | MEDLINE | ID: mdl-36857380

ABSTRACT

Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.


Subject(s)
Genetic Fitness , Proteins , Genetic Fitness/genetics , Proteins/genetics , Proteins/metabolism , Mutation/genetics , Tetrahydrofolate Dehydrogenase/genetics , Tetrahydrofolate Dehydrogenase/metabolism , Amino Acid Sequence , Evolution, Molecular , Models, Genetic , Epistasis, Genetic
5.
Protein Sci ; 32(4): e4597, 2023 04.
Article in English | MEDLINE | ID: mdl-36794431

ABSTRACT

Angiotensin-converting enzyme 2 (ACE2) has been investigated for its ability to beneficially modulate the angiotensin receptor (ATR) therapeutic axis to treat multiple human diseases. Its broad substrate scope and diverse physiological roles, however, limit its potential as a therapeutic agent. In this work, we address this limitation by establishing a yeast display-based liquid chromatography screen that enabled use of directed evolution to discover ACE2 variants that possess both wild-type or greater Ang-II hydrolytic activity and improved specificity toward Ang-II relative to the off-target peptide substrate Apelin-13. To obtain these results, we screened ACE2 active site libraries to reveal three substitution-tolerant positions (M360, T371, and Y510) that can be mutated to enhance ACE2's activity profile and followed up on these hits with focused double mutant libraries to further improve the enzyme. Relative to wild-type ACE2, our top variant (T371L/Y510Ile) displayed a sevenfold increase in Ang-II turnover number (kcat), a sixfold diminished catalytic efficiency (kcat/Km) on Apelin-13, and an overall decreased activity on other ACE2 substrates that were not directly assayed in the directed evolution screen. At physiologically relevant substrate concentrations, T371L/Y510Ile hydrolyzes as much or more Ang-II than wild-type ACE2 with concomitant Ang-II:Apelin-13 specificity improvements reaching 30-fold. Our efforts have delivered ATR axis-acting therapeutic candidates with relevance to both established and unexplored ACE2 therapeutic applications and provide a foundation for further ACE2 engineering efforts.


Subject(s)
Angiotensin-Converting Enzyme 2 , Peptidyl-Dipeptidase A , Humans , Peptidyl-Dipeptidase A/genetics , Peptide Fragments , Angiotensin I , Peptides
6.
Cell Rep Methods ; 2(7): 100242, 2022 07 18.
Article in English | MEDLINE | ID: mdl-35880021

ABSTRACT

In this work, we developed a simple and robust assay to rapidly detect SNPs in nucleic acid samples. Our approach combines loop-mediated isothermal amplification (LAMP)-based target amplification with fluorescent probes to detect SNPs with high specificity. A competitive "sink" strand preferentially binds to non-SNP amplicons and shifts the free energy landscape to favor specific activation by SNP products. We demonstrated the broad utility and reliability of our SNP-LAMP method by detecting three distinct SNPs across the human genome. We also designed an assay to rapidly detect highly transmissible severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants from crude biological samples. This work demonstrates that competitive SNP-LAMP is a powerful and universal method that could be applied in point-of-care settings to detect any target SNP with high specificity and sensitivity. We additionally developed a publicly available web application for researchers to design SNP-LAMP probes for any target sequence of interest.


Subject(s)
COVID-19 , Polymorphism, Single Nucleotide , Humans , Polymorphism, Single Nucleotide/genetics , COVID-19/genetics , SARS-CoV-2/genetics , Reproducibility of Results , Point-of-Care Systems
7.
Curr Opin Biotechnol ; 75: 102713, 2022 06.
Article in English | MEDLINE | ID: mdl-35413604

ABSTRACT

Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.


Subject(s)
Machine Learning , Protein Engineering , Amino Acid Sequence , Biotechnology , Protein Engineering/methods , Proteins/chemistry
8.
Protein Eng Des Sel ; 35, 2022 02 17.
Article in English | MEDLINE | ID: mdl-35174856

ABSTRACT

Understanding how severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) interacts with different mammalian angiotensin-converting enzyme II (ACE2) cell entry receptors elucidates determinants of virus transmission and facilitates development of vaccines for humans and animals. Yeast display-based directed evolution identified conserved ACE2 mutations that increase spike binding across multiple species. Gln42Leu increased ACE2-spike binding for human and four of four other mammalian ACE2s; Leu79Ile had an effect for human and three of three mammalian ACE2s. These residues are highly represented, 83% for Gln42 and 56% for Leu79, among mammalian ACE2s. The above findings can be important in protecting humans and animals from existing and future SARS-CoV-2 variants.


Subject(s)
COVID-19 , SARS-CoV-2 , Angiotensin-Converting Enzyme 2 , Animals , Humans , Mutation , Protein Binding , Saccharomyces cerevisiae/metabolism , Spike Glycoprotein, Coronavirus/genetics
10.
Cell Death Discov ; 8(1): 7, 2022 Jan 10.
Article in English | MEDLINE | ID: mdl-35013287

ABSTRACT

The human caspase family comprises 12 cysteine proteases that are centrally involved in cell death and inflammation responses. The members of this family have conserved sequences and structures, highly similar enzymatic activities and substrate preferences, and overlapping physiological roles. In this paper, we present a deep mutational scan of the executioner caspases CASP3 and CASP7 to dissect differences in their structure, function, and regulation. Our approach leverages high-throughput microfluidic screening to analyze hundreds of thousands of caspase variants in tightly controlled in vitro reactions. The resulting data provides a large-scale and unbiased view of the impact of amino acid substitutions on the proteolytic activity of CASP3 and CASP7. We use this data to pinpoint key functional differences between CASP3 and CASP7, including a secondary internal cleavage site, CASP7 Q196, that is not present in CASP3. Our results will open avenues for inquiry in caspase function and regulation that could potentially inform the development of future caspase-specific therapeutics.

11.
bioRxiv ; 2022 Jan 04.
Article in English | MEDLINE | ID: mdl-33758860

ABSTRACT

Understanding how SARS-CoV-2 interacts with different mammalian angiotensin-converting enzyme II (ACE2) cell entry receptors elucidates determinants of virus transmission and facilitates development of vaccines for humans and animals. Yeast display-based directed evolution identified conserved ACE2 mutations that increase spike binding across multiple species. Gln42Leu increased ACE2-spike binding for human and four of four other mammalian ACE2s; Leu79Ile had an effect for human and three of three mammalian ACE2s. These residues are highly represented, 83% for Gln42 and 56% for Leu79, among mammalian ACE2s. The above findings can be important in protecting humans and animals from existing and future SARS-CoV-2 variants.

12.
Proc Natl Acad Sci U S A ; 118(48), 2021 11 30.
Article in English | MEDLINE | ID: mdl-34815338

ABSTRACT

The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.


Subject(s)
Amino Acid Sequence/genetics , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence/physiology , Biochemical Phenomena , Deep Learning , Machine Learning , Mutation , Neural Networks, Computer , Proteins/metabolism , Structure-Activity Relationship
13.
Nat Commun ; 12(1): 5825, 2021 10 05.
Article in English | MEDLINE | ID: mdl-34611172

ABSTRACT

Alcohol-forming fatty acyl reductases (FARs) catalyze the reduction of thioesters to alcohols and are key enzymes for microbial production of fatty alcohols. Many metabolic engineering strategies utilize FARs to produce fatty alcohols from intracellular acyl-CoA and acyl-ACP pools; however, enzyme activity, especially on acyl-ACPs, remains a significant bottleneck to high-flux production. Here, we engineer FARs with enhanced activity on acyl-ACP substrates by implementing a machine learning (ML)-driven approach to iteratively search the protein fitness landscape. Over the course of ten design-test-learn rounds, we engineer enzymes that produce over twofold more fatty alcohols than the starting natural sequences. We characterize the top sequence and show that it has an enhanced catalytic rate on palmitoyl-ACP. Finally, we analyze the sequence-function data to identify features, like the net charge near the substrate-binding site, that correlate with in vivo activity. This work demonstrates the power of ML to navigate the fitness landscape of traditionally difficult-to-engineer proteins.


Subject(s)
Aldehyde Oxidoreductases/metabolism , Fatty Alcohols/metabolism , Machine Learning , Aldehyde Oxidoreductases/genetics , Metabolic Engineering/methods
14.
Metab Eng ; 67: 216-226, 2021 09.
Article in English | MEDLINE | ID: mdl-34229079

ABSTRACT

In order to make renewable fuels and chemicals from microbes, new methods are required to engineer microbes more intelligently. Computational approaches to engineer strains for enhanced chemical production typically rely on detailed mechanistic models (e.g., kinetic/stoichiometric models of metabolism), which require many experimental datasets for their parameterization, while experimental methods may require screening large mutant libraries to explore the design space for the few mutants with desired behaviors. To address these limitations, we developed an active and machine learning approach (ActiveOpt) to intelligently guide experiments to arrive at an optimal phenotype with minimal measured datasets. ActiveOpt was applied to two separate case studies to evaluate its potential to increase valine yields and neurosporene productivity in Escherichia coli. In both cases, ActiveOpt identified the best performing strain in fewer experiments than the case studies used. This work demonstrates that machine and active learning approaches have the potential to greatly facilitate metabolic engineering efforts to rapidly achieve their objectives.


Subject(s)
Machine Learning , Metabolic Engineering , Escherichia coli/genetics , Phenotype
15.
Nucleic Acids Res ; 49(18): e103, 2021 10 11.
Article in English | MEDLINE | ID: mdl-34233007

ABSTRACT

Experimental methods that capture the individual properties of single cells are revealing the key role of cell-to-cell variability in countless biological processes. These single-cell methods are becoming increasingly important across the life sciences in fields such as immunology, regenerative medicine and cancer biology. In addition to high-dimensional transcriptomic techniques such as single-cell RNA sequencing, there is a need for fast, simple and high-throughput assays to enumerate cell samples based on RNA biomarkers. In this work, we present single-cell nucleic acid profiling in droplets (SNAPD) to analyze sets of transcriptional markers in tens of thousands of single mammalian cells. Individual cells are encapsulated in aqueous droplets on a microfluidic chip and the RNA markers in each cell are amplified. Molecular logic circuits then integrate these amplicons to categorize cells based on the transcriptional markers and produce a detectable fluorescence output. SNAPD is capable of analyzing over 100,000 cells per hour and can be used to quantify distinct cell types within heterogeneous populations, detect rare cells at frequencies down to 0.1% and enrich specific cell types using microfluidic sorting. SNAPD provides a simple, rapid, low cost and scalable approach to study complex phenotypes in heterogeneous cell populations.


Subject(s)
High-Throughput Screening Assays/methods , Microfluidic Analytical Techniques/methods , Microfluidics/methods , Nucleic Acids/analysis , Single-Cell Analysis/methods , Cell Line , Humans , Lab-On-A-Chip Devices , Transcriptome
16.
PLoS One ; 16(5): e0251585, 2021.
Article in English | MEDLINE | ID: mdl-33979391

ABSTRACT

Understanding how human ACE2 genetic variants differ in their recognition by SARS-CoV-2 can facilitate the leveraging of ACE2 as an axis for treating and preventing COVID-19. In this work, we experimentally interrogate thousands of ACE2 mutants to identify over one hundred human single-nucleotide variants (SNVs) that are likely to have altered recognition by the virus, and make the complementary discovery that ACE2 residues distant from the spike interface influence the ACE2-spike interaction. These findings illuminate new links between ACE2 sequence and spike recognition, and could find substantial utility in further fundamental research that augments epidemiological analyses and clinical trial design in the contexts of both existing strains of SARS-CoV-2 and novel variants that may arise in the future.


Subject(s)
Angiotensin-Converting Enzyme 2/genetics , COVID-19/metabolism , Spike Glycoprotein, Coronavirus/genetics , Angiotensin-Converting Enzyme 2/metabolism , Binding Sites/genetics , COVID-19/genetics , Genetic Variation/genetics , Humans , Models, Molecular , Peptidyl-Dipeptidase A/metabolism , Polymorphism, Single Nucleotide/genetics , Protein Binding/genetics , Receptors, Virus/genetics , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/pathogenicity , Spike Glycoprotein, Coronavirus/metabolism , Virus Replication/genetics
17.
Cell Syst ; 12(1): 92-101.e8, 2021 01 20.
Article in English | MEDLINE | ID: mdl-33212013

ABSTRACT

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.


Subject(s)
Machine Learning , Proteins , Amino Acid Sequence
18.
bioRxiv ; 2020 Sep 17.
Article in English | MEDLINE | ID: mdl-32995796

ABSTRACT

Understanding how human ACE2 genetic variants differ in their recognition by SARS-CoV-2 can have a major impact in leveraging ACE2 as an axis for treating and preventing COVID-19. In this work, we experimentally interrogate thousands of ACE2 mutants to identify over one hundred human single-nucleotide variants (SNVs) that are likely to have altered recognition by the virus, and make the complementary discovery that ACE2 residues distant from the spike interface can have a strong influence upon the ACE2-spike interaction. These findings illuminate new links between ACE2 sequence and spike recognition, and will find wide-ranging utility in SARS-CoV-2 fundamental research, epidemiological analyses, and clinical trial design.

19.
Nat Commun ; 11(1): 2418, 2020 05 15.
Article in English | MEDLINE | ID: mdl-32415107

ABSTRACT

The spatial organization of microbial communities arises from a complex interplay of biotic and abiotic interactions, and is a major determinant of ecosystem functions. Here we design a microfluidic platform to investigate how the spatial arrangement of microbes impacts gene expression and growth. We elucidate key biochemical parameters that dictate the mapping between spatial positioning and gene expression patterns. We show that distance can establish a low-pass filter to periodic inputs and can enhance the fidelity of information processing. Positive and negative feedback can play disparate roles in the synchronization and robustness of a genetic oscillator distributed between two strains to spatial separation. Quantification of growth and metabolite release in an amino-acid auxotroph community demonstrates that the interaction network and stability of the community are highly sensitive to temporal perturbations and spatial arrangements. In sum, our microfluidic platform can quantify spatiotemporal parameters influencing diffusion-mediated interactions in microbial consortia.


Subject(s)
Lab-On-A-Chip Devices , Microbial Consortia , Signal Transduction , Ecology , Ecosystem , Equipment Design , Escherichia coli/physiology , Gastrointestinal Microbiome , Gene Expression Regulation, Bacterial , Microfluidics/instrumentation , Models, Genetic , Oscillometry , Quorum Sensing
20.
Cell Syst ; 9(3): 229-242.e4, 2019 09 25.
Article in English | MEDLINE | ID: mdl-31494089

ABSTRACT

Microbial interactions are major drivers of microbial community dynamics and functions but remain challenging to identify because of limitations in parallel culturing and absolute abundance quantification of community members across environments and replicates. To this end, we developed Microbial Interaction Network Inference in microdroplets (MINI-Drop). Fluorescence microscopy coupled to computer vision techniques was used to rapidly determine the absolute abundance of each strain in hundreds to thousands of droplets per condition. We showed that MINI-Drop could accurately infer pairwise and higher-order interactions in synthetic consortia. We developed a stochastic model of community assembly to provide insight into the heterogeneity in community states across droplets. Finally, we elucidated the complex web of interactions linking antibiotics and different species in a synthetic consortium. In sum, we demonstrated a robust and generalizable method to infer microbial interaction networks by random encapsulation of sub-communities into microfluidic droplets.


Subject(s)
Lipid Droplets/microbiology , Microbial Consortia/physiology , Microbial Interactions/physiology , Microfluidics/methods , Animals , Anti-Bacterial Agents/metabolism , Biodiversity , Host-Pathogen Interactions , Humans , Microscopy, Fluorescence