
Inferring protein fitness landscapes from laboratory evolution experiments.

D'Costa, Sameer; Hinds, Emily C; Freschlin, Chase R; Song, Hyebin; Romero, Philip A.

PLoS Comput Biol; 19(3): e1010956, 2023 Mar.

Article in English | MEDLINE | ID: mdl-36857380

ABSTRACT

Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.

Subject(s)

Genetic Fitness, Proteins, Genetic Fitness/genetics, Proteins/genetics, Proteins/metabolism, Mutation/genetics, Tetrahydrofolate Dehydrogenase/genetics, Tetrahydrofolate Dehydrogenase/metabolism, Amino Acid Sequence, Evolution, Molecular, Models, Genetic, Epistasis, Genetic

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning.

Song, Hyebin; Bremer, Bennett J; Hinds, Emily C; Raskutti, Garvesh; Romero, Philip A.

Cell Syst; 12(1): 92-101.e8, 2021 Jan 20.

Article in English | MEDLINE | ID: mdl-33212013

ABSTRACT

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.

Subject(s)

Machine Learning, Proteins, Amino Acid Sequence
