ABSTRACT
Protein developability is requisite for use in therapeutic, diagnostic, or industrial applications. Many developability assays are low throughput, which limits their utility to the later stages of protein discovery and evolution. Recent approaches enable experimental or computational assessment of many more variants, yet the breadth of applicability across protein families and developability metrics is uncertain. Here, three library-scale assays (on-yeast protease, split green fluorescent protein (GFP), and non-specific binding) were evaluated for their ability to predict two key developability outcomes (thermal stability and recombinant expression) for the small protein scaffolds affibody and fibronectin. The assays' predictive capabilities were assessed via both linear correlation and machine learning models trained on the library-scale assay data. The on-yeast protease assay is highly predictive of thermal stability for both scaffolds, and the split-GFP assay is informative of affibody thermal stability and expression. The library-scale data were used to map sequence-developability landscapes for affibody and fibronectin binding paratopes, which guide future design of variants and libraries.
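The linear-correlation assessment mentioned above can be illustrated with a minimal sketch. The assay scores and melting temperatures below are hypothetical values made up for illustration, not data from the study, and `pearson_r` is a helper name introduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-variant values (assumption, for illustration only):
# a library-scale assay score and a separately measured melting temperature.
protease_score = [0.10, 0.35, 0.50, 0.72, 0.90]
melting_temp_c = [48.0, 55.0, 61.0, 66.0, 74.0]

r = pearson_r(protease_score, melting_temp_c)
print(f"Pearson r between assay score and thermal stability: {r:.3f}")
```

A coefficient near 1 for such a pairing would indicate that the high-throughput score ranks variants much as the low-throughput measurement does; the study's machine learning models are a separate, richer assessment that this sketch does not reproduce.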
Subjects
Fibronectins, Recombinant Fusion Proteins, Fibronectins/chemistry, Fibronectins/genetics, Fibronectins/metabolism, Recombinant Fusion Proteins/genetics, Recombinant Fusion Proteins/chemistry, Recombinant Fusion Proteins/metabolism, Green Fluorescent Proteins/genetics, Green Fluorescent Proteins/chemistry, Green Fluorescent Proteins/metabolism, Protein Engineering/methods, Peptide Library, Protein Stability, Protein Binding, Humans
ABSTRACT
Antimicrobial peptides (AMPs) are essential elements of natural cellular combat and candidates for antibiotic therapy. Elevated function may be needed for robust physiological performance. Yet, both pure protein design and combinatorial library discovery are hindered by the complexity of antimicrobial activity. We applied a recently developed high-throughput technique, sequence-activity mapping of AMPs via depletion (SAMP-Dep), to proline-rich AMPs. Robust self-inhibition was achieved for metalnikowin 1 (Met) and apidaecin 1b (Api). SAMP-Dep exhibited high reproducibility, with correlation coefficients of 0.90 and 0.92 between replicates for Met and Api, respectively, and 0.99 and 0.96 between synonymous genetic variants. Sequence-activity maps were obtained via characterization of 26,000 and 34,000 mutants of Met and Api, respectively. Both AMPs exhibit similar mutational profiles, including beneficial mutations at one terminus (the C-terminus for Met and the N-terminus for Api), which is consistent with their opposite binding orientations in the ribosome. While Met and Api both reside within the family of proline-rich AMPs, different proline sites exhibit substantially different mutational tolerance. Within the PRP motif, proline mutation eliminates activity, whereas non-PRP prolines readily tolerate mutation. Homologous mutations are better tolerated, particularly at alternating sites on one 'face' of the peptide. Important and consistent epistasis was observed following the PRP domain, within the segment that extends into the ribosomal exit tunnel, for both peptides. Variants identified from the SAMP-Dep platform were produced and applied exogenously to Gram-negative species, showing either increased potency or increased specificity for the strains tested. In addition to mapping sequence-activity space for fundamental insight and therapeutic engineering, the results advance the robustness of the SAMP-Dep platform for activity characterization.
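The abstract does not state how depletion is converted into an activity score. As a hedged illustration only, one common way to score depletion in pooled sequencing experiments is the log2 change in a variant's read frequency between two conditions. The function name, the pseudocount, the variant labels, and the counts below are all assumptions and are not taken from the SAMP-Dep method.

```python
import math


def depletion_scores(pre_counts, post_counts, pseudocount=0.5):
    """Log2 change in each variant's read frequency, post versus pre.

    Negative scores mean the variant was depleted from the population.
    The pseudocount (an assumption) avoids taking the log of zero.
    """
    variants = sorted(set(pre_counts) | set(post_counts))
    pre_total = sum(pre_counts.get(v, 0) + pseudocount for v in variants)
    post_total = sum(post_counts.get(v, 0) + pseudocount for v in variants)
    scores = {}
    for v in variants:
        pre_freq = (pre_counts.get(v, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(v, 0) + pseudocount) / post_total
        scores[v] = math.log2(post_freq / pre_freq)
    return scores


# Hypothetical read counts before and after peptide expression is induced.
pre = {"variant_A": 100, "variant_B": 100}
post = {"variant_A": 10, "variant_B": 190}

scores = depletion_scores(pre, post)
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

Under this assumed scoring, a variant whose expression inhibits its own host cell (here the made-up `variant_A`) drops in frequency and receives a negative score. Reproducibility figures like those quoted above would then come from correlating such scores between replicates or between synonymous genetic variants.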
ABSTRACT
Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability (quantified by expression, solubility, and stability) hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10^5 of 10^20 possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from a HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction, and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space, giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
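The phrase "convolves learned amino acid properties" can be made concrete with a minimal forward-pass sketch: each residue is mapped to a small embedding vector, a 1D convolution slides over the embedded sequence, and the pooled features are combined into one predicted value. This is not the authors' architecture. The dimensions, the random untrained weights, the example sequence, and every function name are assumptions chosen only to show the data flow.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def init_params(embed_dim=3, kernel_width=3, n_filters=2, seed=0):
    """Random, untrained parameters (assumption: sizes are illustrative)."""
    rng = random.Random(seed)
    # One embedding vector per amino acid; training would learn these.
    embedding = {aa: [rng.gauss(0, 1) for _ in range(embed_dim)]
                 for aa in AMINO_ACIDS}
    # Each filter spans kernel_width positions by embed_dim properties.
    kernels = [[[rng.gauss(0, 0.5) for _ in range(embed_dim)]
                for _ in range(kernel_width)]
               for _ in range(n_filters)]
    out_weights = [rng.gauss(0, 0.5) for _ in range(n_filters)]
    return embedding, kernels, out_weights


def predict_expression(seq, params):
    """Embed the sequence, convolve, apply ReLU, max-pool, and sum linearly."""
    embedding, kernels, out_weights = params
    embedded = [embedding[aa] for aa in seq]  # length x embed_dim
    width = len(kernels[0])
    if len(seq) < width:
        raise ValueError("sequence is shorter than the convolution kernel")
    pooled = []
    for kernel in kernels:
        activations = []
        for start in range(len(embedded) - width + 1):
            total = sum(kernel[k][d] * embedded[start + k][d]
                        for k in range(width)
                        for d in range(len(kernel[k])))
            activations.append(max(0.0, total))  # ReLU
        pooled.append(max(activations))  # global max pooling per filter
    return sum(w * p for w, p in zip(out_weights, pooled))


params = init_params()
# Hypothetical sequence for illustration, not a Gp2 variant.
prediction = predict_expression("ACDKLMWY", params)
print(f"untrained prediction: {prediction:.3f}")
```

Because the weights are random, the printed number has no meaning. In a trained model of this general shape, the embedding table is the part that could be inspected for amino acid properties such as hydrophobicity or charge.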