1.
PLoS Comput Biol ; 19(5): e1011162, 2023 05.
Article in English | MEDLINE | ID: mdl-37220151

ABSTRACT

Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.


Subjects
Biological Products , Bacterial Genome , Metagenome , Multigene Family/genetics , Biological Products/metabolism , Supervised Machine Learning
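The masked-language-model setup described in the abstract of record 1 can be illustrated with a minimal sketch: a BGC is written as a chain of protein-domain tokens, and a fraction of those tokens is hidden so that a model can be trained to recover them. The domain identifiers, the `[MASK]` token, the 15% masking rate, and the function name `mask_domains` are all assumptions made for illustration; none of them is taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_domains(domains, rate=0.15, seed=0):
    """Return (masked_chain, targets) for one BGC domain chain.

    targets maps each masked position to the original domain that a
    model would be trained to recover.
    """
    rng = random.Random(seed)
    # Mask at least one position so every chain yields a training signal.
    n_mask = max(1, round(rate * len(domains)))
    positions = sorted(rng.sample(range(len(domains)), n_mask))
    masked = list(domains)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


# Hypothetical chain of Pfam-style domain identifiers.
chain = ["PF00109", "PF02801", "PF00698", "PF08659",
         "PF00550", "PF00975", "PF00501", "PF00668"]
masked_chain, targets = mask_domains(chain)
```

The training objective would then score a model on how well it predicts each entry of `targets` from `masked_chain`; the model itself is outside the scope of this sketch.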
2.
PLoS Comput Biol ; 18(2): e1009853, 2022 02.
Article in English | MEDLINE | ID: mdl-35143485

ABSTRACT

Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.


Subjects
Machine Learning , Proteins , Amino Acid Sequence , Drug Discovery , Proteins/chemistry , Substrate Specificity
3.
PLoS Comput Biol ; 18(5): e1010045, 2022 05.
Article in English | MEDLINE | ID: mdl-35500014

ABSTRACT

Identifying structural differences among proteins can be a non-trivial task. When contrasting ensembles of protein structures obtained from molecular dynamics simulations, biologically-relevant features can be easily overshadowed by spurious fluctuations. Here, we present SINATRA Pro, a computational pipeline designed to robustly identify topological differences between two sets of protein structures. Algorithmically, SINATRA Pro works by first taking in the 3D atomic coordinates for each protein snapshot and summarizing them according to their underlying topology. Statistically significant topological features are then projected back onto a user-selected representative protein structure, thus facilitating the visual identification of biophysical signatures of different protein ensembles. We assess the ability of SINATRA Pro to detect minute conformational changes in five independent protein systems of varying complexities. In all test cases, SINATRA Pro identifies known structural features that have been validated by previous experimental and computational studies, as well as novel features that are also likely to be biologically-relevant according to the literature. These results highlight SINATRA Pro as a promising method for facilitating the non-trivial task of pattern recognition in trajectories resulting from molecular dynamics simulations, with substantially increased resolution.


Subjects
Data Science , Molecular Dynamics Simulation , Biophysics , Protein Conformation , Proteins/chemistry
4.
Nat Methods ; 16(8): 687-694, 2019 08.
Article in English | MEDLINE | ID: mdl-31308553

ABSTRACT

Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence-function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.


Subjects
Algorithms , Directed Molecular Evolution , Machine Learning , Biological Models , Protein Engineering/methods , Proteins/metabolism , Humans , Proteins/genetics
5.
Nat Methods ; 16(11): 1176-1184, 2019 11.
Article in English | MEDLINE | ID: mdl-31611694

ABSTRACT

We engineered light-gated channelrhodopsins (ChRs) whose current strength and light sensitivity enable minimally invasive neuronal circuit interrogation. Current ChR tools applied to the mammalian brain require intracranial surgery for transgene delivery and implantation of fiber-optic cables to produce light-dependent activation of a small volume of tissue. To facilitate expansive optogenetics without the need for invasive implants, our engineering approach leverages the substantial literature of ChR variants to train statistical models for the design of high-performance ChRs. With Gaussian process models trained on a limited experimental set of 102 functionally characterized ChRs, we designed high-photocurrent ChRs with high light sensitivity. Three of these, ChRger1-3, enable optogenetic activation of the nervous system via systemic transgene delivery. ChRger2 enables light-induced neuronal excitation without fiber-optic implantation; that is, this opsin enables transcranial optogenetics.


Subjects
Channelrhodopsins/genetics , Machine Learning , Optogenetics , Protein Engineering/methods , Animals , Channelrhodopsins/physiology , HEK293 Cells , Humans , Mice , Inbred C57BL Mice
6.
Proc Natl Acad Sci U S A ; 114(13): E2624-E2633, 2017 03 28.
Article in English | MEDLINE | ID: mdl-28283661

ABSTRACT

Integral membrane proteins (MPs) are key engineering targets due to their critical roles in regulating cell function. In engineering MPs, it can be extremely challenging to retain membrane localization capability while changing other desired properties. We have used structure-guided SCHEMA recombination to create a large set of functionally diverse chimeras from three sequence-diverse channelrhodopsins (ChRs). We chose 218 ChR chimeras from two SCHEMA libraries and assayed them for expression and plasma membrane localization in human embryonic kidney cells. The majority of the chimeras express, with 89% of the tested chimeras outperforming the lowest-expressing parent; 12% of the tested chimeras express at even higher levels than any of the parents. A significant fraction (23%) also localize to the membrane better than the lowest-performing parent ChR. Most (93%) of these well-localizing chimeras are also functional light-gated channels. Many chimeras have stronger light-activated inward currents than the three parents, and some have unique off-kinetics and spectral properties relative to the parents. An effective method for generating protein sequence and functional diversity, SCHEMA recombination can be used to gain insights into sequence-function relationships in MPs.


Subjects
Channelrhodopsins/analysis , Recombinant Fusion Proteins/analysis , Rhodopsin/analysis , Channelrhodopsins/genetics , Channelrhodopsins/metabolism , HEK293 Cells , Humans , Molecular Models , Recombinant Fusion Proteins/genetics , Recombinant Fusion Proteins/metabolism , Rhodopsin/genetics , Rhodopsin/metabolism
7.
Bioinformatics ; 34(15): 2642-2648, 2018 08 01.
Article in English | MEDLINE | ID: mdl-29584811

ABSTRACT

Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling. Results: The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured. Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Machine Learning , Biological Models , Proteins/chemistry , Protein Sequence Analysis/methods , Software , Amino Acid Sequence , Bacteria/metabolism , Eukaryota/metabolism , Humans , Proteins/metabolism , Proteins/physiology
8.
PLoS Comput Biol ; 13(10): e1005786, 2017 Oct.
Article in English | MEDLINE | ID: mdl-29059183

ABSTRACT

There is growing interest in studying and engineering integral membrane proteins (MPs) that play key roles in sensing and regulating cellular response to diverse external signals. An MP must be expressed, correctly inserted and folded in a lipid bilayer, and trafficked to the proper cellular location in order to function. The sequence and structural determinants of these processes are complex and highly constrained. Here we describe a predictive, machine-learning approach that captures this complexity to facilitate successful MP engineering and design. Machine learning on carefully-chosen training sequences made by structure-guided SCHEMA recombination has enabled us to accurately predict the rare sequences in a diverse library of channelrhodopsins (ChRs) that express and localize to the plasma membrane of mammalian cells. These light-gated channel proteins of microbial origin are of interest for neuroscience applications, where expression and localization to the plasma membrane is a prerequisite for function. We trained Gaussian process (GP) classification and regression models with expression and localization data from 218 ChR chimeras chosen from a 118,098-variant library designed by SCHEMA recombination of three parent ChRs. We use these GP models to identify ChRs that express and localize well and show that our models can elucidate sequence and structure elements important for these processes. We also used the predictive models to convert a naturally occurring ChR incapable of mammalian localization into one that localizes well.


Subjects
Cell Membrane/chemistry , Drug Design , Ion Channels/chemistry , Lipid Bilayers/chemistry , Machine Learning , Rhodopsin/chemistry , Protein Sequence Analysis/methods , Cell Membrane/ultrastructure , HEK293 Cells , Humans , Ion Channels/ultrastructure , Rhodopsin/ultrastructure , Structure-Activity Relationship , Subcellular Fractions/chemistry
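The library size quoted in the abstract of record 8 (118,098 variants from three parents) is consistent with a simple combinatorial count. The sketch below assumes, without confirmation from the abstract, that the total comes from two libraries (record 6 mentions two SCHEMA libraries) and that each library splits the protein into 10 blocks, each of which can be inherited from any of the three parents. The block count is an assumption chosen because it reproduces the quoted total.

```python
from itertools import product

N_PARENTS = 3    # from the abstract: three parent ChRs
N_BLOCKS = 10    # assumption: blocks per library, chosen to match the total
N_LIBRARIES = 2  # assumption: two libraries, as mentioned in record 6


def library_size(n_parents, n_blocks):
    """Number of chimeras when every block comes from any one parent."""
    return n_parents ** n_blocks


def enumerate_chimeras(n_parents, n_blocks):
    """Yield each chimera as a string of parent indices, one per block."""
    for choice in product(range(n_parents), repeat=n_blocks):
        yield "".join(str(parent) for parent in choice)


total = N_LIBRARIES * library_size(N_PARENTS, N_BLOCKS)

# A tiny two-block example shows the encoding: "00", "01", ..., "22".
small = list(enumerate_chimeras(3, 2))
```

Under these assumptions `total` equals 2 × 3^10 = 118,098, so the 218 assayed chimeras cover well under 1% of the design space, which is why a predictive model is useful for choosing what to test.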
9.
10.
Cell Syst ; 15(3): 286-294.e2, 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-38428432

ABSTRACT

Pretrained protein sequence language models have been shown to improve the performance of many prediction tasks and are now routinely integrated into bioinformatics tools. However, these models largely rely on the transformer architecture, which scales quadratically with sequence length in both run-time and memory. Therefore, state-of-the-art models have limitations on sequence length. To address this limitation, we investigated whether convolutional neural network (CNN) architectures, which scale linearly with sequence length, could be as effective as transformers in protein language models. With masked language model pretraining, CNNs are competitive with, and occasionally superior to, transformers across downstream applications while maintaining strong performance on sequences longer than those allowed in the current state-of-the-art transformer models. Our work suggests that computational efficiency can be improved without sacrificing performance, simply by using a CNN architecture instead of a transformer, and emphasizes the importance of disentangling pretraining task and model architecture. A record of this paper's transparent peer review process is included in the supplemental information.


Subjects
Computational Biology , Computer Neural Networks , Amino Acid Sequence , Peer Review
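The scaling argument in the abstract of record 10 can be made concrete with a rough operation count. The formulas below are textbook-style estimates rather than figures from the paper: self-attention is counted as the two length-by-length matrix products (scores and value mixing), and the convolution as a dense 1D layer with equal input and output channels. The kernel width, hidden dimension, and function names are assumptions for illustration.

```python
def attention_cost(length, dim):
    """Approximate multiply-adds for the score and mixing steps of one
    self-attention layer: two (L x L x d) products, i.e. O(L^2 * d)."""
    return 2 * length * length * dim


def conv_cost(length, dim, kernel=5):
    """Approximate multiply-adds for one dense 1D convolution layer with
    dim input and dim output channels: O(L * k * d^2)."""
    return length * kernel * dim * dim


DIM = 64  # hypothetical hidden size
for length in (1024, 2048, 4096):
    print(length, attention_cost(length, DIM), conv_cost(length, DIM))
```

Doubling the sequence length quadruples the attention estimate but only doubles the convolution estimate, which is the sense in which one scales quadratically and the other linearly.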
11.
Nat Biotechnol ; 2024 Apr 23.
Article in English | MEDLINE | ID: mdl-38653796

ABSTRACT

In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro enzyme activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics and models will drive protein engineering research by serving as a benchmark for generative protein sequence models and helping to select active variants for experimental testing.

12.
Nat Commun ; 15(1): 1059, 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-38316764

ABSTRACT

The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a diffusion-based generative model that generates protein backbone structures via a procedure inspired by the natural folding process. We describe a protein backbone structure as a sequence of angles capturing the relative orientation of the constituent backbone atoms, and generate structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins natively twist into energetically favorable conformations, the inherent shift and rotational invariance of this representation crucially alleviates the need for more complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release an open-source codebase and trained models for protein structure diffusion.


Subjects
Protein Folding , Proteins , Proteins/metabolism , Computer Neural Networks , Protein Conformation
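The invariance claim in the abstract of record 12 can be checked numerically. The sketch below computes one dihedral angle from four atom positions and shows that it is unchanged when the whole structure is rotated and translated. The coordinates, the rotation, and the helper names are illustrative assumptions; the paper's actual angle set and conventions are not given in the abstract.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four consecutive atoms."""
    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Remove the components of b0 and b2 that lie along the central bond.
    k0, k2 = dot(b0, b1), dot(b2, b1)
    v = (b0[0] - k0 * b1[0], b0[1] - k0 * b1[1], b0[2] - k0 * b1[2])
    w = (b2[0] - k2 * b1[0], b2[1] - k2 * b1[1], b2[2] - k2 * b1[2])
    return math.atan2(dot(cross(b1, v), w), dot(v, w))


def rigid_move(p, theta=0.7, shift=(3.0, -2.0, 5.0)):
    """Rotate a point about the x-axis by theta, then translate it."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (x + shift[0], y * c - z * s + shift[1], y * s + z * c + shift[2])


# Hypothetical coordinates for four consecutive backbone atoms.
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
moved = [rigid_move(p) for p in atoms]

angle_before = dihedral(*atoms)
angle_after = dihedral(*moved)
```

Because the angle depends only on relative atom positions, `angle_before` and `angle_after` agree to floating-point precision, so a model that works on angles does not need to learn or enforce this invariance separately.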
13.
Protein Eng Des Sel ; 36, 2023 Jan 21.
Article in English | MEDLINE | ID: mdl-37883472

ABSTRACT

Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.


Subjects
Protein Engineering , Protein Folding , Amino Acid Sequence
14.
ArXiv ; 2023 May 26.
Article in English | MEDLINE | ID: mdl-37292483

ABSTRACT

Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.

15.
Cell Syst ; 13(4): 274-285.e6, 2022 04 20.
Article in English | MEDLINE | ID: mdl-35120643

ABSTRACT

The degree to which evolution is predictable is a fundamental question in biology. Previous attempts to predict the evolution of protein sequences have been limited to specific proteins and to small changes, such as single-residue mutations. Here, we demonstrate that by using a protein language model to predict the local evolution within protein families, we recover a dynamic "vector field" of protein evolution that we call evolutionary velocity (evo-velocity). Evo-velocity generalizes to evolution over vastly different timescales, from viral proteins evolving over years to eukaryotic proteins evolving over geologic eons, and can predict the evolutionary dynamics of proteins that were not used to develop the original model. Evo-velocity also yields new evolutionary insights by predicting strategies of viral-host immune escape, resolving conflicting theories on the evolution of serpins, and revealing a key role of horizontal gene transfer in the evolution of eukaryotic glycolysis.


Subjects
Molecular Evolution , Language , Amino Acid Sequence , Mutation/genetics , Proteins/genetics
16.
Curr Opin Struct Biol ; 72: 145-152, 2022 02.
Article in English | MEDLINE | ID: mdl-34896756

ABSTRACT

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.


Subjects
Machine Learning , Protein Engineering , Amino Acid Sequence , Proteins
17.
J Endourol ; 36(2): 203-208, 2022 02.
Article in English | MEDLINE | ID: mdl-34663087

ABSTRACT

Objectives: To demonstrate feasibility of robot-assisted laparoscopic (RAL) ureteroureterostomy (UU) for benign distal ureteral strictures (DUS) in our robotic reconstruction series with long-term follow-up. Patients and Methods: In a retrospective review of our prospectively maintained RAL ureteral reconstruction database, we followed patients between June 2012 and February 2019 who underwent a UU for DUS. In addition to patient demographics, we recorded the etiology, stricture length, and recurrence rates. Recurrence was defined as findings of recurrent or persistent obstruction by postoperative mercaptoacetyltriglycine diuretic renal scan or the need for additional intervention with ureteral drainage or revisional surgery. Results: We identified 22 patients who underwent a RAL-UU for DUS of benign etiologies. Median age was 42 years (interquartile range [IQR] 39-57) and 20 of 22 patients (90.9%) were women. Median stricture length was 1.5 cm (IQR 1-2). Iatrogenic surgical injury was noted in 16 patients (73%). All ureteral reconstruction was performed using RAL. Postoperative imaging consisted of renal ultrasonography, diuretic renal scan, or cross-sectional radiology within 3 months of the index operation. Further imaging was dependent on clinical judgment. Twenty patients (90.9%) had success with median follow-up time of 54.6 months with two recurrences necessitating RAL ureteroneocystostomy (UNC). Conclusion: RAL-UU for DUS is technically viable and shows promising efficacy in properly selected patients. This technique may serve a niche for preserving the natural anatomical drainage of the bladder and ureter in addition to obviating the sequelae of vesicoureteral reflux as seen in UNC.


Subjects
Laparoscopy , Robotic Surgical Procedures , Robotics , Ureter , Ureteral Obstruction , Adult , Pathologic Constriction/complications , Pathologic Constriction/surgery , Cross-Sectional Studies , Female , Follow-Up Studies , Humans , Laparoscopy/methods , Retrospective Studies , Robotic Surgical Procedures/adverse effects , Treatment Outcome , Ureter/surgery , Ureteral Obstruction/etiology , Ureteral Obstruction/surgery
18.
Transl Androl Urol ; 10(5): 2171-2177, 2021 May.
Article in English | MEDLINE | ID: mdl-34159099

ABSTRACT

Since the advent of robotic surgery, its implementation in urology has been both wide and rapid. Particularly in extirpative surgery for prostate cancer, techniques in robotic-assisted radical prostatectomy have evolved, and continue to evolve, to maximize functional and oncologic outcomes. In this review, we briefly present a historical perspective of the evolution of various robotic techniques, allowing us to contextualize contemporary robotic approaches to radical prostatectomy.

19.
Curr Opin Chem Biol ; 65: 18-27, 2021 12.
Article in English | MEDLINE | ID: mdl-34051682

ABSTRACT

Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.


Subjects
Machine Learning , Protein Engineering , Amino Acid Sequence
20.
Curr Protoc ; 1(5): e113, 2021 May.
Article in English | MEDLINE | ID: mdl-33961736

ABSTRACT

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline


Subjects
Artificial Intelligence , Deep Learning , Machine Learning , Natural Language Processing , Proteins