ABSTRACT
The accurate prediction of binding between T cell receptors (TCR) and their cognate epitopes is key to understanding the adaptive immune response and developing immunotherapies. Current methods face two significant limitations: the shortage of comprehensive high-quality data and the bias introduced by the selection of the negative training data commonly used in the supervised learning approaches. We propose a method, Transformer-based Unsupervised Language model for Interacting Peptides and T cell receptors (TULIP), that addresses both limitations by leveraging incomplete data and unsupervised learning and using the transformer architecture of language models. Our model is flexible and integrates all possible data sources, regardless of their quality or completeness. We demonstrate the existence of a bias introduced by the sampling procedure used in previous supervised approaches, emphasizing the need for an unsupervised approach. TULIP recognizes the specific TCRs binding an epitope, performing well on unseen epitopes. Our model outperforms state-of-the-art models and offers a promising direction for the development of more accurate TCR epitope recognition models.
Subject(s)
Peptides , Receptors, Antigen, T-Cell , Receptors, Antigen, T-Cell/immunology , Receptors, Antigen, T-Cell/metabolism , Peptides/immunology , Peptides/chemistry , Peptides/metabolism , Humans , Epitopes/immunology , Protein Binding , Epitopes, T-Lymphocyte/immunology , Unsupervised Machine Learning
ABSTRACT
MOTIVATION: Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from natural language processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multidomain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. We also show in an example that the same procedure can be applied to interactions between distinct proteins. RESULTS: Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow autoregressive strategies. We also explore the possibility of fine-tuning pretrained large language models for the same task and of using AlphaFold 2 for assessing the quality of sampled sequences. AVAILABILITY AND IMPLEMENTATION: Data and code are available at https://github.com/barthelemymp/Domain2DomainProteinTranslation.
Subject(s)
Language , Proteins , Amino Acid Sequence , Proteins/chemistry , Protein Domains
ABSTRACT
Many different types of generative models for protein sequences have been proposed in the literature. Their uses include the prediction of mutational effects, protein design and the prediction of structural properties. Neural network (NN) architectures have shown strong performance, commonly attributed to their capacity to extract non-trivial higher-order interactions from the data. In this work, we analyze two different NN models and assess how close they are to simple pairwise distributions, which have been used in the past for similar problems. We present an approach for extracting pairwise models from more complex ones using an energy-based modeling framework. We show that for the tested models the extracted pairwise models can replicate the energies of the original models and are also close in performance in tasks like mutational effect prediction. In addition, we show that even simpler, factorized models often come close in performance to the original models.
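To make the notion of a factorized model concrete, the following is a minimal sketch, not the authors' implementation. It assumes a factorized model means independent per-site frequencies f_i(a) estimated from a multiple sequence alignment, with energy E(s) = -sum_i log f_i(s_i), and it scores a mutation by the energy difference between mutant and wild type. The function names, the alphabet, and the pseudocount value are illustrative choices that do not come from the abstract.

```python
import math
from collections import Counter

# Assumed alphabet: 20 amino acids plus the alignment gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def fit_site_frequencies(alignment, pseudocount=1.0):
    """Estimate per-column frequencies f_i(a) with an additive pseudocount."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(ALPHABET)
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        freqs.append(
            {a: (counts.get(a, 0) + pseudocount) / total for a in ALPHABET}
        )
    return freqs


def energy(seq, freqs):
    """E(s) = -sum_i log f_i(s_i); lower energy means higher probability."""
    return -sum(math.log(freqs[i][a]) for i, a in enumerate(seq))


def mutation_effect(wild_type, position, new_residue, freqs):
    """Energy change E(mutant) - E(wild type) for a single substitution."""
    mutant = wild_type[:position] + new_residue + wild_type[position + 1:]
    return energy(mutant, freqs) - energy(wild_type, freqs)


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACD", "GCD"]  # toy data, not from the paper
    site_freqs = fit_site_frequencies(toy_alignment)
    # A positive value means the mutant is less probable under the model.
    print(mutation_effect("ACD", 1, "W", site_freqs))
```

A pairwise model would add coupling terms between pairs of sites to this energy; the sketch omits them to keep the independent-site baseline visible.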
Subject(s)
Distillation , Neural Networks, Computer , Amino Acid Sequence , Proteins/chemistry
ABSTRACT
Understanding the extreme variation among bacterial genomes remains an unsolved challenge in evolutionary biology, despite long-standing debate about the relative importance of natural selection, mutation, and random drift. A potentially important confounding factor is the variation in mutation rates between lineages and over evolutionary history, which has been documented in several species. Mutation accumulation experiments have shown that hypermutability can erode genomes over short timescales. These results, however, were obtained under conditions of extremely weak selection, casting doubt on their general relevance. Here, we circumvent this limitation by analyzing genomes from mutator populations that arose during a long-term experiment with Escherichia coli, in which populations have been adaptively evolving for >50,000 generations. We develop an analytical framework to quantify the relative contributions of mutation and selection in shaping genomic characteristics, and we validate it using genomes evolved under regimes of high mutation rates with weak selection (mutation accumulation experiments) and low mutation rates with strong selection (natural isolates). Our results show that, despite sustained adaptive evolution in the long-term experiment, the signature of selection is much weaker than that of mutational biases in mutator genomes. This finding suggests that relatively brief periods of hypermutability can play an outsized role in shaping extant bacterial genomes. Overall, these results highlight the importance of genomic draft, in which strong linkage limits the ability of selection to purge deleterious mutations. These insights are also relevant to other biological systems evolving under strong linkage and high mutation rates, including viruses and cancer cells.
Subject(s)
Escherichia coli/genetics , Evolution, Molecular , Genome, Bacterial , Selection, Genetic , Escherichia coli/physiology , Mutation , Mutation Rate , Phylogeny
ABSTRACT
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and of how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Subject(s)
Physics/methods , Proteins/chemistry , Amino Acid Sequence , Molecular Sequence Annotation , Proteins/metabolism , Sequence Homology, Amino Acid
ABSTRACT
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we refer to in this paper as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pairwise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather atypical for independent samples drawn from a Potts model. Using a large test set of proteins, we show that the combined improvements along the three dimensions are as large as any reported to date.
Subject(s)
Computational Biology/methods , Proteins/chemistry , Sequence Analysis, Protein/methods , Models, Statistical , Sequence Alignment
ABSTRACT
Interaction between proteins is a fundamental mechanism that underlies virtually all biological processes. Many important interactions are conserved across a large variety of species. The need to maintain interaction leads to a high degree of co-evolution between residues in the interface between partner proteins. The inference of protein-protein interaction networks from the rapidly growing sequence databases is one of the most formidable tasks in systems biology today. We propose here a novel approach based on the Direct-Coupling Analysis of the co-evolution between inter-protein residue pairs. We use ribosomal and trp operon proteins as test cases: for the small and large ribosomal subunits, our approach predicts protein-interaction partners at true-positive rates of 70% and 90%, respectively, within the first 10 predictions, with areas under the ROC curves of 0.69 and 0.81, respectively, for all predictions. In the trp operon, it assigns the two largest interaction scores to the only two interactions experimentally known. On the level of residue interactions, we show that for the small and the large ribosomal subunit our approach predicts interacting residues in the system with true-positive rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show that the performance of our approach depends crucially on the size of the joint multiple sequence alignments, and we analyze how many sequences would be necessary for a perfect prediction if the sequences were sampled from the same model that we use for prediction. Given the performance of our approach on the test data, we speculate that it can be used to detect new interactions, especially in the light of the rapid growth of available sequence data.
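The two evaluation figures quoted above can be illustrated with a short sketch. It assumes each candidate protein pair has an interaction score and a binary label, reads "true-positive rate within the first k predictions" as the fraction of the k highest-scoring pairs that are true interactions, and computes the area under the ROC curve as the probability that a true pair outscores a false one. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def top_k_true_positive_rate(scores, labels, k):
    """Fraction of the k highest-scoring predictions that are true interactions."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        raise ValueError("need at least one prediction and k >= 1")
    return sum(1 for _, label in top if label) / len(top)


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 1/2."""
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy scores and labels for five candidate pairs (1 = known interaction).
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    toy_labels = [1, 0, 1, 0, 0]
    print(top_k_true_positive_rate(toy_scores, toy_labels, k=2))
    print(roc_auc(toy_scores, toy_labels))
```

The pairwise AUC computation is quadratic in the number of pairs, which is adequate for small test systems; a rank-based formula would be preferable for large prediction sets.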
Subject(s)
Escherichia coli/genetics , Evolution, Molecular , Operon/genetics , Protein Interaction Mapping , Ribosomes/metabolism , Tryptophan/genetics , Algorithms , Amino Acid Sequence , Animals , Biosynthetic Pathways , Cattle , Computer Simulation , Ribosome Subunits, Large/metabolism , Ribosome Subunits, Small/metabolism , Sequence Alignment , Tryptophan/biosynthesis
ABSTRACT
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to that achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.
Subject(s)
Models, Molecular , Proteins/chemistry , Proteins/metabolism , Bacteria/cytology , Multivariate Analysis , Normal Distribution , Protein Binding , Protein Conformation , Protein Structure, Tertiary , Sequence Alignment , Signal Transduction , Time Factors
ABSTRACT
Zinc finger domains are one of the most common structural motifs in eukaryotic cells, which employ the motif in some of their most important proteins (including TFIIIA, CTCF, and Zif268). These DNA binding proteins contain up to 37 zinc finger domains connected by flexible linker regions. They have been shown to be important organizers of the 3D structure of chromosomes and as such are called the master weavers of the genome. Using NMR and numerical simulations, much progress has been made during the past few decades in understanding their various functions and their ways of binding to the DNA, but a large knowledge gap remains to be filled. One problem of the existing theoretical models of zinc finger protein-DNA binding in this context is that they are aimed at describing specific binding. Furthermore, they either focus exclusively on the microscopic details or approach the problem without considering such details at all. We present the Flexible Linker Model, which aims explicitly at describing nonspecific binding. It takes into account the most important effects of flexible linkers and allows a qualitative investigation of the effects of these linkers on the nonspecific binding affinity of zinc finger proteins to DNA. Our results indicate that the flexible linkers increase the binding affinity by several orders of magnitude. Moreover, they show that the binding map for proteins with more than one domain presents interesting structures, which have been neither observed nor described before, and which can be interpreted to fit very well with existing theories of facilitated target location. The effect of the increased binding affinity is also in agreement with recent experiments that until now have lacked an explanation. We further explore the class of proteins with flexible linkers, which are unstructured until they bind. We have developed a methodology to characterize these flexible proteins.
Employing the concept of barcodes, we propose a similarity measure for comparing such flexible proteins. This measure is validated by a comparison between a geometric similarity measure and the topological similarity measure, which takes geometry as well as topology into account.