Search | VHL Regional Portal

Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone.

Pantolini, Lorenzo; Studer, Gabriel; Pereira, Joana; Durairaj, Janani; Tauriello, Gerardo; Schwede, Torsten.

Bioinformatics ; 40(1), 2024 01 02.

Article in English | MEDLINE | ID: mdl-38175775

ABSTRACT

MOTIVATION: Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a "semantic meaning" of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins.

RESULTS: In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods and other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight zone.

AVAILABILITY AND IMPLEMENTATION: The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.
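The general idea of pairing per-residue embeddings with dynamic programming can be illustrated with a minimal sketch. This is not the EBA implementation from the repositories above: the cosine similarity, the linear gap penalty of -0.5, the Needleman-Wunsch-style global alignment, and the function names are all assumptions chosen for illustration, and the tiny two-dimensional vectors stand in for real pLM embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(emb_a, emb_b):
    """Residue-by-residue similarity, used here in place of a substitution matrix."""
    return [[cosine(u, v) for v in emb_b] for u in emb_a]

def global_align(sim, gap=-0.5):
    """Global alignment over a similarity matrix (assumed linear gap penalty).

    Returns the alignment score and the list of aligned index pairs (i, j).
    """
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                score[i - 1][j] + gap,                    # gap in sequence B
                score[i][j - 1] + gap,                    # gap in sequence A
            )
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs

# Toy "embeddings": three residues versus two residues.
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[1.0, 0.0], [1.0, 1.0]]
total, aligned = global_align(similarity_matrix(emb_a, emb_b))
print(total, aligned)
```

In this toy case the first and third residues of the longer sequence pair with the two residues of the shorter one, and the middle residue is left opposite a gap. With real data, the input vectors would come from a pLM and the scoring scheme would need to be chosen with care.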

Subject(s)

Amino Acids , Proteins , Proteins/chemistry , Amino Acid Sequence , Sequence Alignment , Language

Discriminating physiological from non-physiological interfaces in structures of protein complexes: A community-wide study.

Schweke, Hugo; Xu, Qifang; Tauriello, Gerardo; Pantolini, Lorenzo; Schwede, Torsten; Cazals, Frédéric; Lhéritier, Alix; Fernandez-Recio, Juan; Rodríguez-Lumbreras, Luis Angel; Schueler-Furman, Ora; Varga, Julia K; Jiménez-García, Brian; Réau, Manon F; Bonvin, Alexandre M J J; Savojardo, Castrense; Martelli, Pier-Luigi; Casadio, Rita; Tubiana, Jérôme; Wolfson, Haim J; Oliva, Romina; Barradas-Bautista, Didier; Ricciardelli, Tiziana; Cavallo, Luigi; Venclovas, Ceslovas; Olechnovic, Kliment; Guerois, Raphael; Andreani, Jessica; Martin, Juliette; Wang, Xiao; Terashi, Genki; Sarkar, Daipayan; Christoffer, Charles; Aderinwale, Tunde; Verburgt, Jacob; Kihara, Daisuke; Marchand, Anthony; Correia, Bruno E; Duan, Rui; Qiu, Liming; Xu, Xianjin; Zhang, Shuang; Zou, Xiaoqin; Dey, Sucharita; Dunbrack, Roland L; Levy, Emmanuel D; Wodak, Shoshana J.

Proteomics ; 23(17): e2200323, 2023 09.

Article in English | MEDLINE | ID: mdl-37365936

ABSTRACT

Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community-wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non-physiological complexes. The non-physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein-protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non-physiological complexes. Two combined predictors were then created: a simple consensus score generated using the best-performing score of each of the 13 groups, and a cross-validated Random Forest (RF) classifier. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94, respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines recalled the physiological dimers with significantly higher accuracy than the non-physiological set, lending support to the reliability of our benchmark dataset annotations. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.
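How a consensus of several scoring functions can be evaluated by ROC AUC is sketched below. The abstract does not state how the consensus was computed, so the mean of min-max normalized scores used here is an assumption, as are the function names and the four made-up complexes; the AUC is computed with the standard pairwise-ranking definition.

```python
def min_max(values):
    """Rescale one scoring function's outputs to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

def consensus(score_lists):
    """Assumed consensus: mean of the normalized scores for each complex."""
    normalized = [min_max(scores) for scores in score_lists]
    return [sum(column) / len(column) for column in zip(*normalized)]

def roc_auc(scores, labels):
    """Probability that a positive outranks a negative (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = physiological, 0 = non-physiological.
labels = [1, 1, 0, 0]
scorer_a = [0.9, 0.4, 0.5, 0.1]
scorer_b = [3.0, 8.0, 1.0, 5.0]

print(roc_auc(scorer_a, labels))                         # 0.75
print(roc_auc(scorer_b, labels))                         # 0.75
print(roc_auc(consensus([scorer_a, scorer_b]), labels))  # 1.0
```

In this toy example each scorer misranks a different complex, so averaging them separates the two classes perfectly. Real scoring functions are often correlated, and the gain from combining them is correspondingly smaller.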

Subject(s)

Proteins , Reproducibility of Results , Proteins/metabolism , Protein Binding

Personalized logical models to investigate cancer response to BRAF treatments in melanomas and colorectal cancers.

Béal, Jonas; Pantolini, Lorenzo; Noël, Vincent; Barillot, Emmanuel; Calzone, Laurence.

PLoS Comput Biol ; 17(1): e1007900, 2021 01.

Article in English | MEDLINE | ID: mdl-33507915

ABSTRACT

The study of response to cancer treatments has benefited greatly from the contribution of different omics data but their interpretation is sometimes difficult. Some mathematical models based on prior biological knowledge of signaling pathways facilitate this interpretation but often require fitting of their parameters using perturbation data. We propose a more qualitative mechanistic approach, based on logical formalism and on the sole mapping and interpretation of omics data, and able to recover differences in sensitivity to gene inhibition without model training. This approach is showcased by the study of BRAF inhibition in patients with melanomas and colorectal cancers who experience significant differences in sensitivity despite similar omics profiles. We first gather information from literature and build a logical model summarizing the regulatory network of the mitogen-activated protein kinase (MAPK) pathway surrounding BRAF, with factors involved in the BRAF inhibition resistance mechanisms. The relevance of this model is verified by automatically assessing that it qualitatively reproduces response or resistance behaviors identified in the literature. Data from over 100 melanoma and colorectal cancer cell lines are then used to validate the model's ability to explain differences in sensitivity. This generic model is transformed into personalized cell line-specific logical models by integrating the omics information of the cell lines as constraints of the model. The use of mutations alone allows personalized models to correlate significantly with experimental sensitivities to BRAF inhibition, both from drug and CRISPR targeting, and even better with the joint use of mutations and RNA, supporting multi-omics mechanistic models. A comparison of these untrained models with learning approaches highlights similarities in interpretation and complementarity depending on the size of the datasets. This parsimonious pipeline, which can easily be extended to other biological questions, makes it possible to explore the mechanistic causes of the response to treatment, on an individualized basis.
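The step of turning a generic logical model into a cell line-specific one by imposing mutations as constraints can be illustrated with a toy Boolean network. The rules below are a deliberately simplified, hypothetical fragment of MAPK signaling and not the published model, and the deterministic synchronous update is an assumption made for brevity; all names are illustrative.

```python
# Hypothetical logical rules: each node's next value as a function of the state.
RULES = {
    "RAS": lambda s: s["RAS"],               # input node, keeps its value
    "BRAF": lambda s: s["RAS"],
    "CRAF": lambda s: s["RAS"],
    "MEK": lambda s: s["BRAF"] or s["CRAF"],
    "ERK": lambda s: s["MEK"],
    "Proliferation": lambda s: s["ERK"],
}

def simulate(fixed, max_steps=20):
    """Synchronously update all nodes until a fixed point is reached.

    `fixed` maps node names to values that are held constant; this is how
    mutations and drug inhibition are imposed as constraints in this sketch.
    """
    state = {node: False for node in RULES}
    state.update(fixed)
    for _ in range(max_steps):
        nxt = {
            node: fixed[node] if node in fixed else bool(rule(state))
            for node, rule in RULES.items()
        }
        if nxt == state:
            return state
        state = nxt
    raise RuntimeError("no fixed point reached")

def proliferates(activating_mutations, braf_inhibited):
    """Personalize the generic model, then read out the phenotype node."""
    fixed = {gene: True for gene in activating_mutations}
    if braf_inhibited:
        fixed["BRAF"] = False  # the inhibitor overrides an activating mutation
    return simulate(fixed)["Proliferation"]

for name, mutations in [("BRAF-mutant line", ["BRAF"]), ("RAS-mutant line", ["RAS"])]:
    print(name, proliferates(mutations, False), proliferates(mutations, True))
```

In this sketch the BRAF-mutant line stops proliferating under BRAF inhibition, while the RAS-mutant line keeps signaling through CRAF and is predicted to be resistant. No parameters are fitted, which mirrors the untrained character of the approach described in the abstract.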

Subject(s)

Colorectal Neoplasms , Melanoma , Patient-Specific Modeling , Proto-Oncogene Proteins B-raf/antagonists & inhibitors , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use , CRISPR-Cas Systems , Cell Line, Tumor , Colorectal Neoplasms/genetics , Colorectal Neoplasms/metabolism , Colorectal Neoplasms/therapy , Computational Biology , Genetic Therapy , Humans , Machine Learning , Melanoma/genetics , Melanoma/metabolism , Melanoma/therapy , Signal Transduction/drug effects , Transcriptome/drug effects
