Language models can identify enzymatic binding sites in protein sequences.

Nana Teukam, Yves Gaetan; Kwate Dassi, Loïc; Manica, Matteo; Probst, Daniel; Schwaller, Philippe; Laino, Teodoro

Affiliation

Nana Teukam YG; IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.
Kwate Dassi L; IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.
Manica M; IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.
Probst D; IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.
Schwaller P; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Switzerland.
Laino T; IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.

Comput Struct Biotechnol J; 23: 1929-1937, 2024 Dec.

Article in En | MEDLINE | ID: mdl-38736695

ABSTRACT

Recent advances in language modeling have had a tremendous impact on how we handle sequential data in science. Language architectures have emerged as a hotbed of innovation and creativity in natural language processing over the last decade, and have since gained prominence in modeling proteins and chemical processes, elucidating structural relationships from textual/sequential data. Surprisingly, some of these relationships refer to three-dimensional structural features, raising important questions about the dimensionality of the information encoded within sequential data. Here, we demonstrate that the unsupervised application of a language model architecture to a language representation of bio-catalyzed chemical reactions can capture the signal underlying the atomic interactions between substrate and binding site. This allows us to identify the three-dimensional binding site position in unknown protein sequences. The language representation comprises a reaction simplified molecular-input line-entry system (SMILES) string for the substrate and products, and amino acid sequence information for the enzyme. This approach can recover, with no supervision, 52.13% of the binding site when considering co-crystallized substrate-enzyme structures as ground truth, vastly outperforming other attention-based models.
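
The abstract describes two ideas that a small example may help make concrete: a single text representation that combines reaction SMILES with the enzyme sequence, and a recovery score measured against known binding-site residues. The Python sketch below is illustrative only and is not the authors' implementation. The `substrates|sequence>>products` layout, the function names, the top-k selection over per-residue scores, and the overlap-based definition of recovery are all assumptions made for this example; the abstract does not specify them.

```python
from typing import Sequence, Set


def build_input(substrates: Sequence[str], products: Sequence[str],
                enzyme_sequence: str) -> str:
    """Join substrate SMILES, enzyme sequence, and product SMILES into one string.

    The "substrates|sequence>>products" layout is an assumption for illustration.
    """
    left = ".".join(substrates)    # "." separates molecules in SMILES
    right = ".".join(products)
    return f"{left}|{enzyme_sequence}>>{right}"


def top_k_residues(scores: Sequence[float], k: int) -> Set[int]:
    """Return the 0-based indices of the k residues with the highest scores.

    `scores` stands in for any per-residue signal, such as attention weights
    (hypothetical here; how they are obtained is not covered).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def recovered_fraction(predicted: Set[int], ground_truth: Set[int]) -> float:
    """Fraction of ground-truth binding-site residues that were predicted."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return len(predicted & ground_truth) / len(ground_truth)


# Toy values, invented for illustration only.
text = build_input(["CCO"], ["CC=O"], "MKT")
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7]
truth = {1, 3, 4}
predicted = top_k_residues(scores, k=3)
fraction = recovered_fraction(predicted, truth)
```

With these toy values, `predicted` is `{1, 3, 5}` and two of the three ground-truth residues are recovered. Under this assumed definition, a reported figure such as 52.13% would correspond to this kind of ratio computed against residues taken from co-crystallized structures.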

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article