Machine learning-based prediction of proteins' architecture using sequences of amino acids and structural alphabets.

Abbass, Jad; Parisi, Charles

Affiliation

Abbass J; School of Computer Science and Mathematics, Kingston University, London, UK.
Parisi C; School of Computer Science and Mathematics, Kingston University, London, UK.

J Biomol Struct Dyn ; : 1-16, 2024 Mar 20.

Article in En | MEDLINE | ID: mdl-38505995

ABSTRACT

In addition to the growth of protein structures generated through wet laboratory experiments and deposited in the PDB repository, AlphaFold predictions have significantly contributed to the creation of a much larger database of protein structures. Annotating such a vast number of structures has become an increasingly challenging task. CATH is widely recognized as one of the most common platforms for addressing this challenge, as it classifies proteins based on their structural and evolutionary relationships, offering the scientific community an invaluable resource for uncovering various properties, including functional annotations. Because CATH annotation involves, to some extent, human intervention, keeping up with the classification of the rapidly expanding repositories of protein structures has become exceedingly difficult. Therefore, there is a pressing need for a fully automated approach. In addition, the abundance of protein sequences stemming from next-generation sequencing technologies, which lack structural annotations, presents a further challenge to the scientific community. Consequently, 'pre-annotating' protein sequences with structural features at a high level of precision could prove highly advantageous. In this paper, after a thorough investigation, we introduce a novel machine-learning model capable of classifying any protein domain, whether it has a known structure or not, into one of the 40 main CATH Architectures. We achieve an F1 score of 0.92 using only the amino acid sequence and a score of 0.94 using both the sequence of amino acids and the sequence of structural alphabets. Communicated by Ramaswamy H. Sarma.

Keywords

CATH; SCOP; k-mer; machine learning; protein blocks; protein's architecture; protein's structure; structural alphabets
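
The abstract and keywords mention k-mers, protein blocks, and an F1 score, but give no implementation details. The following is only an illustrative sketch of how such features and such a metric might be computed, not the authors' method. All function names are hypothetical, the choice of k, the concatenation of the two feature vectors, and the macro averaging of F1 are assumptions, and the 16-letter protein-block alphabet (a to p) is assumed to be the structural alphabet in question.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
PROTEIN_BLOCKS = "abcdefghijklmnop"   # assumed 16-letter structural alphabet


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence (hypothetical helper)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, alphabet, k):
    """Fixed-length vector of k-mer frequencies over the given alphabet.

    k-mers containing symbols outside the alphabet get no slot in the
    vector, but they still count toward the normalising total.
    """
    counts = kmer_counts(sequence, k)
    total = max(sum(counts.values()), 1)  # avoid division by zero
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts.get(kmer, 0) / total for kmer in vocabulary]


def combined_features(aa_sequence, pb_sequence, k=2):
    """Concatenate amino-acid and protein-block k-mer vectors (assumption)."""
    return (kmer_vector(aa_sequence, AMINO_ACIDS, k)
            + kmer_vector(pb_sequence, PROTEIN_BLOCKS, k))


def f1_for_label(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging scheme is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)
```

With k = 2, the amino-acid vector has 400 entries and the protein-block vector has 256, so `combined_features` returns 656 values per domain. Any standard classifier could then be trained on such vectors; which model the paper actually uses is not stated in the abstract.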

Full text: 1 Database: MEDLINE Language: En Publication year: 2024 Document type: Article
