ProGen2: Exploring the boundaries of protein language models.

Nijkamp, Erik; Ruffolo, Jeffrey A; Weinstein, Eli N; Naik, Nikhil; Madani, Ali

Affiliation

Nijkamp E; Salesforce Research, Palo Alto, CA, USA.
Ruffolo JA; Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA; Profluent Bio, Berkeley, CA, USA.
Weinstein EN; Data Science Institute, Columbia University, New York, NY, USA.
Naik N; Salesforce Research, Palo Alto, CA, USA.
Madani A; Salesforce Research, Palo Alto, CA, USA; Profluent Bio, Berkeley, CA, USA. Electronic address: ali@profluent.bio.

Cell Syst; 14(11): 968-978.e3, 2023 Nov 15.

Article in English | MEDLINE | ID: mdl-37909046

ABSTRACT

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.

Subjects

Artificial Intelligence; Proteins; Proteins/genetics; Amino Acid Sequence; Language; Databases, Factual

Keywords

fitness prediction; language modeling; protein design

Full text: 1 | Collections: 01-internacional | Database: MEDLINE | Main subject: Artificial Intelligence / Proteins | Language: English | Publication year: 2023 | Document type: Article
