GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.
Zvyagin, Maxim; Brace, Alexander; Hippe, Kyle; Deng, Yuntian; Zhang, Bin; Bohorquez, Cindy Orozco; Clyde, Austin; Kale, Bharat; Perez-Rivera, Danilo; Ma, Heng; Mann, Carla M; Irvin, Michael; Pauloski, J Gregory; Ward, Logan; Hayot-Sasson, Valerie; Emani, Murali; Foreman, Sam; Xie, Zhen; Lin, Diangen; Shukla, Maulik; Nie, Weili; Romero, Josh; Dallago, Christian; Vahdat, Arash; Xiao, Chaowei; Gibbs, Thomas; Foster, Ian; Davis, James J; Papka, Michael E; Brettin, Thomas; Stevens, Rick; Anandkumar, Anima; Vishwanath, Venkatram; Ramanathan, Arvind.
Affiliation
  • Zvyagin M; Argonne National Laboratory.
  • Brace A; Argonne National Laboratory.
  • Hippe K; University of Chicago.
  • Deng Y; Argonne National Laboratory.
  • Zhang B; NVIDIA Inc.
  • Bohorquez CO; Harvard University.
  • Clyde A; Cerebras Inc.
  • Kale B; Cerebras Inc.
  • Perez-Rivera D; Argonne National Laboratory.
  • Ma H; University of Chicago.
  • Mann CM; Northern Illinois University.
  • Irvin M; Argonne National Laboratory.
  • Pauloski JG; New York University.
  • Ward L; Argonne National Laboratory.
  • Hayot-Sasson V; Argonne National Laboratory.
  • Emani M; University of Chicago.
  • Foreman S; Argonne National Laboratory.
  • Xie Z; University of Chicago.
  • Lin D; Argonne National Laboratory.
  • Shukla M; Argonne National Laboratory.
  • Nie W; University of Chicago.
  • Romero J; Argonne National Laboratory.
  • Dallago C; Argonne National Laboratory.
  • Vahdat A; Argonne National Laboratory.
  • Xiao C; Argonne National Laboratory.
  • Gibbs T; University of Chicago.
  • Foster I; Argonne National Laboratory.
  • Davis JJ; University of Chicago.
  • Papka ME; NVIDIA Inc.
  • Brettin T; NVIDIA Inc.
  • Stevens R; NVIDIA Inc.
  • Anandkumar A; Technical University of Munich.
  • Vishwanath V; NVIDIA Inc.
  • Ramanathan A; Arizona State University.
bioRxiv ; 2022 Nov 23.
Article in English | MEDLINE | ID: mdl-36451881
ABSTRACT
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
Full text: 1 Collection: 01-internacional Database: MEDLINE Study type: Prognostic_studies Language: En Journal: BioRxiv Year: 2022 Document type: Article
