Evaluation and optimization of sequence-based gene regulatory deep learning models.
Rafi, Abdul Muntakim; Nogina, Daria; Penzar, Dmitry; Lee, Dohoon; Lee, Danyeong; Kim, Nayeon; Kim, Sangyeup; Kim, Dohyeon; Shin, Yeojin; Kwak, Il-Youp; Meshcheryakov, Georgy; Lando, Andrey; Zinkevich, Arsenii; Kim, Byeong-Chan; Lee, Juhyun; Kang, Taein; Vaishnav, Eeshit Dhaval; Yadollahpour, Payman; Kim, Sun; Albrecht, Jake; Regev, Aviv; Gong, Wuming; Kulakovskiy, Ivan V; Meyer, Pablo; de Boer, Carl.
  • Rafi AM; University of British Columbia, Vancouver, BC, Canada.
  • Nogina D; Lomonosov Moscow State University, Moscow, Russia.
  • Penzar D; Lomonosov Moscow State University, Moscow, Russia.
  • Lee D; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.
  • Lee D; Seoul National University, Seoul, South Korea.
  • Kim N; Seoul National University, Seoul, South Korea.
  • Kim S; Seoul National University, Seoul, South Korea.
  • Kim D; Seoul National University, Seoul, South Korea.
  • Shin Y; Seoul National University, Seoul, South Korea.
  • Kwak IY; Seoul National University, Seoul, South Korea.
  • Meshcheryakov G; Chung-Ang University, Seoul, South Korea.
  • Lando A; Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia.
  • Zinkevich A; Yandex N. V., Moscow, Russia.
  • Kim BC; Lomonosov Moscow State University, Moscow, Russia.
  • Lee J; Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia.
  • Kang T; Chung-Ang University, Seoul, South Korea.
  • Vaishnav ED; Chung-Ang University, Seoul, South Korea.
  • Yadollahpour P; Chung-Ang University, Seoul, South Korea.
  • Kim S; Broad Institute of MIT and Harvard, Massachusetts, United States.
  • Regev A; Seoul National University, Seoul, South Korea.
  • Gong W; Sage Bionetworks.
  • Kulakovskiy IV; Broad Institute of MIT and Harvard, Massachusetts, United States.
  • Meyer P; Genentech, South San Francisco, CA, USA.
  • de Boer C; University of Minnesota, Minneapolis, United States.
bioRxiv; 2024 Feb 17.
Article in English | MEDLINE | ID: mdl-38405704
ABSTRACT
Neural networks have emerged as immensely powerful tools for predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. While some benchmarks produced similar results across the top-performing models, others differed substantially. All top-performing models used neural networks but diverged in architectures and novel training strategies tailored to genomics sequence data. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide any given model into logically equivalent building blocks. We tested all possible combinations for the top three models and observed performance improvements for each. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets. Overall, we demonstrate that high-quality gold-standard genomics datasets can drive significant progress in model development.
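The idea of dividing models into interchangeable building blocks and testing every combination can be illustrated with a minimal sketch. This is not the authors' implementation: the slot names and team labels below are hypothetical placeholders chosen for illustration, and the abstract does not state how many blocks the Prix Fixe framework defines.

```python
from itertools import product

# Hypothetical slot names and team labels (assumptions for illustration only;
# they are not taken from the abstract).
SLOTS = ("first_layers", "core_layers", "final_layers")
TEAMS = ("team_1", "team_2", "team_3")


def enumerate_combinations(slots=SLOTS, teams=TEAMS):
    """Yield one dict per combination, mapping each slot to the team whose block fills it."""
    for choice in product(teams, repeat=len(slots)):
        yield dict(zip(slots, choice))


combos = list(enumerate_combinations())
print(len(combos))  # 3 teams ** 3 slots = 27
```

Under these assumptions, each yielded dictionary describes one hybrid model to train and evaluate; the count grows as the number of teams raised to the number of slots, which is why an exhaustive sweep is only practical for a small number of top models.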