Results 1 - 2 of 2
1.
BMC Bioinformatics; 25(1): 81, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38378442

ABSTRACT

The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines expression values. We developed an end-to-end transformer encoder architecture named Proformer to predict expression values from DNA sequences. Proformer used a Macaron-like Transformer encoder architecture, where two half-step feed-forward (FFN) layers were placed at the beginning and the end of each encoder block, and a separable 1D convolution layer was inserted after the first FFN layer and in front of the multi-head attention layer. The sliding k-mers from one-hot encoded sequences were mapped onto a continuous embedding, combined with the learned positional embedding and strand embedding (forward strand vs. reverse-complemented strand) as the sequence input. Moreover, Proformer introduced multiple expression heads with mask filling to prevent the transformer models from collapsing when training on a relatively small amount of data. We empirically determined that this design performed significantly better than conventional designs, such as using a global pooling layer as the output layer for the regression task. These analyses support the notion that Proformer provides a novel method of learning and enhances our understanding of how cis-regulatory sequences determine expression values. (A hedged code sketch of the encoder-block layout described here follows this record.)


Subject(s)
Electric Power Supplies, Learning, Promoter Regions, Genetic
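
The abstract above describes the Proformer encoder block in prose only: half-step FFN layers at either end of each block and a separable 1D convolution placed before the multi-head attention. Below is a minimal PyTorch sketch of such a Macaron-like block; the embedding dimension, head count, kernel size, normalization placement, and residual scaling are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a Macaron-like encoder block (assumed hyperparameters):
# half-step FFN -> separable 1D conv -> multi-head attention -> half-step FFN.
import torch
import torch.nn as nn


class SeparableConv1d(nn.Module):
    """Depthwise + pointwise 1D convolution over the sequence axis."""

    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        x = x.transpose(1, 2)                  # Conv1d expects (batch, dim, seq_len)
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)


class MacaronEncoderBlock(nn.Module):
    """One encoder block: FFN/2 -> separable conv -> self-attention -> FFN/2."""

    def __init__(self, dim=256, heads=8, ffn_mult=4, dropout=0.1):
        super().__init__()
        self.ffn_in = nn.Sequential(nn.LayerNorm(dim),
                                    nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                                    nn.Linear(ffn_mult * dim, dim))
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = SeparableConv1d(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.ffn_out = nn.Sequential(nn.LayerNorm(dim),
                                     nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                                     nn.Linear(ffn_mult * dim, dim))

    def forward(self, x):
        x = x + 0.5 * self.ffn_in(x)            # first half-step FFN
        x = x + self.conv(self.conv_norm(x))    # separable 1D convolution
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h)[0]           # multi-head self-attention
        x = x + 0.5 * self.ffn_out(x)           # second half-step FFN
        return x


if __name__ == "__main__":
    block = MacaronEncoderBlock()
    tokens = torch.randn(2, 110, 256)           # (batch, k-mer positions, embed dim)
    print(block(tokens).shape)                   # torch.Size([2, 110, 256])
```

In this sketch the k-mer, positional, and strand embeddings would be summed into the 256-dimensional token representation before the first block, and a regression head (for example, the multiple expression heads with mask filling mentioned in the abstract) would follow the final block.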
2.
bioRxiv; 2024 Feb 17.
Article in English | MEDLINE | ID: mdl-38405704

ABSTRACT

Neural networks have emerged as immensely powerful tools for predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. While some benchmarks produced similar results across the top-performing models, others differed substantially. All top-performing models used neural networks but diverged in their architectures and in novel training strategies tailored to genomic sequence data. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide any given model into logically equivalent building blocks. We tested all possible combinations of blocks for the top three models and observed performance improvements for each. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets. Overall, we demonstrate that high-quality, gold-standard genomics datasets can drive significant progress in model development.
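
As a concrete illustration of the Prix Fixe idea described in the abstract above (decomposing models into logically equivalent building blocks and evaluating every combination), the toy snippet below mixes and matches embedder, trunk, and head blocks with itertools.product. The three-way split, the block names, and the layer choices are hypothetical and are not the framework's actual API.

```python
# Toy mix-and-match of model building blocks, in the spirit of Prix Fixe.
# All block names and layer choices here are illustrative assumptions.
import itertools
import torch
import torch.nn as nn


def build_model(embedder, trunk, head):
    """Assemble one candidate model from interchangeable blocks."""
    return nn.Sequential(embedder(), trunk(), head())


# Hypothetical blocks, as if contributed by different challenge solutions.
embedders = {"onehot_conv": lambda: nn.Conv1d(4, 64, 7, padding=3),
             "onehot_linear": lambda: nn.Conv1d(4, 64, 1)}
trunks = {"dilated_cnn": lambda: nn.Conv1d(64, 64, 3, padding=2, dilation=2),
          "plain_cnn": lambda: nn.Conv1d(64, 64, 3, padding=1)}
heads = {"pooled_linear": lambda: nn.Sequential(nn.AdaptiveAvgPool1d(1),
                                                nn.Flatten(),
                                                nn.Linear(64, 1))}

# Evaluate every block combination on a dummy batch of one-hot promoters.
x = torch.randn(8, 4, 110)   # (batch, one-hot channels A/C/G/T, promoter length)
for names in itertools.product(embedders, trunks, heads):
    model = build_model(embedders[names[0]], trunks[names[1]], heads[names[2]])
    print(" + ".join(names), "-> output shape", tuple(model(x).shape))
```

In an actual benchmarking run, each assembled combination would be trained and scored on the same dataset so that the contribution of each block can be compared directly.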
