Search | BVS Bolivia

DeepReg: a deep learning hybrid model for predicting transcription factors in eukaryotic and prokaryotic genomes.

Ledesma-Dominguez, Leonardo; Carbajal-Degante, Erik; Moreno-Hagelsieb, Gabriel; Perez-Rueda, Ernesto.

Sci Rep; 14(1): 9155, 2024 Apr 21.

Article in English | MEDLINE | ID: mdl-38644393

ABSTRACT

Deep learning models (DLMs) have gained importance in predicting, detecting, translating, and classifying a wide range of inputs. In bioinformatics, DLMs have been used to predict protein structures, transcription factor-binding sites, and promoters. In this work, we propose a hybrid model, named the Deep Regulation (DeepReg) model, to identify transcription factors (TFs) among prokaryotic and eukaryotic protein sequences. The model combines two architectures: a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network. DeepReg reached a precision of 0.99, a recall of 0.97, and an F1-score of 0.98. The quality of our predictions, the bias-variance trade-off, and the characterization of new TF predictions were evaluated and compared against those produced by DeepTFactor, as well as against experimental data from three model organisms. Predictions from our DLM tended to exhibit less variance and bias than those from DeepTFactor, thus increasing reliability and decreasing overfitting.
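As a quick consistency check on the metrics reported above, the F1-score is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the two published values; the function name `f1_score` is ours and is not taken from the DeepReg code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Both metrics are zero, so F1 is defined as zero here.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for DeepReg.
f1 = f1_score(0.99, 0.97)
print(round(f1, 2))  # prints 0.98
```

The result rounds to 0.98, which agrees with the F1-score stated in the abstract.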

Subjects

Deep Learning; Transcription Factors; Transcription Factors/genetics; Transcription Factors/metabolism; Computational Biology/methods; Prokaryotic Cells/metabolism; Neural Networks, Computer; Eukaryota/genetics; Genome; Eukaryotic Cells/metabolism; Binding Sites

CDBProm: the Comprehensive Directory of Bacterial Promoters.

Martinez, Gustavo Sganzerla; Perez-Rueda, Ernesto; Kumar, Anuj; Dutt, Mansi; Maya, Cinthia Rodríguez; Ledesma-Dominguez, Leonardo; Casa, Pedro Lenz; Kumar, Aditya; de Avila E Silva, Scheila; Kelvin, David J.

NAR Genom Bioinform; 6(1): lqae018, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38385146

ABSTRACT

The decreasing cost of whole-genome sequencing has produced high volumes of genomic information that require annotation. The experimental identification of promoter sequences, which are pivotal for regulating gene expression, is a laborious and cost-prohibitive task. To expedite this, we introduce the Comprehensive Directory of Bacterial Promoters (CDBProm), a directory of in silico predicted bacterial promoter sequences. We first found that an Extreme Gradient Boosting (XGBoost) classifier distinguishes promoters from random downstream regions with an accuracy of 87%. To capture distinctive promoter signals, we generated a second XGBoost classifier trained on the instances misclassified by our first classifier. The CDBProm predictor was then fed over 55 million upstream regions from more than 6000 bacterial genomes. Once a potential promoter sequence is found in an upstream region, it is mapped to the genomic data of the organism, linking the predicted promoter with its coding DNA sequence and identifying the function of the gene regulated by the promoter. The collection of bacterial promoters available in CDBProm enables the quantitative analysis of a plethora of bacterial promoters. Our collection of over 24 million promoters is publicly available at https://aw.iimas.unam.mx/cdbprom/.
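The two-stage idea in the abstract (train one classifier, then train a second one on the instances the first got wrong) can be illustrated with a minimal Python sketch. This is not the CDBProm pipeline: a one-feature decision stump stands in for XGBoost so the example runs with the standard library only, the data are made-up toy values, and all names (`train_stump`, `misclassified`, `data`) are hypothetical. The abstract does not say how the two classifiers are combined at prediction time, so the sketch stops at training the second model.

```python
from typing import Callable, List, Tuple

# (feature value, label); label 1 = promoter, 0 = non-promoter. Toy data only.
Sample = Tuple[float, int]
Model = Callable[[float], int]


def train_stump(samples: List[Sample]) -> Model:
    """Pick the threshold and direction with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        for above_is_positive in (True, False):
            errors = sum(
                ((x >= threshold) == above_is_positive) != bool(y)
                for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above_is_positive)
    _, threshold, above_is_positive = best
    return lambda x: int((x >= threshold) == above_is_positive)


def misclassified(model: Model, samples: List[Sample]) -> List[Sample]:
    """Return the samples whose predicted label differs from the true label."""
    return [(x, y) for x, y in samples if model(x) != y]


data: List[Sample] = [
    (0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0),
    (0.6, 1), (0.7, 1), (0.8, 0), (0.9, 1),
]

first = train_stump(data)            # stage 1: trained on all instances
hard = misclassified(first, data)    # instances stage 1 gets wrong
second = train_stump(hard)           # stage 2: trained only on those instances

print("misclassified by first model:", hard)
print("second model on hard cases:", [second(x) for x, _ in hard])
```

In a real setting the second model would be evaluated on held-out data, since a classifier trained only on the first model's errors can easily overfit them.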
