Search | VHL Infectious and Parasitic Diseases

1.

Enhancer target prediction: state-of-the-art approaches and future prospects.

Umarov, Ramzan; Hon, Chung-Chau.

Biochem Soc Trans ; 51(5): 1975-1988, 2023 10 31.

Article in English | MEDLINE | ID: mdl-37830459

ABSTRACT

Enhancers are genomic regions that regulate gene transcription and are located far away from the transcription start sites of their target genes. Enhancers are highly enriched in disease-associated variants and thus deciphering the interactions between enhancers and genes is crucial to understanding the molecular basis of genetic predispositions to diseases. Experimental validations of enhancer targets can be laborious. Computational methods have thus emerged as a valuable alternative for studying enhancer-gene interactions. A variety of computational methods have been developed to predict enhancer targets by incorporating genomic features (e.g. conservation, distance, and sequence), epigenomic features (e.g. histone marks and chromatin contacts) and activity measurements (e.g. covariations of enhancer activity and gene expression). With the recent advances in genome perturbation and chromatin conformation capture technologies, data on experimentally validated enhancer targets are becoming available for supervised training of these methods and evaluation of their performance. In this review, we categorize enhancer target prediction methods based on their rationales and approaches. Then we discuss their merits and limitations and highlight the future directions for enhancer target prediction.

Subjects

Enhancer Elements, Genetic; Histones; Histones/metabolism; Chromatin; Genomics/methods; Epigenomics

2.

DeepCellState: An autoencoder-based framework for predicting cell type specific transcriptional states induced by drug treatment.

Umarov, Ramzan; Li, Yu; Arner, Erik.

PLoS Comput Biol ; 17(10): e1009465, 2021 10.

Article in English | MEDLINE | ID: mdl-34610009

ABSTRACT

Drug treatment induces cell type specific transcriptional programs, and as the number of combinations of drugs and cell types grows, the cost for exhaustive screens measuring the transcriptional drug response becomes intractable. We developed DeepCellState, a deep learning autoencoder-based framework, for predicting the induced transcriptional state in a cell type after drug treatment, based on the drug response in another cell type. Training the method on a large collection of transcriptional drug perturbation profiles, prediction accuracy improves significantly over baseline and alternative deep learning approaches when applying the method to two cell types, with improved accuracy when generalizing the framework to additional cell types. Treatments with drugs or whole drug families not seen during training are predicted with similar accuracy, and the same framework can be used for predicting the results from other interventions, such as gene knock-downs. Finally, analysis of the trained model shows that the internal representation is able to learn regulatory relationships between genes in a fully data-driven manner.

Subjects

Computational Biology/methods; Deep Learning; Transcriptome/drug effects; Transcriptome/genetics; Antineoplastic Agents/pharmacology; Gene Knockdown Techniques; Humans; MCF-7 Cells; PC-3 Cells; Unsupervised Machine Learning

3.

ReFeaFi: Genome-wide prediction of regulatory elements driving transcription initiation.

Umarov, Ramzan; Li, Yu; Arakawa, Takahiro; Takizawa, Satoshi; Gao, Xin; Arner, Erik.

PLoS Comput Biol ; 17(9): e1009376, 2021 09.

Article in English | MEDLINE | ID: mdl-34491989

ABSTRACT

Regulatory elements control gene expression through transcription initiation (promoters) and by enhancing transcription at distant regions (enhancers). Accurate identification of regulatory elements is fundamental for annotating genomes and understanding gene expression patterns. While there are many attempts to develop computational promoter and enhancer identification methods, reliable tools to analyze long genomic sequences are still lacking. Prediction methods often perform poorly on the genome-wide scale because the number of negatives is much higher than that in the training sets. To address this issue, we propose a dynamic negative set updating scheme with a two-model approach, using one model for scanning the genome and the other one for testing candidate positions. The developed method achieves good genome-level performance and maintains robust performance when applied to other vertebrate species, without re-training. Moreover, the unannotated predicted regulatory regions made on the human genome are enriched for disease-associated variants, suggesting them to be potentially true regulatory elements rather than false positives. We validated high scoring "false positive" predictions using reporter assay and all tested candidates were successfully validated, demonstrating the ability of our method to discover novel human regulatory regions.
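
The dynamic negative set updating idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the centroid-based scorer, the function names (kmer_profile, train, update_negative_set), the window size and the toy genome are all assumptions made for illustration, and the sketch covers only the negative-set update, not the two-model scan-and-test design. In each round a model is trained, the genome is scanned, and high-scoring windows away from annotated positives are added to the negative set as hard negatives before retraining.

```python
from collections import Counter


def kmer_profile(seq, k=2):
    """Relative k-mer frequencies of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return {kmer: n / total for kmer, n in counts.items()}


def train(positives, negatives, k=2):
    """Toy stand-in for model training: a nearest-centroid scorer.

    The returned function gives a score > 0 when a sequence is closer to
    the positive centroid than to the negative one.
    """
    def centroid(seqs):
        acc = Counter()
        for s in seqs:
            for kmer, freq in kmer_profile(s, k).items():
                acc[kmer] += freq / len(seqs)
        return acc

    pos_c, neg_c = centroid(positives), centroid(negatives)

    def score(seq):
        prof = kmer_profile(seq, k)
        keys = set(prof) | set(pos_c) | set(neg_c)
        d_pos = sum((prof.get(x, 0) - pos_c.get(x, 0)) ** 2 for x in keys)
        d_neg = sum((prof.get(x, 0) - neg_c.get(x, 0)) ** 2 for x in keys)
        return d_neg - d_pos

    return score


def update_negative_set(genome, positive_starts, positives, negatives,
                        window, rounds=3, top_n=5):
    """Iteratively add the highest-scoring false positives to the negatives."""
    negatives = list(negatives)
    for _ in range(rounds):
        score = train(positives, negatives)
        candidates = {}
        for start in range(len(genome) - window + 1):
            # Skip windows that overlap an annotated positive.
            if any(abs(start - p) < window for p in positive_starts):
                continue
            seq = genome[start:start + window]
            s = score(seq)
            if s > 0 and seq not in negatives:
                candidates[seq] = s
        if not candidates:
            break  # the scan no longer produces false positives
        hardest = sorted(candidates, key=candidates.get, reverse=True)[:top_n]
        negatives.extend(hardest)
    return negatives, train(positives, negatives)


# Toy genome: AT-rich background, a GC-rich decoy, and one annotated positive.
genome = ("ATATATATAT" * 3 + "GGCCGGCCGG" + "ATATATATAT" * 3
          + "GCGCGCGCGC" + "ATATATATAT" * 2)
new_negatives, model = update_negative_set(
    genome, positive_starts=[70], positives=["GCGCGCGC"],
    negatives=["ATATATAT"], window=8, rounds=1)
```

In this toy run the GC-rich decoy region is mistaken for a positive by the first model, so its windows are the ones added to the negative set.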

Subjects

Deep Learning; Models, Genetic; Regulatory Sequences, Nucleic Acid; Transcription Initiation, Genetic; Computational Biology; Enhancer Elements, Genetic; Gene Expression Regulation; Genes, Reporter; Genome, Human; Genome-Wide Association Study/statistics & numerical data; Humans; Molecular Sequence Annotation; Mutation; Promoter Regions, Genetic

4.

Promoter analysis and prediction in the human genome using sequence-based deep learning models.

Umarov, Ramzan; Kuwahara, Hiroyuki; Li, Yu; Gao, Xin; Solovyev, Victor.

Bioinformatics ; 35(16): 2730-2737, 2019 08 15.

Article in English | MEDLINE | ID: mdl-30601980

ABSTRACT

MOTIVATION: Computational identification of promoters is notoriously difficult as human genes often have unique promoter sequences that provide regulation of transcription and interaction with the transcription initiation complex. While there are many attempts to develop computational promoter identification methods, we have no reliable tool to analyze long genomic sequences. RESULTS: In this work, we further develop our deep learning approach that was relatively successful at discriminating short promoter and non-promoter sequences. Instead of focusing on the classification accuracy, in this work we predict the exact positions of the transcription start site inside the genomic sequences, testing every possible location. We studied human promoters to find effective regions for discrimination and built corresponding deep learning models. These models use an adaptively constructed negative set, which iteratively improves the model's discriminative ability. Our method significantly outperforms the previously developed promoter prediction programs by considerably reducing the number of false-positive predictions. We have achieved an error-per-1000-bp rate of 0.02 and have 0.31 errors per correct prediction, which is significantly better than the results of other human promoter predictors. AVAILABILITY AND IMPLEMENTATION: The developed method is available as a web server at http://www.cbrc.kaust.edu.sa/PromID/.
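
The two rates quoted in the abstract above can be made concrete with a short sketch. The abstract does not spell out the definitions, so the following assumes that an "error" is a predicted position with no annotated transcription start site within a tolerance window, and that a "correct prediction" is an annotated site with at least one prediction inside that window. The function name tss_error_rates and the 500 bp tolerance are hypothetical choices, not taken from the paper.

```python
def tss_error_rates(predicted, annotated, scanned_bp, tolerance=500):
    """Summarise TSS predictions over a scanned region (assumed definitions)."""
    # Annotated sites recovered by at least one nearby prediction.
    correct = sum(
        1 for tss in annotated
        if any(abs(p - tss) <= tolerance for p in predicted)
    )
    # Predictions that are far from every annotated site.
    errors = sum(
        1 for p in predicted
        if all(abs(p - tss) > tolerance for tss in annotated)
    )
    return {
        "correct": correct,
        "errors": errors,
        "errors_per_1000_bp": errors / (scanned_bp / 1000),
        "errors_per_correct": errors / correct if correct else float("inf"),
    }


# Three predictions on a 100 kb region with two annotated sites.
rates = tss_error_rates(predicted=[1000, 5200, 9000],
                        annotated=[1100, 5000],
                        scanned_bp=100_000)
```

Here two annotated sites are recovered and one prediction is spurious, giving 0.01 errors per 1000 bp and 0.5 errors per correct prediction under the assumed definitions.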

Subjects

Deep Learning; Promoter Regions, Genetic; Genome, Human; Genomics; Humans; Transcription Initiation Site

5.

DEEPre: sequence-based enzyme EC number prediction by deep learning.

Li, Yu; Wang, Sheng; Umarov, Ramzan; Xie, Bingqing; Fan, Ming; Li, Lihua; Gao, Xin.

Bioinformatics ; 34(5): 760-769, 2018 03 01.

Article in English | MEDLINE | ID: mdl-29069344

ABSTRACT

Motivation: Annotation of enzyme function has a broad range of applications, such as metagenomics, industrial biotechnology, and diagnosis of enzyme deficiency-caused diseases. However, the time and resource required make it prohibitively expensive to experimentally determine the function of every enzyme. Therefore, computational enzyme function prediction has become increasingly important. In this paper, we develop such an approach, determining the enzyme function by predicting the Enzyme Commission number. Results: We propose an end-to-end feature selection and classification model training approach, as well as an automatic and robust feature dimensionality uniformization method, DEEPre, in the field of enzyme function prediction. Instead of extracting manually crafted features from enzyme sequences, our model takes the raw sequence encoding as inputs, extracting convolutional and sequential features from the raw encoding based on the classification result to directly improve the prediction performance. The thorough cross-fold validation experiments conducted on two large-scale datasets show that DEEPre improves the prediction performance over the previous state-of-the-art methods. In addition, our server outperforms five other servers in determining the main class of enzymes on a separate low-homology dataset. Two case studies demonstrate DEEPre's ability to capture the functional difference of enzyme isoforms. Availability and implementation: The server could be accessed freely at http://www.cbrc.kaust.edu.sa/DEEPre. Contact: xin.gao@kaust.edu.sa. Supplementary information: Supplementary data are available at Bioinformatics online.

Subjects

Computational Biology/methods; Enzymes/metabolism; Machine Learning; Molecular Sequence Annotation/methods; Humans; Software

6.

TSSPlant: a new tool for prediction of plant Pol II promoters.

Shahmuradov, Ilham A; Umarov, Ramzan Kh; Solovyev, Victor V.

Nucleic Acids Res ; 45(8): e65, 2017 05 05.

Article in English | MEDLINE | ID: mdl-28082394

ABSTRACT

Our current knowledge of eukaryotic promoters indicates their complex architecture that is often composed of numerous functional motifs. Most of known promoters include multiple and in some cases mutually exclusive transcription start sites (TSSs). Moreover, TSS selection depends on cell/tissue, development stage and environmental conditions. Such complex promoter structures make their computational identification notoriously difficult. Here, we present TSSPlant, a novel tool that predicts both TATA and TATA-less promoters in sequences of a wide spectrum of plant genomes. The tool was developed by using large promoter collections from ppdb and PlantProm DB. It utilizes eighteen significant compositional and signal features of plant promoter sequences selected in this study, that feed the artificial neural network-based model trained by the backpropagation algorithm. TSSPlant achieves significantly higher accuracy compared to the next best promoter prediction program for both TATA promoters (MCC≃0.84 and F1-score≃0.91 versus MCC≃0.51 and F1-score≃0.71) and TATA-less promoters (MCC≃0.80, F1-score≃0.89 versus MCC≃0.29 and F1-score≃0.50). TSSPlant is available to download as a standalone program at http://www.cbrc.kaust.edu.sa/download/.
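
For readers unfamiliar with the two measures reported above, the Matthews correlation coefficient (MCC) and the F1-score are both computed from the counts of a binary confusion matrix. The sketch below uses the standard textbook formulas; the function names and the example counts are illustrative and are not taken from the paper's evaluation.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Illustrative counts: 100 promoters and 100 non-promoters,
# with 10 mistakes in each class.
example_mcc = mcc(tp=90, fp=10, tn=90, fn=10)
example_f1 = f1_score(tp=90, fp=10, fn=10)
```

MCC uses all four counts, including true negatives, whereas F1 ignores true negatives, which is why the two values can diverge on unbalanced test sets.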

Subjects

Genome, Plant; Neural Networks, Computer; Plant Proteins/genetics; Promoter Regions, Genetic; RNA Polymerase II/genetics; Transcription Initiation Site; Arabidopsis/genetics; Arabidopsis/metabolism; Gene Expression; Oryza/genetics; Oryza/metabolism; Plant Proteins/metabolism; RNA Polymerase II/metabolism; Sequence Analysis, DNA; Software

7.

Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape.

Dai, Hanjun; Umarov, Ramzan; Kuwahara, Hiroyuki; Li, Yu; Song, Le; Gao, Xin.

Bioinformatics ; 33(22): 3575-3583, 2017 Nov 15.

Article in English | MEDLINE | ID: mdl-28961686

ABSTRACT

MOTIVATION: An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem. RESULTS: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods. AVAILABILITY AND IMPLEMENTATION: Our program is freely available at https://github.com/ramzan1990/sequence2vec. CONTACT: xin.gao@kaust.edu.sa or lsong@cc.gatech.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; DNA/metabolism; Sequence Analysis, DNA/methods; Transcription Factors/metabolism; Binding Sites; DNA/chemistry; Machine Learning; Models, Statistical; Protein Binding

8.

HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes.

Li, Yu; Xu, Zeling; Han, Wenkai; Cao, Huiluo; Umarov, Ramzan; Yan, Aixin; Fan, Ming; Chen, Huan; Duarte, Carlos M; Li, Lihua; Ho, Pak-Leung; Gao, Xin.

Microbiome ; 9(1): 40, 2021 02 08.

Article in English | MEDLINE | ID: mdl-33557954

ABSTRACT

BACKGROUND: The spread of antibiotic resistance has become one of the most urgent threats to global health, which is estimated to cause 700,000 deaths each year globally. Its surrogates, antibiotic resistance genes (ARGs), are highly transmittable between food, water, animal, and human to mitigate the efficacy of antibiotics. Accurately identifying ARGs is thus an indispensable step to understanding the ecology and transmission of ARGs between environmental and human-associated reservoirs. Unfortunately, the previous computational methods for identifying ARGs are mostly based on sequence alignment, which cannot identify novel ARGs, and their applications are limited by currently incomplete knowledge about ARGs. RESULTS: Here, we propose an end-to-end Hierarchical Multi-task Deep learning framework for ARG annotation (HMD-ARG). Taking raw sequence encoding as input, HMD-ARG can identify, without querying against existing sequence databases, multiple ARG properties simultaneously, including if the input protein sequence is an ARG, and if so, what antibiotic family it is resistant to, what resistant mechanism the ARG takes, and if the ARG is an intrinsic one or acquired one. In addition, if the predicted antibiotic family is beta-lactamase, HMD-ARG further predicts the subclass of beta-lactamase that the ARG is resistant to. Comprehensive experiments, including cross-fold validation, third-party dataset validation in human gut microbiota, wet-experimental functional validation, and structural investigation of predicted conserved sites, demonstrate not only the superior performance of our method over the state-of-the-art methods, but also the effectiveness and robustness of the proposed method. CONCLUSIONS: We propose a hierarchical multi-task method, HMD-ARG, which is based on deep learning and can provide detailed annotations of ARGs from three important aspects: resistant antibiotic class, resistant mechanism, and gene mobility. We believe that HMD-ARG can serve as a powerful tool to identify antibiotic resistance genes and therefore mitigate their global threat. Our method and the constructed database are available at http://www.cbrc.kaust.edu.sa/HMDARG/ .

Subjects

Deep Learning; Drug Resistance, Microbial/genetics; Genes, Bacterial/genetics; Animals; Humans; beta-Lactamases/genetics

9.

Analysis of transcript-deleterious variants in Mendelian disorders: implications for RNA-based diagnostics.

Maddirevula, Sateesh; Kuwahara, Hiroyuki; Ewida, Nour; Shamseldin, Hanan E; Patel, Nisha; Alzahrani, Fatema; AlSheddi, Tarfa; AlObeid, Eman; Alenazi, Mona; Alsaif, Hessa S; Alqahtani, Maha; AlAli, Maha; Al Ali, Hatoon; Helaby, Rana; Ibrahim, Niema; Abdulwahab, Firdous; Hashem, Mais; Hanna, Nadine; Monies, Dorota; Derar, Nada; Alsagheir, Afaf; Alhashem, Amal; Alsaleem, Badr; Alhebbi, Hamoud; Wali, Sami; Umarov, Ramzan; Gao, Xin; Alkuraya, Fowzan S.

Genome Biol ; 21(1): 145, 2020 06 17.

Article in English | MEDLINE | ID: mdl-32552793

ABSTRACT

BACKGROUND: At least 50% of patients with suspected Mendelian disorders remain undiagnosed after whole-exome sequencing (WES), and the extent to which non-coding variants that are not captured by WES contribute to this fraction is unclear. Whole transcriptome sequencing is a promising supplement to WES, although empirical data on the contribution of RNA analysis to the diagnosis of Mendelian diseases on a large scale are scarce. RESULTS: Here, we describe our experience with transcript-deleterious variants (TDVs) based on a cohort of 5647 families with suspected Mendelian diseases. We first interrogate all families for which the respective Mendelian phenotype could be mapped to a single locus to obtain an unbiased estimate of the contribution of TDVs at 18.9%. We examine the entire cohort and find that TDVs account for 15% of all "solved" cases. We compare the results of RT-PCR to in silico prediction. Definitive results from RT-PCR are obtained from blood-derived RNA for the overwhelming majority of variants (84.1%), and only a small minority (2.6%) fail analysis on all available RNA sources (blood-, skin fibroblast-, and urine renal epithelial cells-derived), which has important implications for the clinical application of RNA-seq. We also show that RNA analysis can establish the diagnosis in 13.5% of 155 patients who had received "negative" clinical WES reports. Finally, our data suggest a role for TDVs in modulating penetrance even in otherwise highly penetrant Mendelian disorders. CONCLUSIONS: Our results provide much needed empirical data for the impending implementation of diagnostic RNA-seq in conjunction with genome sequencing.

Subjects

Genetic Diseases, Inborn/diagnosis; Genetic Testing/methods; Sequence Analysis, RNA; Cohort Studies; Computer Simulation; Genetic Diseases, Inborn/epidemiology; Genetic Diseases, Inborn/genetics; Genetic Diseases, Inborn/metabolism; Humans; Models, Genetic; Saudi Arabia/epidemiology; Exome Sequencing

10.

A deep learning framework to predict binding preference of RNA constituents on protein surface.

Lam, Jordy Homing; Li, Yu; Zhu, Lizhe; Umarov, Ramzan; Jiang, Hanlun; Héliou, Amélie; Sheong, Fu Kit; Liu, Tianyun; Long, Yongkang; Li, Yunfei; Fang, Liang; Altman, Russ B; Chen, Wei; Huang, Xuhui; Gao, Xin.

Nat Commun ; 10(1): 4941, 2019 10 30.

Article in English | MEDLINE | ID: mdl-31666519

ABSTRACT

Protein-RNA interaction plays important roles in post-transcriptional regulation. However, the task of predicting these interactions given a protein structure is difficult. Here we show that, by leveraging a deep learning model NucleicNet, attributes such as binding preference of RNA backbone constituents and different bases can be predicted from local physicochemical characteristics of protein structure surface. On a diverse set of challenging RNA-binding proteins, including Fem-3-binding-factor 2, Argonaute 2 and Ribonuclease III, NucleicNet can accurately recover interaction modes discovered by structural biology experiments. Furthermore, we show that, without seeing any in vitro or in vivo assay data, NucleicNet can still achieve consistency with experiments, including RNAcompete, Immunoprecipitation Assay, and siRNA Knockdown Benchmark. NucleicNet can thus serve to provide quantitative fitness of RNA sequences for given binding pockets or to predict potential binding pockets and binding RNAs for previously unknown RNA binding proteins.

Subjects

Argonaute Proteins/metabolism; Deep Learning; RNA/metabolism; Ribonuclease III/metabolism; Adenine/metabolism; Animals; Area Under Curve; Cytosine/metabolism; Gene Knockdown Techniques; Guanine/metabolism; Humans; Mice; Phosphates/metabolism; Protein Binding; RNA, Small Interfering; RNA-Binding Proteins/metabolism; ROC Curve; Ribose/metabolism; Uracil/metabolism

11.

Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks.

Umarov, Ramzan Kh; Solovyev, Victor V.

PLoS One ; 12(2): e0171410, 2017.

Article in English | MEDLINE | ID: mdl-28158264

ABSTRACT

Accurate computational identification of promoters remains a challenge as these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we utilize Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build their predictive models. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that a CNN trained on the sigma70 subclass of Escherichia coli promoters gives an excellent classification of promoter and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The Bacillus subtilis promoter identification CNN model achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs for identification of two well-known promoter classes (TATA and non-TATA promoters). CNN models nicely recognize these complex functional regions. For human promoters the Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0.90 for TATA and 0.90/0.98/0.89 for non-TATA promoter sequences, respectively. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA) promoters. Thus, the developed CNN models, implemented in the CNNProm program, demonstrated the ability of the deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy compared to the previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other and especially newly sequenced genomes. The CNNProm program is available to run on the web server at http://www.softberry.com.
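
A convolutional network needs a numeric representation of the DNA sequence. The abstract above does not describe the input encoding, so the following is only a sketch of the common one-hot scheme, assumed here for illustration: each base becomes a 4-element vector in A, C, G, T order and any other symbol (such as N) becomes all zeros. The function name one_hot_encode is hypothetical.

```python
def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element one-hot rows (A, C, G, T)."""
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    unknown = (0, 0, 0, 0)  # ambiguous bases carry no signal
    # Build a fresh list per position so rows can be modified independently.
    return [list(table.get(base, unknown)) for base in seq.upper()]


encoded = one_hot_encode("acgtN")
```

The resulting matrix has one row per position and one column per nucleotide, which is the shape a 1D convolution over sequence positions typically expects.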

Subjects

Eukaryotic Cells/metabolism; Neural Networks, Computer; Prokaryotic Cells/metabolism; Promoter Regions, Genetic/genetics; Animals; Computational Biology/methods; Humans; Sequence Analysis, DNA

12.

ACRE: Absolute concentration robustness exploration in module-based combinatorial networks.

Kuwahara, Hiroyuki; Umarov, Ramzan; Almasri, Islam; Gao, Xin.

Synth Biol (Oxf) ; 2(1): ysx001, 2017 Jan.

Article in English | MEDLINE | ID: mdl-32995502

ABSTRACT

To engineer cells for industrial-scale application, a deep understanding of how to design molecular control mechanisms to tightly maintain functional stability under various fluctuations is crucial. Absolute concentration robustness (ACR) is a category of robustness in reaction network models in which the steady-state concentration of a molecular species is guaranteed to be invariant even with perturbations in the other molecular species in the network. Here, we introduce a software tool, absolute concentration robustness explorer (ACRE), which efficiently explores combinatorial biochemical networks for the ACR property. ACRE has a user-friendly interface, and it can facilitate efficient analysis of key structural features that guarantee the presence and the absence of the ACR property from combinatorial networks. Such analysis is expected to be useful in synthetic biology as it can increase our understanding of how to design molecular mechanisms to tightly control the concentration of molecular species. ACRE is freely available at https://github.com/ramzan1990/ACRE.

13.

SBOLme: a Repository of SBOL Parts for Metabolic Engineering.

Kuwahara, Hiroyuki; Cui, Xuefeng; Umarov, Ramzan; Grünberg, Raik; Myers, Chris J; Gao, Xin.

ACS Synth Biol ; 6(4): 732-736, 2017 04 21.

Article in English | MEDLINE | ID: mdl-28076956

ABSTRACT

The Synthetic Biology Open Language (SBOL) is a community-driven open language to promote standardization in synthetic biology. To support the use of SBOL in metabolic engineering, we developed SBOLme, the first open-access repository of SBOL 2-compliant biochemical parts for a wide range of metabolic engineering applications. The URL of our repository is http://www.cbrc.kaust.edu.sa/sbolme .

Subjects

Metabolic Engineering; User-Computer Interface; Databases, Chemical; Databases, Protein; Enzymes/metabolism; Programming Languages
