Results 1 - 15 of 15
1.
Nucleic Acids Res ; 46(12): e72, 2018 07 06.
Article in English | MEDLINE | ID: mdl-29617876

ABSTRACT

Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs while significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented as a web tool and as stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.


Subject(s)
Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Binding Sites , Chromatin Immunoprecipitation , DNA/chemistry , DNA/metabolism , Humans , Machine Learning , Position-Specific Scoring Matrices
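
For readers unfamiliar with the PWM baseline that the abstract above compares against, the following is a minimal sketch of log-odds PWM scoring. It is not DRAF and not the scoring code of HOCOMOCO, TRANSFAC or DeepBind; the count matrix `COUNTS`, the pseudocount, the uniform background, and the function names are all hypothetical choices made for illustration.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# The consensus of this made-up motif is "ACGT".
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 1, "C": 8, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 8, "T": 1},
    {"A": 1, "C": 0, "G": 1, "T": 8},
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds weights.

    Assumes a uniform background frequency; real tools often estimate it
    from the genome being scanned.
    """
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(pwm, sequence):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(pwm, sequence):
    """Return the (start, score) pair with the highest log-odds score."""
    return max(scan(pwm, sequence), key=lambda hit: hit[1])
```

A site is typically called when its score exceeds a chosen threshold; lowering the threshold raises sensitivity at the cost of more false positives, which is the trade-off the abstract refers to.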
2.
Bioinformatics ; 34(7): 1164-1173, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29186331

ABSTRACT

Motivation: Computationally finding drug-target interactions (DTIs) is a convenient strategy to identify new DTIs at low cost with reasonable accuracy. However, current DTI prediction methods suffer from a high false positive prediction rate. Results: We developed DDR, a novel method that improves the DTI prediction accuracy. DDR is based on the use of a heterogeneous graph that contains known DTIs with multiple similarities between drugs and multiple similarities between target proteins. DDR applies a non-linear similarity fusion method to combine different similarities. Before fusion, DDR performs a pre-processing step where a subset of similarities is selected in a heuristic process to obtain an optimized combination of similarities. Then, DDR applies a random forest model using different graph-based features extracted from the DTI heterogeneous graph. Using 5 repeats of 10-fold cross-validation, three testing setups, and the weighted average of area under the precision-recall curve (AUPR) scores, we show that DDR significantly reduces the AUPR score error relative to the next-best state-of-the-art method for predicting DTIs by 34% when the drugs are new, by 23% when targets are new and by 34% when the drugs and the targets are known but not all DTIs between them are known. Using independent sources of evidence, we verify as correct 22 out of the top 25 DDR novel predictions. This suggests that DDR can be used as an efficient method to identify correct DTIs. Availability and implementation: The data and code are provided at https://bitbucket.org/RSO24/ddr/. Contact: vladimir.bajic@kaust.edu.sa. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , Drug Interactions , Machine Learning , Area Under Curve , Humans
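
The evaluation protocol named in the abstract above, 5 repeats of 10-fold cross-validation, can be sketched as an index generator. This is an illustrative sketch only, not the DDR code: the function name, the seed, and the simple interleaved fold assignment are assumptions, and it splits generic sample indices rather than reproducing the paper's three testing setups (new drugs, new targets, known pairs).

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Each repeat reshuffles the samples, then every sample lands in exactly
    one test fold of that repeat.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for fold in range(n_splits):
            # Take every n_splits-th shuffled index, offset by the fold number.
            test = indices[fold::n_splits]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield repeat, fold, train, test
```

With the defaults this produces 50 train/test splits; a per-split score such as AUPR would then be averaged over all of them.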
3.
Nucleic Acids Res ; 45(8): e58, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28053124

ABSTRACT

Comparing histone modification profiles between cancer and normal states, or across different tumor samples, can provide insights into understanding cancer initiation, progression and response to therapy. ChIP-seq histone modification data of cancer samples are distorted by copy number variation innate to any cancer cell. We present HMCan-diff, the first method designed to analyze ChIP-seq data to detect changes in histone modifications between two cancer samples of different genetic backgrounds, or between a cancer sample and a normal control. HMCan-diff explicitly corrects for copy number bias, and for other biases in the ChIP-seq data, which significantly improves prediction accuracy compared to methods that do not consider such corrections. On in silico simulated ChIP-seq data generated using genomes with differences in copy number profiles, HMCan-diff shows a much better performance compared to other methods that have no correction for copy number bias. Additionally, we benchmarked HMCan-diff on four experimental datasets, characterizing two histone marks in two different scenarios. We correlated changes in histone modifications between a cancer and a normal control sample with changes in gene expression. On all experimental datasets, HMCan-diff demonstrated better performance compared to the other methods.


Subject(s)
Gene Expression Regulation, Neoplastic , Histone Code , Histones/genetics , Neoplasms/genetics , Software , Algorithms , Chromatin Immunoprecipitation , Datasets as Topic , Disease Progression , Gene Dosage , Histones/metabolism , Humans , Markov Chains , Neoplasms/metabolism , Neoplasms/pathology
4.
Nucleic Acids Res ; 45(5): 2838-2848, 2017 03 17.
Article in English | MEDLINE | ID: mdl-27924038

ABSTRACT

Non-coding RNA (ncRNA) genes play a major role in the control of heterogeneous cellular behavior. Yet, their functions are largely uncharacterized. Currently available databases lack in-depth information on ncRNA functions across the spectrum of cells/tissues. Here, we present FARNA, a knowledgebase of inferred functions of 10,289 human ncRNA transcripts (2,734 microRNA and 7,555 long ncRNA) in 119 human tissues and 177 human primary cells. Since transcription factors (TFs) and TF co-factors (TcoFs) are crucial components of the regulatory machinery for activation of gene transcription, the cellular processes and diseases in which TFs and TcoFs are involved suggest functions of the transcripts they regulate. In FARNA, functions of a transcript are inferred from TFs and TcoFs whose genes co-express with the transcript controlled by these TFs and TcoFs in a considered cell/tissue. Transcripts were annotated using statistically enriched GO terms, pathways and diseases across cells/tissues based on the guilt-by-association principle. Expression profiles across cells/tissues based on Cap Analysis of Gene Expression (CAGE) are provided. FARNA, having the most comprehensive function annotation of the considered ncRNAs across the widest spectrum of human cells/tissues, has the potential to greatly contribute to our understanding of ncRNA roles and their regulatory mechanisms in humans. FARNA can be accessed at: http://cbrc.kaust.edu.sa/farna.


Subject(s)
Databases, Nucleic Acid , Knowledge Bases , MicroRNAs/physiology , RNA, Long Noncoding/physiology , Humans , MicroRNAs/metabolism , RNA, Long Noncoding/metabolism , Transcription Factors/metabolism
5.
Nucleic Acids Res ; 44(D1): D116-25, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26586801

ABSTRACT

Models of transcription factor (TF) binding sites provide a basis for a wide spectrum of studies in regulatory genomics, from reconstruction of regulatory networks to functional annotation of transcripts and sequence variants. While TFs may recognize different sequence patterns in different conditions, it is pragmatic to have a single generic model for each particular TF as a baseline for practical applications. Here we present the expanded and enhanced version of HOCOMOCO (http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco10), the collection of models of DNA patterns recognized by transcription factors. HOCOMOCO now provides position weight matrix (PWM) models for binding sites of 601 human TFs and, in addition, PWMs for 396 mouse TFs. Furthermore, we introduce the largest collection to date of dinucleotide PWM models for 86 (52) human (mouse) TFs. The update is based on the analysis of massive ChIP-Seq and HT-SELEX datasets, with the validation of the resulting models on in vivo data. To facilitate practical application, all HOCOMOCO models are linked to gene and protein databases (Entrez Gene, HGNC, UniProt) and accompanied by precomputed score thresholds. Finally, we provide command-line tools for PWM and diPWM threshold estimation and motif finding in nucleotide sequences.


Subject(s)
Databases, Genetic , Regulatory Elements, Transcriptional , Transcription Factors/metabolism , Animals , Binding Sites , Chromatin Immunoprecipitation , Humans , Mice , Models, Biological , Sequence Analysis, DNA
7.
Bioinformatics ; 29(23): 2979-86, 2013 Dec 01.
Article in English | MEDLINE | ID: mdl-24021381

ABSTRACT

MOTIVATION: Cancer cells are often characterized by epigenetic changes, which include aberrant histone modifications. In particular, local or regional epigenetic silencing is a common mechanism in cancer for silencing expression of tumor suppressor genes. Though several tools have been created to enable detection of histone marks in ChIP-seq data from normal samples, it is unclear whether these tools can be efficiently applied to ChIP-seq data generated from cancer samples. Indeed, cancer genomes are often characterized by frequent copy number alterations: gains and losses of large regions of chromosomal material. Copy number alterations may create a substantial statistical bias in the evaluation of histone mark signal enrichment and result in underdetection of the signal in the regions of loss and overdetection of the signal in the regions of gain. RESULTS: We present HMCan (Histone modifications in cancer), a tool specially designed to analyze histone modification ChIP-seq data produced from cancer genomes. HMCan corrects for the GC-content and copy number bias and then applies Hidden Markov Models to detect the signal from the corrected data. On simulated data, HMCan outperformed several commonly used tools developed to analyze histone modification data produced from genomes without copy number alterations. HMCan also showed superior results on a ChIP-seq dataset generated for the repressive histone mark H3K27me3 in a bladder cancer cell line. HMCan predictions matched well with experimental data (qPCR validated regions) and included, for example, the previously detected H3K27me3 mark in the promoter of the DLEC1 gene, missed by other tools we tested.


Subject(s)
Chromatin Assembly and Disassembly/genetics , Chromatin Immunoprecipitation/methods , Epigenesis, Genetic , Histones/genetics , Protein Processing, Post-Translational , Software , Urinary Bladder Neoplasms/genetics , Base Composition , Computer Simulation , DNA Copy Number Variations/genetics , Genome, Human , Histones/metabolism , Humans , Markov Chains , Oligonucleotide Array Sequence Analysis , Promoter Regions, Genetic/genetics , Urinary Bladder Neoplasms/diagnosis
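
The copy number bias described in the abstract above (overdetection in gained regions, underdetection in lost regions) can be illustrated with a deliberately simplified rescaling of binned read counts. This is not HMCan's normalization, which also corrects for GC content and feeds the corrected data to a Hidden Markov Model; the function name, the per-bin copy number input, and the linear scaling are assumptions made for this sketch.

```python
def correct_for_copy_number(bin_counts, copy_numbers, ploidy=2.0):
    """Rescale per-bin ChIP-seq read counts to a common copy number.

    Assumes signal scales linearly with the number of DNA copies, so a bin
    present in 4 copies of a diploid genome has its count halved.
    """
    if len(bin_counts) != len(copy_numbers):
        raise ValueError("bin_counts and copy_numbers must have equal length")
    corrected = []
    for count, copy_number in zip(bin_counts, copy_numbers):
        if copy_number <= 0:
            # Homozygous deletion: no DNA is present, so no signal is reported.
            corrected.append(0.0)
        else:
            corrected.append(count * ploidy / copy_number)
    return corrected
```

In this toy setting, bins with raw counts 10, 20 and 5 at copy numbers 2, 4 and 1 all rescale to the same value, showing how uncorrected counts would exaggerate the gained region and understate the lost one.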
8.
Bioinformatics ; 29(1): 117-8, 2013 Jan 01.
Article in English | MEDLINE | ID: mdl-23110968

ABSTRACT

SUMMARY: In higher eukaryotes, the identification of translation initiation sites (TISs) has been focused on finding these signals in cDNA or mRNA sequences. Using Arabidopsis thaliana (A.t.) information, we developed a prediction tool for signals within genomic sequences of plants that correspond to TISs. Our tool requires only genome sequence, not expressed sequences. Its sensitivity/specificity is 90.75%/92.2% for A.t., 66.8%/94.4% for Vitis vinifera and 81.6%/94.4% for Populus trichocarpa, which suggests that our tool can be used in annotation of different plant genomes. We provide a list of features used in our model. Further study of these features may improve our understanding of the mechanisms of translation initiation. AVAILABILITY AND IMPLEMENTATION: Our tool is implemented as an artificial neural network. It is available as a web-based tool and, together with the source code, the list of features, and data used for model development, is accessible at http://cbrc.kaust.edu.sa/dts.


Subject(s)
Arabidopsis/genetics , Peptide Chain Initiation, Translational , Software , Genome, Plant , Genomics , Internet , Neural Networks, Computer , Nucleotide Motifs , Sensitivity and Specificity , Sequence Analysis, DNA
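
The sensitivity/specificity pairs quoted in the abstract above follow the standard definitions, sketched below. The function name and the confusion-matrix counts in the usage note are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true sites that are found.
    specificity = TN / (TN + FP): fraction of non-sites correctly rejected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)
```

For example, a made-up evaluation with 90 detected and 10 missed true sites, and 95 rejected and 5 accepted non-sites, gives a sensitivity of 0.90 and a specificity of 0.95.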
9.
EBioMedicine ; 93: 104686, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37379654

ABSTRACT

BACKGROUND: Individual plasma proteins have been identified as minimally invasive biomarkers for lung cancer diagnosis with potential utility in early detection. Plasma proteomes provide insight into contributing biological factors; we investigated their potential for future lung cancer prediction. METHODS: The Olink® Explore-3072 platform quantitated 2941 proteins in 496 Liverpool Lung Project plasma samples, including 131 cases taken 1-10 years prior to diagnosis, 237 controls, and 90 subjects at multiple times. 1112 proteins significantly associated with haemolysis were excluded. Feature selection with bootstrapping identified differentially expressed proteins, subsequently modelled for lung cancer prediction and validated in UK Biobank data. FINDINGS: For samples 1-3 years pre-diagnosis, 240 proteins were significantly different in cases; for 1-5 year samples, 117 of these and 150 further proteins were identified, mapping to significantly different pathways. Four machine learning algorithms gave median AUCs of 0.76-0.90 and 0.73-0.83 for the 1-3 year and 1-5 year proteins respectively. External validation gave AUCs of 0.75 (1-3 year) and 0.69 (1-5 year), with AUC 0.7 up to 12 years prior to diagnosis. The models were independent of age, smoking duration, cancer histology and the presence of COPD. INTERPRETATION: The plasma proteome provides biomarkers which may be used to identify those at greatest risk of lung cancer. The proteins and the pathways are different when lung cancer is more imminent, indicating that both biomarkers of inherent risk and biomarkers associated with presence of early lung cancer may be identified. FUNDING: Janssen Pharmaceuticals Research Collaboration Award; Roy Castle Lung Cancer Foundation.


Subject(s)
Biomarkers, Tumor , Lung Neoplasms , Humans , Biomarkers, Tumor/metabolism , Early Detection of Cancer , Lung Neoplasms/diagnosis , Biomarkers , Blood Proteins , Smoking , Proteome
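
The AUC values reported in the abstract above can be read as the probability that a randomly chosen case receives a higher model score than a randomly chosen control. A minimal sketch of that pairwise definition follows; the function name and the toy labels and scores in the usage note are hypothetical, and the study's own models and data are not reproduced here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    labels: 1 for a case, 0 for a control. Ties count as half a win.
    Runs in O(cases * controls) time, which is fine for small examples.
    """
    case_scores = [s for label, s in zip(labels, scores) if label == 1]
    control_scores = [s for label, s in zip(labels, scores) if label == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

With the toy input `roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])`, three of the four case/control pairs are ranked correctly, so the result is 0.75; a value of 0.5 corresponds to chance-level ranking.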
10.
Sci Rep ; 11(1): 376, 2021 01 11.
Article in English | MEDLINE | ID: mdl-33432081

ABSTRACT

Intra-tumoral epigenetic heterogeneity is an indicator of tumor population fitness and is linked to the deregulation of transcription. However, there is no published computational tool to automate the measurement of intra-tumoral epigenetic allelic heterogeneity. We developed an R/Bioconductor package, epihet, to calculate the intra-tumoral epigenetic heterogeneity and to perform differential epigenetic heterogeneity analysis. Furthermore, epihet can implement a biological network analysis workflow for transforming cancer-specific differential epigenetic heterogeneity loci into cancer-related biological function and clinical biomarkers. Finally, we demonstrated the utility of epihet on acute myeloid leukemia. We found statistically significant differential epigenetic heterogeneity (DEH) loci compared to normal controls and constructed a co-epigenetic heterogeneity network and modules. epihet is available at https://bioconductor.org/packages/release/bioc/html/epihet.html.


Subject(s)
Epigenesis, Genetic/physiology , Genetic Heterogeneity , Neoplasms/genetics , Software , Computational Biology/methods , Epigenomics/methods , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Humans , Neoplasms/pathology , Polymorphism, Genetic , Tumor Microenvironment/genetics
11.
J Cheminform ; 12(1): 44, 2020 Jun 29.
Article in English | MEDLINE | ID: mdl-33431036

ABSTRACT

In silico prediction of drug-target interactions is a critical phase in the sustainable drug development process, especially when the research focus is to capitalize on the repositioning of existing drugs. Developing such computational methods is not an easy task, but it is much needed, as current methods that predict potential drug-target interactions suffer from high false-positive rates. Here we introduce DTiGEMS+, a computational method that predicts Drug-Target interactions using Graph Embedding, graph Mining, and Similarity-based techniques. DTiGEMS+ combines similarity-based as well as feature-based approaches, and models the identification of novel drug-target interactions as a link prediction problem in a heterogeneous network. DTiGEMS+ constructs the heterogeneous network by augmenting the known drug-target interactions graph with two complementary graphs, namely drug-drug similarity and target-target similarity. DTiGEMS+ combines different computational techniques to provide the final drug-target prediction; these techniques include graph embeddings, graph mining, and machine learning. DTiGEMS+ integrates multiple drug-drug similarities and target-target similarities into the final heterogeneous graph construction after applying a similarity selection procedure as well as a similarity fusion algorithm. Using four benchmark datasets, we show that DTiGEMS+ substantially improves prediction performance compared to other state-of-the-art in silico methods developed to predict drug-target interactions by achieving the highest average AUPR across all datasets (0.92), which reduces the error rate by 33.3% relative to the second-best performing model in the state-of-the-art methods comparison.
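
The "error rate reduction" quoted in the abstract above can be made concrete with a small calculation, assuming the error is defined as 1 minus the AUPR (the abstract does not spell out the definition). The function name is hypothetical, and the baseline AUPR of 0.88 used in the usage note is an assumed value chosen only because it is consistent with the reported 33.3%; it is not stated in the abstract.

```python
def relative_error_reduction(score_new, score_baseline):
    """Fraction by which the error (1 - score) shrinks relative to a baseline.

    Assumes scores such as AUPR lie in [0, 1] and that error = 1 - score.
    """
    error_new = 1.0 - score_new
    error_baseline = 1.0 - score_baseline
    if error_baseline <= 0:
        raise ValueError("baseline has no error left to reduce")
    return (error_baseline - error_new) / error_baseline
```

Under these assumptions, `relative_error_reduction(0.92, 0.88)` compares an error of 0.08 against an error of 0.12 and returns about 0.333, that is, a one-third reduction even though the absolute AUPR gain is only 0.04.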

12.
Nat Commun ; 11(1): 1173, 2020 03 03.
Article in English | MEDLINE | ID: mdl-32127534

ABSTRACT

Chromatin interaction studies can reveal how the genome is organized into spatially confined sub-compartments in the nucleus. However, accurately identifying sub-compartments from chromatin interaction data remains a challenge in computational biology. Here, we present Sub-Compartment Identifier (SCI), an algorithm that uses graph embedding followed by unsupervised learning to predict sub-compartments using Hi-C chromatin interaction data. We find that the network topological centrality and clustering performance of SCI sub-compartment predictions are superior to those of hidden Markov model (HMM) sub-compartment predictions. Moreover, using orthogonal Chromatin Interaction Analysis by in-situ Paired-End Tag Sequencing (ChIA-PET) data, we confirmed that SCI sub-compartment prediction outperforms HMM. We show that SCI-predicted sub-compartments have distinct epigenetic marks, transcriptional activities, and transcription factor enrichment. Moreover, we present a deep neural network to predict sub-compartments using epigenome, replication timing, and sequence data. Our neural network produces more accurate sub-compartment predictions when SCI-determined sub-compartments are used as labels for training.


Subject(s)
Chromatin/genetics , Computer Graphics , Genomics/methods , Neural Networks, Computer , Algorithms , Chromatin/metabolism , Cluster Analysis , Data Analysis , Epigenome , Gene Expression , Humans , K562 Cells , Markov Chains , Reproducibility of Results , Unsupervised Machine Learning
13.
Genomics Proteomics Bioinformatics ; 16(5): 332-341, 2018 10.
Article in English | MEDLINE | ID: mdl-30578915

ABSTRACT

In mammalian cells, transcribed enhancers (TrEns) play important roles in the initiation of gene expression and maintenance of gene expression levels in a spatiotemporal manner. One of the most challenging questions is how the genomic characteristics of enhancers relate to enhancer activities. To date, only a limited number of enhancer sequence characteristics have been investigated, leaving space for exploring the enhancers' DNA code in a more systematic way. To address this problem, we developed a novel computational framework, Transcribed Enhancer Landscape Search (TELS), aimed at identifying predictive cell type/tissue-specific motif signatures of TrEns. As a case study, we used TELS to compile a comprehensive catalog of motif signatures for all known TrEns identified by the FANTOM5 consortium across 112 human primary cells and tissues. Our results confirm that combinations of different short motifs characterize cell type/tissue-specific TrEns in an optimized manner. Our study is the first to report combinations of motifs that maximize performance in distinguishing TrEns exclusively transcribed in one cell type/tissue from TrEns exclusively transcribed in different cell types/tissues. Moreover, we also report 31 motif signatures predictive of enhancers' broad activity. TELS codes and material are publicly available at http://www.cbrc.kaust.edu.sa/TELS.


Subject(s)
Enhancer Elements, Genetic , Sequence Analysis, DNA/methods , Transcription, Genetic , Genomics/methods , Humans , Nucleotide Motifs
14.
Nat Biotechnol ; 35(9): 872-878, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28829439

ABSTRACT

MicroRNAs (miRNAs) are short non-coding RNAs with key roles in cellular regulation. As part of the fifth edition of the Functional Annotation of Mammalian Genome (FANTOM5) project, we created an integrated expression atlas of miRNAs and their promoters by deep-sequencing 492 short RNA (sRNA) libraries, with matching Cap Analysis Gene Expression (CAGE) data, from 396 human and 47 mouse RNA samples. Promoters were identified for 1,357 human and 804 mouse miRNAs and showed strong sequence conservation between species. We also found that primary and mature miRNA expression levels were correlated, allowing us to use the primary miRNA measurements as a proxy for mature miRNA levels in a total of 1,829 human and 1,029 mouse CAGE libraries. We thus provide a broad atlas of miRNA expression and promoters in primary mammalian cells, establishing a foundation for detailed analysis of miRNA expression patterns and transcriptional control regions.


Subject(s)
Gene Expression Profiling/methods , MicroRNAs/genetics , Molecular Sequence Annotation , Promoter Regions, Genetic/genetics , Animals , Cells, Cultured , Gene Library , High-Throughput Nucleotide Sequencing , Humans , Mice , MicroRNAs/metabolism
15.
Article in English | MEDLINE | ID: mdl-26342387

ABSTRACT

Enhancers are cis-acting DNA regulatory regions that play a key role in distal control of transcriptional activities. Identification of enhancers, coupled with a comprehensive functional analysis of their properties, could improve our understanding of complex gene transcription mechanisms and gene regulation processes in general. We developed DENdb, a centralized on-line repository of predicted enhancers derived from multiple human cell-lines. DENdb integrates enhancers predicted by five different methods generating an enriched catalogue of putative enhancers for each of the analysed cell-lines. DENdb provides information about the overlap of enhancers with DNase I hypersensitive regions, ChIP-seq regions of a number of transcription factors and transcription factor binding motifs, means to explore enhancer interactions with DNA using several chromatin interaction assays and enhancer neighbouring genes. DENdb is designed as a relational database that facilitates fast and efficient searching, browsing and visualization of information. Database URL: http://www.cbrc.kaust.edu.sa/dendb/.


Subject(s)
Databases, Nucleic Acid , Nucleotide Motifs , Response Elements , Humans , Transcription Factors/genetics , Transcription Factors/metabolism