Results 1 - 20 of 68
1.
Nucleic Acids Res ; 46(D1): D343-D347, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29087517

ABSTRACT

TFClass is a resource that classifies eukaryotic transcription factors (TFs) according to their DNA-binding domains (DBDs), available online at http://tfclass.bioinf.med.uni-goettingen.de. The classification scheme of TFClass was originally derived for human TFs and is expanded here to the whole taxonomic class Mammalia. By combining information from different resources, manually checking the retrieved mammalian TF sequences and applying extensive phylogenetic analyses, >39 000 TFs from up to 41 mammalian species were assigned to the Superclasses, Classes, Families and Subfamilies of TFClass. As a result, TFClass now provides the corresponding sequence collection in FASTA format, sequence logos and phylogenetic trees at different classification levels, predicted TF binding sites for human, mouse, dog and cow genomes as well as links to several external databases. In particular, all those TFs that are also documented in the TRANSFAC® database (FACTOR table) have been linked and can be freely accessed. TRANSFAC® FACTOR can also be queried through its own search interface.


Subject(s)
Databases, Protein , Transcription Factors/classification , Animals , Binding Sites , Cattle , Dogs , Humans , Mammals , Mice , Phylogeny , Protein Domains , Transcription Factors/chemistry , Transcription Factors/metabolism , User-Computer Interface
2.
Nucleic Acids Res ; 46(D1): D168-D174, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29077896

ABSTRACT

Cell-specific information on the transcriptional regulation of microRNAs (miRNAs) is crucial to a precise understanding of gene regulation in the various physiological and pathological processes occurring in different tissues and cell types. The database, mirTrans, provides comprehensive information about cell-specific transcription of miRNAs including the transcriptional start sites (TSSs) of miRNAs, transcription factor (TF) to miRNA regulations and miRNA promoter sequences. mirTrans also maps the experimental H3K4me3 and DHS (DNase-I hypersensitive site) marks within miRNA promoters and expressed sequence tags (ESTs) within transcribed regions. The current version of the database covers 35 259 TSSs and over 2.3 million TF-miRNA regulations for 1513 miRNAs in a total of 54 human cell lines. These cell lines span most of the biological systems, including the circulatory, digestive and nervous systems. Information is offered for both intragenic and intergenic miRNAs. In particular, the quality of miRNA TSSs and TF-miRNA regulations is evaluated by literature curation: 23 447 TSS records and 2148 TF-miRNA regulations are supported by special experiments as a result of literature curation. EST coverage is also used to evaluate the accuracy of miRNA TSSs. The mirTrans interface is user-friendly and makes downloads convenient (http://mcube.nju.edu.cn/jwang/lab/soft/mirtrans/ or http://120.27.239.192/mirtrans/).


Subject(s)
Databases, Nucleic Acid , MicroRNAs/genetics , MicroRNAs/metabolism , Cell Line , Expressed Sequence Tags , Gene Expression Regulation , Histone Code , Humans , Promoter Regions, Genetic , Transcription Factors/metabolism , Transcription Initiation Site , User-Computer Interface
3.
BMC Bioinformatics ; 20(Suppl 4): 119, 2019 Apr 18.
Article in English | MEDLINE | ID: mdl-30999858

ABSTRACT

BACKGROUND: The search for molecular biomarkers of early-onset colorectal cancer (CRC) is an important but still quite challenging and unsolved task. Detection of CpG methylation in human DNA obtained from blood or stool has been proposed as a promising approach to a noninvasive early diagnosis of CRC. Thousands of abnormally methylated CpG positions in CRC genomes are often located in non-coding parts of genes. Novel bioinformatic methods are thus urgently needed for multi-omics data analysis to reveal causative biomarkers with a potential driver role in early stages of cancer. METHODS: We have developed a method for finding potential causal relationships between epigenetic changes (DNA methylations) in gene regulatory regions that affect transcription factor binding sites (TFBS) and gene expression changes. This method also considers the topology of the involved signal transduction pathways and searches for positive feedback loops that may cause the carcinogenic aberrations in gene expression. We call this method "Walking pathways", since it searches for potential rewiring mechanisms in cancer pathways due to dynamic changes in the DNA methylation status of important gene regulatory regions ("epigenomic walking"). RESULTS: In this paper, we analysed an extensive collection of full genome gene-expression data (RNA-seq) and DNA methylation data of genomic CpG islands (using Illumina methylation arrays) generated from a sample of tumor and normal gut epithelial tissues of 300 patients with colorectal cancer (at different stages of the disease) (data generated in the EU-supported SysCol project). Identification of potential epigenetic biomarkers of DNA methylation was performed using the fully automatic multi-omics analysis web service "My Genome Enhancer" (MGE) (my-genome-enhancer.com). 
MGE uses the database on gene regulation TRANSFAC®, the signal transduction pathways database TRANSPATH®, and software that employs AI (artificial intelligence) methods for the analysis of cancer-specific enhancers. CONCLUSIONS: The identified biomarkers underwent experimental testing on an independent set of blood samples from patients with colorectal cancer. As a result, using advanced methods of statistics and machine learning, a minimum set of 6 biomarkers was selected, which together achieve the best cancer detection potential. The markers include hypermethylated positions in regulatory regions of the following genes: CALCA, ENO1, MYC, PDX1, TCF7, ZNF43.


Subject(s)
Biomarkers, Tumor/genetics , Colorectal Neoplasms/genetics , DNA Methylation/genetics , Feedback, Physiological , Signal Transduction/genetics , Binding Sites/genetics , Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/pathology , CpG Islands/genetics , Epigenesis, Genetic , Female , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Humans , Male , Middle Aged , Neoplasm Staging , Transcription Factors/metabolism
4.
Circulation ; 135(19): 1832-1847, 2017 May 09.
Article in English | MEDLINE | ID: mdl-28167635

ABSTRACT

BACKGROUND: Advancing structural and functional maturation of stem cell-derived cardiomyocytes remains a key challenge for applications in disease modeling, drug screening, and heart repair. Here, we sought to advance cardiomyocyte maturation in engineered human myocardium (EHM) toward an adult phenotype under defined conditions. METHODS: We systematically investigated cell composition, matrix, and media conditions to generate EHM from embryonic and induced pluripotent stem cell-derived cardiomyocytes and fibroblasts with organotypic functionality under serum-free conditions. We used morphological, functional, and transcriptome analyses to benchmark maturation of EHM. RESULTS: EHM demonstrated important structural and functional properties of postnatal myocardium, including: (1) rod-shaped cardiomyocytes with M bands assembled as a functional syncytium; (2) systolic twitch forces at a similar level as observed in bona fide postnatal myocardium; (3) a positive force-frequency response; (4) inotropic responses to β-adrenergic stimulation mediated via canonical β1- and β2-adrenoceptor signaling pathways; and (5) evidence for advanced molecular maturation by transcriptome profiling. EHM responded to chronic catecholamine toxicity with contractile dysfunction, cardiomyocyte hypertrophy, cardiomyocyte death, and N-terminal pro B-type natriuretic peptide release; all are classical hallmarks of heart failure. In addition, we demonstrate the scalability of EHM according to anticipated clinical demands for cardiac repair. CONCLUSIONS: We provide proof-of-concept for a universally applicable technology for the engineering of macroscale human myocardium for disease modeling and heart repair from embryonic and induced pluripotent stem cell-derived cardiomyocytes under defined, serum-free conditions.


Subject(s)
Embryonic Stem Cells/transplantation , Heart Failure/therapy , Induced Pluripotent Stem Cells/transplantation , Myocytes, Cardiac/transplantation , Tissue Engineering/methods , Ventricular Remodeling/physiology , Animals , Cell Differentiation/physiology , Embryonic Stem Cells/physiology , Heart Failure/pathology , Humans , Induced Pluripotent Stem Cells/physiology , Myocardium/cytology , Myocardium/pathology , Myocytes, Cardiac/physiology , Printing, Three-Dimensional , Rats , Rats, Nude
5.
Bioinformatics ; 32(16): 2403-10, 2016 08 15.
Article in English | MEDLINE | ID: mdl-27153609

ABSTRACT

MOTIVATION: Identification of microRNA (miRNA) transcriptional start sites (TSSs) is crucial to understanding the transcriptional regulation of miRNA. As miRNA expression is highly cell specific, an automatic and systematic method that can identify miRNA TSSs accurately and cell specifically is urgently required. RESULTS: A workflow to identify the TSSs of miRNAs was built by integrating H3K4me3 and DNase I hypersensitive site data and combining them with conservation level and sequence features. By applying the workflow to the data for 54 cell lines from the ENCODE project, we successfully identified TSSs for 663 intragenic miRNAs and 620 intergenic miRNAs, which cover 84.2% (1283/1523) of all miRNAs recorded in miRBase 18. For these cell lines, we found 4042 alternative TSSs for intragenic miRNAs and 3186 alternative TSSs for intergenic miRNAs. Our method achieved better performance than the previous non-cell-specific methods on miRNA TSSs. The cell-specific method developed by Georgakilas et al. gives 158 TSSs of higher accuracy in two cell lines, benefiting from the use of deep sequencing. In contrast, our method provided a much higher number of miRNA TSSs (7228) for a broader range of cell lines without the limitation of costly deep-sequencing data, thus being more applicable to various experimental cases. Analysis showed that upstream promoters at -2 kb to -200 bp of the TSS are more conserved for independently transcribed miRNAs, while for miRNAs transcribed with host genes, their core promoters (-200 bp to 200 bp of the TSS) are significantly conserved. AVAILABILITY AND IMPLEMENTATION: Predicted miRNA TSSs and promoters can be downloaded from supplementary files. CONTACT: jwang@nju.edu.cn or jlee@nju.edu.cn or edgar.wingender@bioinf.med.uni-goettingen.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
MicroRNAs , Transcription Initiation Site , Gene Expression Regulation , High-Throughput Nucleotide Sequencing , Promoter Regions, Genetic
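
The counts quoted in the abstract of record 5 can be cross-checked with a few lines of arithmetic. The Python sketch below only re-derives figures stated in that abstract. The variable names and the `promoter_region` helper, including how it treats the boundary at -200 bp, are illustrative assumptions and not part of the published workflow.

```python
# Figures quoted in the abstract of record 5.
intragenic_mirnas = 663
intergenic_mirnas = 620
mirbase18_total = 1523
alt_tss_intragenic = 4042
alt_tss_intergenic = 3186

# miRNAs with at least one identified TSS, and their share of miRBase 18.
mirnas_with_tss = intragenic_mirnas + intergenic_mirnas
coverage_percent = round(100 * mirnas_with_tss / mirbase18_total, 1)

# Total number of TSSs reported across the 54 cell lines.
total_tss = alt_tss_intragenic + alt_tss_intergenic


def promoter_region(offset_bp):
    """Label a position relative to a TSS (negative = upstream).

    Assumption: the boundary at -200 bp is assigned to the core promoter;
    the abstract does not say which region owns it.
    """
    if -2000 <= offset_bp < -200:
        return "upstream promoter"
    if -200 <= offset_bp <= 200:
        return "core promoter"
    return "outside"


print(mirnas_with_tss, coverage_percent, total_tss)
```

The three derived values agree with the 1283, 84.2% and 7228 reported in the abstract.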
6.
Nucleic Acids Res ; 43(Database issue): D97-102, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25361979

ABSTRACT

TFClass aims at classifying eukaryotic transcription factors (TFs) according to their DNA-binding domains (DBDs). For this, a classification schema comprising four generic levels (superclass, class, family and subfamily) was defined that could accommodate all known DNA-binding human TFs. They were assigned to their (sub-)families as instances at two different levels, the corresponding TF genes and individual gene products (protein isoforms). In the present version, all mouse and rat orthologs have been linked to the human TFs, and the mouse orthologs have been arranged in an independent ontology. Many TFs were assigned typical DNA-binding patterns and positional weight matrices (PWMs) derived from high-throughput in vitro binding studies. Predicted TF binding sites from human gene upstream sequences are now also attached to each human TF whenever a PWM was available for this factor or one of its paralogs. TFClass is freely available at http://tfclass.bioinf.med.uni-goettingen.de/ through a web interface and for download in OBO format.


Subject(s)
Databases, Protein , Transcription Factors/classification , Animals , Binding Sites , DNA/metabolism , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , Humans , Internet , Mice , Protein Structure, Tertiary , Rats , Transcription Factors/chemistry , Transcription Factors/metabolism
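
The level structure described in the abstract of record 6 (four generic levels plus TF genes and gene products) can be pictured as a path through a hierarchy. The Python sketch below assumes, for illustration only, that an entry is addressed by a dot-separated numeric identifier with one component per level. The identifier format, the `describe` function and the example value are hypothetical and are not taken from the TFClass download.

```python
# Level names: the first four are named in the abstract; the last two
# correspond to the "TF genes" and "individual gene products" instances.
LEVELS = ("superclass", "class", "family", "subfamily", "gene", "gene product")


def describe(identifier):
    """Map each prefix of a dotted identifier to its classification level.

    Assumption: identifiers are dot-separated integers, one per level.
    """
    parts = identifier.split(".")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"expected 1 to {len(LEVELS)} components: {identifier!r}")
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"components must be numeric: {identifier!r}")
    return {
        LEVELS[depth]: ".".join(parts[: depth + 1])
        for depth in range(len(parts))
    }


# Hypothetical identifier, used only to show the shape of the result.
for level, prefix in describe("1.2.3.4.5").items():
    print(f"{level:>12}: {prefix}")
```

Keeping every prefix makes it easy to group entries at any level, for example to collect all genes that share a family.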
7.
BMC Bioinformatics ; 16: 200, 2015 Jun 26.
Article in English | MEDLINE | ID: mdl-26108437

ABSTRACT

BACKGROUND: Exploratory analysis of multi-dimensional high-throughput datasets, such as microarray gene expression time series, may be instrumental in understanding the genetic programs underlying numerous biological processes. In such datasets, variations in the gene expression profiles are usually observed across replicates and time points. Thus mining the temporal expression patterns in such multi-dimensional datasets may not only provide insights into the key biological processes governing organ growth and development but also facilitate the understanding of the underlying complex gene regulatory circuits. RESULTS: In this work we have developed an evolutionary multi-objective optimization for our previously introduced triclustering algorithm δ-TRIMAX. Its aim is to make optimal use of δ-TRIMAX in extracting groups of co-expressed genes from time series gene expression data, or from any 3D gene expression dataset, by adding the powerful capabilities of an evolutionary algorithm to retrieve overlapping triclusters. We have compared the performance of our newly developed algorithm, EMOA-δ-TRIMAX, with that of other existing triclustering approaches using four artificial datasets and three real-life datasets. Moreover, we have analyzed the results of our algorithm on one of these real-life datasets monitoring the differentiation of human induced pluripotent stem cells (hiPSC) into mature cardiomyocytes. For each group of co-expressed genes belonging to one tricluster, we identified key genes by computing their membership values within the tricluster. It turned out that a very high percentage of these key genes were significantly enriched in Gene Ontology categories or KEGG pathways that fitted the biological context of cardiomyocyte differentiation very well. CONCLUSIONS: EMOA-δ-TRIMAX has proven instrumental in identifying groups of genes in transcriptomic data sets that represent the functional categories constituting the biological process under study. The executable file can be found at http://www.bioinf.med.uni-goettingen.de/fileadmin/download/EMOA-delta-TRIMAX.tar.gz.


Subject(s)
Algorithms , Biomarkers/analysis , Cell Differentiation/genetics , Gene Expression Profiling/methods , Induced Pluripotent Stem Cells/metabolism , Myocytes, Cardiac/metabolism , Transcriptome/genetics , Biological Phenomena , Cluster Analysis , Datasets as Topic , Gene Regulatory Networks , Humans , Induced Pluripotent Stem Cells/cytology , Myocytes, Cardiac/cytology , Oligonucleotide Array Sequence Analysis/methods , Time Factors
8.
BMC Bioinformatics ; 16: 400, 2015 Dec 01.
Article in English | MEDLINE | ID: mdl-26627005

ABSTRACT

BACKGROUND: Transcription factors (TFs) are important regulatory proteins that govern transcriptional regulation. Today, it is known that in higher organisms different TFs have to cooperate rather than act individually in order to control complex genetic programs. The identification of these interactions is an important challenge for understanding the molecular mechanisms of regulating biological processes. In this study, we present a new method based on pointwise mutual information, PC-TraFF, which considers the genome as a document, the sequences as sentences, and TF binding sites (TFBSs) as words to identify interacting TFs in a set of sequences. RESULTS: To demonstrate the effectiveness of PC-TraFF, we performed a genome-wide analysis and a breast cancer-associated sequence set analysis for protein coding and miRNA genes. Our results show that in any of these sequence sets, PC-TraFF is able to identify important interacting TF pairs, for most of which we found support by previously published experimental results. Further, we made a pairwise comparison between PC-TraFF and three conventional methods. The outcome of this comparison study strongly suggests that all these methods focus on different important aspects of interaction between TFs and thus the pairwise overlap between any of them is only marginal. CONCLUSIONS: In this study, adopting an idea from linguistics for bioinformatics, we develop a new information theoretic method, PC-TraFF, for the identification of potentially collaborating transcription factors based on the idiosyncrasy of their binding site distributions on the genome. The results of our study show that PC-TraFF can successfully identify known interacting TF pairs and thus its currently biologically unconfirmed predictions could provide new hypotheses for further experimental validation. Additionally, the comparison of the results of PC-TraFF with the results of previous methods demonstrates that different methods with their specific scopes can perfectly supplement each other. Overall, our analyses indicate that PC-TraFF is a time-efficient method whose algorithm has tractable computational time and memory consumption. The PC-TraFF server is freely accessible at http://pctraff.bioinf.med.uni-goettingen.de/.


Subject(s)
Algorithms , Breast Neoplasms/metabolism , Computational Biology/methods , Gene Expression Regulation, Neoplastic , Genome, Human , Transcription Factors/metabolism , Binding Sites , Breast Neoplasms/classification , Breast Neoplasms/genetics , Female , Humans , MicroRNAs/genetics , Promoter Regions, Genetic/genetics , Protein Binding
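
The document/sentence/word analogy in the abstract of record 8 can be illustrated with the textbook definition of pointwise mutual information, PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ). The Python sketch below is a minimal illustration that estimates the probabilities from how many sequences contain each TF or TF pair. It is not the PC-TraFF scoring scheme, which the abstract does not specify, and the function name and TF labels are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_pmi(sequences):
    """Return PMI for every TF pair that co-occurs in at least one sequence.

    `sequences` is a list of "sentences"; each is an iterable of TF names
    ("words") whose binding sites were found in that sequence.
    """
    total = len(sequences)
    single_counts = Counter()
    pair_counts = Counter()
    for tfs in sequences:
        present = sorted(set(tfs))  # count each TF once per sequence
        single_counts.update(present)
        pair_counts.update(combinations(present, 2))

    scores = {}
    for (tf_a, tf_b), joint in pair_counts.items():
        p_joint = joint / total
        p_a = single_counts[tf_a] / total
        p_b = single_counts[tf_b] / total
        scores[(tf_a, tf_b)] = math.log2(p_joint / (p_a * p_b))
    return scores


# Toy input with hypothetical TF labels.
toy = [["TF_A", "TF_B"], ["TF_A", "TF_B"], ["TF_A"], ["TF_C"]]
print(pairwise_pmi(toy))
```

A positive score means the pair shares sequences more often than independence would predict; pairs that never co-occur are simply absent from the result.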
9.
Nucleic Acids Res ; 41(Database issue): D165-70, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23180794

ABSTRACT

TFClass (http://tfclass.bioinf.med.uni-goettingen.de/) provides a comprehensive classification of human transcription factors based on their DNA-binding domains. Transcription factors constitute a large functional family of proteins directly regulating the activity of genes. Most of them are sequence-specific DNA-binding proteins, thus reading out the information encoded in cis-regulatory DNA elements of promoters, enhancers and other regulatory regions of a genome. TFClass is a database that classifies human transcription factors by a six-level classification schema, four of which are abstractions according to different criteria, while the fifth level represents TF genes and the sixth individual gene products. Altogether, nine superclasses have been identified, comprising 40 classes and 111 families. Counted by genes, 1558 human TFs have been classified so far or >2900 different TFs when including their isoforms generated by alternative splicing or protein processing events. With this classification, we hope to provide a basis for deciphering protein-DNA recognition codes; moreover, it can be used for constructing expanded transcriptional networks by inferring additional TF-target gene relations.


Subject(s)
Databases, Protein , Transcription Factors/classification , DNA-Binding Proteins/chemistry , Humans , Internet , Protein Structure, Tertiary , Sequence Alignment , Sequence Analysis, Protein , Transcription Factors/chemistry
10.
BMC Bioinformatics ; 15: 96, 2014 Apr 03.
Article in English | MEDLINE | ID: mdl-24694117

ABSTRACT

BACKGROUND: The identification of functionally or structurally important non-conserved residue sites in protein MSAs is an important challenge for understanding the structural basis and molecular mechanism of protein functions. Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of those important residues, previous methods often rely on classical information-theoretic measures. However, these measures usually do not take into account dis/similarities of amino acids which are likely to be crucial for those residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF) that incorporates significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites. RESULTS: The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF-method. The QCMF includes two metrics based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory mutations. We found that the QCMF reaches an improved performance in identifying essential sites from MSAs of both proteins with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods for the identification of correlated mutations in MSAs. CONCLUSIONS: QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally important sites. 
Our results suggest that on the one hand QCMF significantly outperforms the previous method, which mainly focuses on dissimilar amino acid signals, to detect essential sites in proteins. On the other hand, it is complementary to the existing methods for the identification of correlated mutations. The method of QCMF is computationally intensive. To ensure a feasible computation time of the QCMF's algorithm, we leveraged Compute Unified Device Architecture (CUDA). The QCMF server is freely accessible at http://qcmf.informatik.uni-goettingen.de/.


Subject(s)
Algorithms , Mutation , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Amino Acids/chemistry , Conserved Sequence , ErbB Receptors/chemistry , ErbB Receptors/genetics , Glucokinase/chemistry , Glucokinase/genetics , Humans , Protein Conformation , Quantum Theory , Sequence Alignment
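
The abstract of record 10 compares methods by their Matthews correlation coefficient (MCC). For readers unfamiliar with the measure, the Python sketch below computes it from the four confusion-matrix counts using the standard definition. The function name and the convention of returning 0.0 when the denominator is zero are choices made for this sketch, and the example counts are invented, not results from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from true/false positive and negative counts.

    Convention assumed here: return 0.0 when any marginal sum is zero,
    since the coefficient is otherwise undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts for illustration only.
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))  # perfect prediction
print(matthews_corrcoef(tp=0, tn=0, fp=5, fn=5))  # fully inverted prediction
```

MCC ranges from -1 to 1 and remains informative when essential sites are a small minority of alignment columns, which is one reason it is often preferred to plain accuracy for this kind of task.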
11.
PLoS Comput Biol ; 9(3): e1002958, 2013.
Article in English | MEDLINE | ID: mdl-23555204

ABSTRACT

Algorithmic comparison of DNA sequence motifs is a problem in bioinformatics that has received increased attention in recent years. Its main applications concern characterization of potentially novel motifs and clustering of a motif collection in order to remove redundancy. Despite growing interest in motif clustering, the question of which motif clusters to aim for has so far not been systematically addressed. Here we analyzed motif similarities in a comprehensive set of vertebrate transcription factor classes. For this we developed enhanced similarity scores by inclusion of the information coverage (IC) criterion, which evaluates the fraction of information an alignment covers in aligned motifs. A network-based method enabled us to identify motif clusters with high correspondence to DNA-binding domain phylogenies and prior experimental findings. Based on this analysis we derived a set of motif families representing distinct binding specificities. These motif families were used to train a classifier which was further integrated into a novel algorithm for unsupervised motif clustering. Application of the new algorithm demonstrated its superiority to previously published methods and its ability to reproduce entrained motif families. As a result, our work proposes a probabilistic approach to decide whether two motifs represent common or distinct binding specificities.


Subject(s)
Computational Biology/methods , Nucleotide Motifs , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Cluster Analysis , DNA/genetics , DNA/metabolism , Databases, Genetic , Gene Regulatory Networks , Logistic Models , Phylogeny , Transcription Factors/genetics , Transcription Factors/metabolism
12.
BMC Bioinformatics ; 14: 241, 2013 Aug 08.
Article in English | MEDLINE | ID: mdl-23924163

ABSTRACT

BACKGROUND: Accurate recognition of regulatory elements in promoters is an essential prerequisite for understanding the mechanisms of gene regulation at the level of transcription. Composite regulatory elements represent a particular type of such transcriptional regulatory elements consisting of pairs of individual DNA motifs. In contrast to the present approach, most available recognition techniques are based purely on statistical evaluation of the occurrence of single motifs. Such methods are limited in application, since the accuracy of recognition is greatly dependent on the size and quality of the sequence dataset. Methods that exploit available knowledge and have broad applicability are evidently needed. RESULTS: We developed a novel method to identify composite regulatory elements in promoters using a library of known examples. In-depth investigation of regularities encoded in known composite elements allowed us to introduce a new characteristic measure and to improve the specificity compared with other methods. Tests on an established benchmark and real genomic data show that our method outperforms other available methods based either on known examples or statistical evaluations. In addition to better recognition, this method has two practical advantages: first, the ability to detect a high number of different types of composite elements, and second, direct biological interpretation of the identified results. The program is available at http://gnaweb.helmholtz-hzi.de/cgi-bin/MCatch/MatrixCatch.pl and includes an option to extend the provided library with user-supplied data. CONCLUSIONS: The novel algorithm for the identification of composite regulatory elements presented in this paper proved superior to existing methods. Its application to tissue-specific promoters identified several highly specific composite elements with relevance to their biological function. 
This approach together with other methods will further advance the understanding of transcriptional regulation of genes.


Subject(s)
Computational Biology , Promoter Regions, Genetic , Regulatory Elements, Transcriptional , Regulatory Sequences, Nucleic Acid , Algorithms , Computational Biology/instrumentation , Computational Biology/methods , Gene Expression Regulation , Genomics/instrumentation , Genomics/methods , Nucleotide Motifs
13.
Bioinformatics ; 28(18): i509-i514, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22962474

ABSTRACT

SUMMARY: The great variety of human cell types in morphology and function is due to the diverse gene expression profiles that are governed by the distinctive regulatory networks in different cell types. It is still a challenging task to explain how the regulatory networks achieve the diversity of different cell types. Here, we report on our studies of the design principles of the tissue regulatory system by constructing the regulatory networks of eight human tissues, which subsume the regulatory interactions between transcription factors (TFs), microRNAs (miRNAs) and non-TF target genes. The results show that there are in-/out-hubs of high in-/out-degrees in tissue networks. Some hubs (strong hubs) maintain the hub status in all the tissues where they are expressed, whereas others (weak hubs), in spite of their ubiquitous expression, are hubs only in some tissues. The network motifs are mostly feed-forward loops. Some of them having no miRNAs are the common motifs shared by all tissues, whereas the others containing miRNAs are the tissue-specific ones owned by one or several tissues, indicating that the transcriptional regulation is more conserved across tissues than the post-transcriptional regulation. In particular, a common bow-tie framework was found that underlies the motif instances and shows diverse patterns in different tissues. Such bow-tie framework reflects the utilization efficiency of the regulatory system as well as its high variability in different tissues, and could serve as the model to further understand the structural adaptation of the regulatory system to the specific requirements of different cell functions. CONTACT: edgar.wingender@bioinf.med.uni-goettingen.de; jwang@nju.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Regulatory Networks , Data Interpretation, Statistical , Gene Expression Regulation , Humans , MicroRNAs/metabolism , Transcription Factors/metabolism
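
The notions of in-/out-hubs and feed-forward loops used in the abstract of record 13 can be made concrete on a small directed graph. The Python sketch below assumes the network is given as a list of (regulator, target) edges. The function names and the toy node labels are hypothetical, and the paper's own thresholds for calling a node a hub are not given in the abstract, so none is applied here.

```python
from collections import defaultdict


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for regulator, target in edges:
        out_degree[regulator] += 1
        in_degree[target] += 1
    return dict(in_degree), dict(out_degree)


def feed_forward_loops(edges):
    """Find triples (a, b, c) with edges a->b, b->c and a->c."""
    targets = defaultdict(set)
    for regulator, target in edges:
        targets[regulator].add(target)

    loops = set()
    for a, a_targets in targets.items():
        for b in a_targets:
            if b == a:
                continue
            # .get avoids adding keys to the dict while iterating over it
            for c in targets.get(b, ()):
                if c not in (a, b) and c in a_targets:
                    loops.add((a, b, c))
    return sorted(loops)


# Toy network with hypothetical node labels.
toy_edges = [
    ("TF1", "miR1"),
    ("miR1", "GeneX"),
    ("TF1", "GeneX"),
    ("TF1", "GeneY"),
]
print(degrees(toy_edges))
print(feed_forward_loops(toy_edges))
```

In the toy network TF1 has the highest out-degree and GeneX the highest in-degree, and the single feed-forward loop passes through a miRNA, the kind of motif the abstract describes as tissue-specific.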
14.
Cancers (Basel) ; 14(9), 2022 Apr 21.
Article in English | MEDLINE | ID: mdl-35565214

ABSTRACT

Seventy percent of patients with colorectal cancer develop liver metastases (CRLM), which are a decisive factor in cancer progression. Therapy outcome is largely influenced by tumor heterogeneity, but the intra- and inter-patient heterogeneity of CRLM has been poorly studied. In particular, the contribution of the WNT and EGFR pathways, which are both frequently deregulated in colorectal cancer, has not yet been addressed in this context. To this end, we comprehensively characterized normal liver tissue and eight CRLM from two patients by standardized histopathological, molecular, and proteomic subtyping. Suitable fresh-frozen tissue samples were profiled by transcriptome sequencing (RNA-Seq) and proteomic profiling with reverse phase protein arrays (RPPA) combined with bioinformatic analyses to assess tumor heterogeneity and identify WNT- and EGFR-related master regulators and metastatic effectors. A standardized data analysis pipeline for integrating RNA-Seq with clinical, proteomic, and genetic data was established. Dimensionality reduction of the transcriptome data revealed a distinct signature for CRLM differing from normal liver tissue and indicated a high degree of tumor heterogeneity. WNT and EGFR signaling were highly active in CRLM and the genes of both pathways were heterogeneously expressed between the two patients as well as between the synchronous metastases of a single patient. An analysis of the master regulators and metastatic effectors implicated in the regulation of these genes revealed a set of four genes (SFN, IGF2BP1, STAT1, PIK3CG) that were differentially expressed in CRLM and were associated with clinical outcome in a large cohort of colorectal cancer patients as well as CRLM samples. 
In conclusion, high-throughput profiling enabled us to define a CRLM-specific signature and revealed the genes of the WNT and EGFR pathways associated with inter- and intra-patient heterogeneity, which were validated as prognostic biomarkers in CRC primary tumors as well as liver metastases.

15.
PLoS One ; 16(10): e0258623, 2021.
Article in English | MEDLINE | ID: mdl-34653224

ABSTRACT

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth of the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted in the substitution of synonymous terms by their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (CNNs) on large breast cancer gene expression data and on other cancer datasets. Performances of resulting models were compared to Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks.
Such performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
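
The abstract above relates cosine similarity between word vectors to known biological relations and derives gene-gene networks from the embeddings. The following is only a minimal sketch of that general idea, not the authors' pipeline: the three-dimensional vectors, the gene labels used as dictionary keys, the threshold value, and the function names `cosine_similarity` and `similarity_network` are all hypothetical and chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_network(embeddings, threshold):
    """Return (term_a, term_b, similarity) for every pair at or above the threshold."""
    terms = sorted(embeddings)
    edges = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            s = cosine_similarity(embeddings[a], embeddings[b])
            if s >= threshold:
                edges.append((a, b, s))
    return edges


# Toy vectors, invented for illustration; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "TP53": [1.0, 0.0, 0.0],
    "MDM2": [0.9, 0.1, 0.0],
    "INS": [0.0, 0.0, 1.0],
}

# Threshold of 0.8 is an arbitrary assumption, not a value from the abstract.
edges = similarity_network(toy_embeddings, threshold=0.8)
for a, b, s in edges:
    print(f"{a} - {b}: {s:.3f}")
```

In this sketch an edge is kept only when the similarity reaches the chosen threshold; how the published networks were actually thresholded is not stated in the abstract.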


Subject(s)
Breast Neoplasms/genetics , Computational Biology/methods , Data Mining/methods , Algorithms , Breast Neoplasms/metabolism , Databases, Factual , Female , Gene Expression Regulation, Neoplastic , Humans , Machine Learning , Neural Networks, Computer , Protein Interaction Maps , Terminology as Topic
16.
Front Genet ; 12: 670240, 2021.
Article in English | MEDLINE | ID: mdl-34211498

ABSTRACT

Only 2% of glioblastoma multiforme (GBM) patients respond to standard therapy and survive beyond 36 months (long-term survivors, LTS), while the majority survive less than 12 months (short-term survivors, STS). To understand the mechanism leading to poor survival, we analyzed publicly available datasets of 113 STS and 58 LTS. This analysis revealed 198 differentially expressed genes (DEGs) that characterize aggressive tumor growth and may be responsible for the poor prognosis. These genes belong largely to the Gene Ontology (GO) categories "epithelial-to-mesenchymal transition" and "response to hypoxia." In this article, we applied an upstream analysis approach that involves state-of-the-art promoter analysis and network analysis of the dysregulated genes potentially responsible for short survival in GBM. Binding sites for transcription factors (TFs) associated with GBM pathology like NANOG, NF-κB, REST, FRA-1, PPARG, and seven others were found enriched in the promoters of the dysregulated genes. We reconstructed the gene regulatory network with several positive feedback loops controlled by five master regulators [insulin-like growth factor binding protein 2 (IGFBP2), vascular endothelial growth factor A (VEGFA), VEGF165, platelet-derived growth factor A (PDGFA), adipocyte enhancer-binding protein (AEBP1), and oncostatin M (OSMR)], which can be proposed as biomarkers and as therapeutic targets for enhancing GBM prognosis. A critical analysis of this gene regulatory network gives insights into the mechanism of gene regulation by IGFBP2 via several TFs including the key molecule of GBM tumor invasiveness and progression, FRA-1. All the observations were validated in independent cohorts, and their impact on overall survival has been investigated.

17.
NPJ Syst Biol Appl ; 7(1): 38, 2021 10 20.
Article in English | MEDLINE | ID: mdl-34671039

ABSTRACT

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new, Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

18.
Brief Bioinform ; 9(4): 326-32, 2008 Jul.
Article in English | MEDLINE | ID: mdl-18436575

ABSTRACT

Since its beginning as a data collection more than 20 years ago, the TRANSFAC project has evolved into the basis for a complex platform for the description and analysis of gene regulatory events and networks. In the following, I describe what the original concepts were, what their present status is and how they may be expected to contribute to future systems biology approaches.


Subject(s)
Chromosome Mapping/methods , Gene Expression Regulation/physiology , Models, Biological , Proteome/metabolism , Signal Transduction/physiology , Systems Biology/methods , Transcription Factors/metabolism , Biotechnology/methods , Computer Simulation , Systems Integration
19.
Brief Bioinform ; 9(6): 518-31, 2008 Nov.
Article in English | MEDLINE | ID: mdl-19073714

ABSTRACT

Translating the exponentially growing amount of omics data into knowledge usable for a personalized medicine approach poses a formidable challenge. In this article, taking diabetes as a use case, we present strategies for developing data repositories into computer-accessible knowledge sources that can be used for a systemic view on the molecular causes of diseases, thus laying the foundation for systems pathology.


Subject(s)
Databases, Factual , Information Storage and Retrieval , Knowledge Bases , Database Management Systems , Diabetes Mellitus/genetics , Diabetes Mellitus/pathology , Diabetes Mellitus/physiopathology , Gene Regulatory Networks , Humans , Information Systems , Semantics , Signal Transduction/physiology , User-Computer Interface
20.
Nucleic Acids Res ; 36(Database issue): D689-94, 2008 Jan.
Article in English | MEDLINE | ID: mdl-18045786

ABSTRACT

EndoNet is an information resource about intercellular regulatory communication. It provides information about hormones, hormone receptors, the sources (i.e. cells, tissues and organs) where the hormones are synthesized and secreted, and where the respective receptors are expressed. The database focuses on the regulatory relations between them. An elementary communication is displayed as a causal link from a cell that secretes a particular hormone to those cells which express the corresponding hormone receptor and respond to the hormone. Whenever expression, synthesis and/or secretion of another hormone is part of this response, the corresponding cell becomes an internal node of the resulting network. This intercellular communication network coordinates the function of different organs. Therefore, the database covers the hierarchy of cellular organization of tissues and organs as it has been modeled in the Cytomer ontology, which has now been directly embedded into EndoNet. The user can query the database; the results can be used to visualize the intercellular information flow. A newly implemented hormone classification enables browsing of the database and may be used as an alternative entry point. EndoNet is accessible at: http://endonet.bioinf.med.uni-goettingen.de/.
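
The network model described in the abstract above (a causal link from a hormone-secreting cell to each cell expressing the matching receptor) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not EndoNet's data model or API: the dictionaries, the placeholder names such as `cell_A` and `hormone_1`, and the functions `build_links` and `internal_nodes` are hypothetical. As a simplification, a cell is treated as an internal node when it both receives and sends a link, whereas the abstract ties this to the secretion being part of the cell's response.

```python
def build_links(secretes, receptor_for, expresses):
    """Return (source_cell, hormone, target_cell) links.

    secretes:     cell -> set of hormones it secretes
    receptor_for: receptor -> hormone it binds
    expresses:    cell -> set of receptors it expresses
    """
    links = set()
    for source, hormones in secretes.items():
        for target, receptors in expresses.items():
            for receptor in receptors:
                hormone = receptor_for.get(receptor)
                if hormone in hormones:
                    links.add((source, hormone, target))
    return links


def internal_nodes(links):
    """Cells that appear both as a link target and as a link source (simplified criterion)."""
    sources = {source for source, _, _ in links}
    targets = {target for _, _, target in links}
    return sources & targets


# Placeholder data, invented for illustration only.
secretes = {"cell_A": {"hormone_1"}, "cell_B": {"hormone_2"}}
receptor_for = {"receptor_1": "hormone_1", "receptor_2": "hormone_2"}
expresses = {"cell_B": {"receptor_1"}, "cell_C": {"receptor_2"}}

links = build_links(secretes, receptor_for, expresses)
for source, hormone, target in sorted(links):
    print(f"{source} --{hormone}--> {target}")
print("internal nodes:", sorted(internal_nodes(links)))
```

Representing each link as a triple keeps the hormone attached to the edge, which is what allows the information flow between cells to be traced hormone by hormone.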


Subject(s)
Cell Communication , Databases, Factual , Hormones/metabolism , Computer Graphics , Hormones/classification , Internet , Receptors, Cell Surface/metabolism , Receptors, Cytoplasmic and Nuclear/metabolism , User-Computer Interface