1.
Article in English | MEDLINE | ID: mdl-28894735

ABSTRACT

In this study, we developed and evaluated a novel text-mining approach, using non-negative tensor factorization (NTF), to simultaneously extract and functionally annotate transcriptional modules consisting of sets of genes, transcription factors (TFs), and terms from MEDLINE abstracts. A sparse 3-mode term × gene × TF tensor was constructed that contained weighted frequencies of 106,895 terms in 26,781 abstracts shared among 7,695 genes and 994 TFs. The tensor was decomposed into sub-tensors using NTF across 16 different approximation ranks. Dominant entries of each of 2,861 sub-tensors were extracted to form term-gene-TF annotated transcriptional modules (ATMs). More than 94% of the ATMs were found to be enriched in at least one KEGG pathway or GO category, suggesting that the ATMs are functionally relevant. One advantage of this method is that it can discover potentially new gene-TF associations from the literature. Using a set of microarray and ChIP-Seq datasets as a gold standard, we show that the precision of our method for predicting gene-TF associations is significantly higher than chance. In addition, we demonstrate that the terms in each ATM can be used to suggest new GO classifications for genes and TFs. Taken together, our results indicate that NTF is useful for simultaneous extraction and functional annotation of transcriptional regulatory networks from unstructured text, as well as for literature-based discovery. A web tool called Transcriptional Regulatory Modules Extracted from Literature (TREMEL), available at http://binf1.memphis.edu/tremel, was built to enable browsing and searching of ATMs.
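
The module-extraction step in this abstract can be illustrated with a small sketch. It assumes that each rank-1 sub-tensor is described by three non-negative factor vectors (terms, genes, TFs) and that an entry counts as "dominant" when its weight is at least a fixed fraction of the largest weight in its vector. The abstract does not state the actual selection rule, so `dominant_entries`, `annotated_module`, the `fraction` parameter, and the toy labels are all hypothetical.

```python
def dominant_entries(weights, labels, fraction=0.5):
    """Return labels whose weight is at least `fraction` of the largest weight.

    The cutoff rule is an illustrative assumption, not the rule used in the study.
    """
    if len(weights) != len(labels):
        raise ValueError("weights and labels must have the same length")
    top = max(weights, default=0.0)
    if top <= 0.0:
        return []
    return [lab for w, lab in zip(weights, labels) if w >= fraction * top]


def annotated_module(term_factor, gene_factor, tf_factor, terms, genes, tfs, fraction=0.5):
    """Build one term-gene-TF module from the factor vectors of a rank-1 sub-tensor."""
    return {
        "terms": dominant_entries(term_factor, terms, fraction),
        "genes": dominant_entries(gene_factor, genes, fraction),
        "tfs": dominant_entries(tf_factor, tfs, fraction),
    }


# Toy factor vectors for a single component (made-up numbers and labels).
module = annotated_module(
    term_factor=[0.9, 0.1, 0.6],
    gene_factor=[0.05, 0.8],
    tf_factor=[0.7, 0.2, 0.0],
    terms=["interferon", "kinase", "antiviral"],
    genes=["GeneA", "GeneB"],
    tfs=["TF1", "TF2", "TF3"],
)
print(module)
```

A relative cutoff is used here only because it is easy to read; a top-k rule or a statistical threshold would fit the same interface.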

2.
Int J Comput Biol Drug Des ; 7(2-3): 113-29, 2014.
Article in English | MEDLINE | ID: mdl-24878724

ABSTRACT

Using a unique combination of visual, statistical, and data mining methods, we tested the hypothesis that an immune cell's movement pattern can convey key information about the cell's function, antigen specificity, and environment. We applied clustering, statistical tests, and a support vector machine (SVM) to assess our ability to classify different datasets of imaged fluorescently labelled T cells in mouse liver. We found that the movement patterns of T cells specific and non-specific for malaria parasites are differentiable with 72% accuracy, and that specific cells have a higher tendency to move towards the parasite than non-specific cells. Movements of antigen-specific T cells in uninfected mice vs. infected mice were differentiable with 69.8% accuracy. We additionally saw clusters of different movement patterns of T cells of identical antigenic specificity. We concluded that our combination of methods has the potential to advance the understanding of cell movements in vivo.


Subject(s)
T-Lymphocytes/physiology; Animals; Cell Movement; Malaria/immunology; Mice
3.
Int J Comput Biol Drug Des ; 6(1-2): 93-106, 2013.
Article in English | MEDLINE | ID: mdl-23428476

ABSTRACT

Software tools for the flexible examination of genomic sequence information derived from populations of organisms in a geospatial context are few in number, closely tied to Web-based resources, generally focused on one or a few loci or haplotypes, and typically produce a global phylogeny as a summary of relatedness. We sought instead to produce a portable, self-contained analysis tool that is efficiently focused on a geospatial display of specifically chosen polymorphism frequencies or combination frequencies from very large data sets of genome-scale sequence from multiple individuals. PolyLens is a Java-based, integral visual analytical toolkit which can systematically process population genomic data, visualise geographic distributions of genealogical lineages, and display allele distribution patterns. PolyLens is designed for users to visualise specific DNA sequences within each individual and its related location information in the existing data set.


Subject(s)
Databases, Genetic; Genomics/methods; Models, Genetic; Polymorphism, Genetic; Animals; Base Sequence; Drosophila melanogaster/genetics; Genome, Insect; Geographic Mapping; Haplotypes; Image Processing, Computer-Assisted; Internet; Molecular Sequence Data; Phylogeny; Sequence Analysis, DNA; Software
4.
BMC Bioinformatics ; 12 Suppl 10: S19, 2011 Oct 18.
Article in English | MEDLINE | ID: mdl-22165960

ABSTRACT

BACKGROUND: Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for presence of DNA binding motifs in the promoter regions of co-regulated genes. However, this strategy may not always be useful as presence of a motif does not necessarily imply a regulatory role. Conversely, motif presence may not be required for a TF to regulate a set of genes. Therefore, it is imperative to include functional (biochemical and molecular) associations, such as those found in the biomedical literature, into algorithms for identification of putative regulatory TFs that might be explicitly or implicitly linked to the genes under investigation. RESULTS: In this study, we present a Latent Semantic Indexing (LSI) based text mining approach for identification and ranking of putative regulatory TFs from microarray derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to devise pair-wise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Amongst these genes, 433 were designated TFs in the TRANSFAC database. The LSI derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and rank the TFs for a given set of genes. We evaluated our approach using five different publicly available microarray datasets focusing on TFs Rel, Stat6, Ddit3, Stat5 and Nfic. In addition, for each of the datasets, we constructed gold standard TFs known to be functionally relevant to the study in question. Receiver Operating Characteristics (ROC) curves showed that the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence based method for four out of five datasets, as well as motif searching approaches, in identifying putative TFs. 
CONCLUSIONS: Our results suggest that our LSI based text mining approach can complement existing approaches used in systems biology research to decipher gene regulatory networks by providing putative lists of ranked TFs that might be explicitly or implicitly associated with sets of DEGs derived from microarray experiments. In addition, unlike motif searching approaches, LSI based approaches can reveal TFs that may indirectly regulate genes.
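
The log-entropy term weighting named in this abstract can be sketched as follows. This uses one common textbook formulation (local weight `log(1 + tf)`, global weight one plus the normalized entropy term); the abstract does not give the exact variant the authors used, so the formula choice and the function name `log_entropy_weights` are assumptions.

```python
import math


def log_entropy_weights(counts):
    """Weight a term-by-document count matrix with a log-entropy scheme.

    counts[i][j] is the raw frequency of term i in document j.
    Local weight:  log(1 + tf_ij)
    Global weight: 1 + sum_j(p_ij * log(p_ij)) / log(n_docs),
                   where p_ij = tf_ij / sum_j tf_ij
    """
    n_docs = len(counts[0])
    if n_docs < 2:
        raise ValueError("need at least two documents")
    weighted = []
    for row in counts:
        total = sum(row)
        entropy_sum = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy_sum += p * math.log(p)
        # Terms spread evenly over all documents get a global weight near 0;
        # terms concentrated in one document get a global weight of 1.
        global_weight = 1.0 + entropy_sum / math.log(n_docs)
        weighted.append([global_weight * math.log(1 + tf) for tf in row])
    return weighted


# Toy matrix: term 0 is spread evenly, term 1 occurs in a single document.
weights = log_entropy_weights([[2, 2, 2], [3, 0, 0]])
```

In an LSI pipeline a matrix weighted this way would then be reduced by a truncated SVD before similarities are computed; that step is omitted here.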


Subject(s)
Algorithms; Data Mining/methods; Gene Regulatory Networks; Oligonucleotide Array Sequence Analysis; Transcription Factors/isolation & purification; Amino Acid Motifs; Animals; Humans; Mice; PubMed; Systems Biology; Transcription Factors/chemistry; Transcription Factors/genetics; Transcription Factors/metabolism
5.
PLoS One ; 6(4): e18851, 2011 Apr 14.
Article in English | MEDLINE | ID: mdl-21533142

ABSTRACT

High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. AVAILABILITY: GCAT is freely available at http://binf1.memphis.edu/gcat.
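
As a rough illustration of how a Fisher's exact test could turn pairwise similarities into a cohesion p-value, the sketch below computes a one-sided test on a 2x2 table. The table layout (gene pairs above or below a similarity cutoff, inside the gene set versus in a background) is an assumption made for this example; the abstract does not describe how the authors built their contingency table, and the counts are made up.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with all margins fixed.
    """
    row1 = a + b          # e.g. number of gene pairs inside the gene set
    col1 = a + c          # e.g. number of pairs above the similarity cutoff
    n = a + b + c + d
    denom = comb(n, row1)
    upper = min(row1, col1)
    tail = 0
    for x in range(a, upper + 1):
        tail += comb(col1, x) * comb(n - col1, row1 - x)
    return tail / denom


# Hypothetical counts: 12 of 15 in-set pairs exceed the cutoff,
# compared with 40 of 400 background pairs.
lpv = fisher_exact_greater(12, 3, 40, 360)
print(f"illustrative cohesion p-value: {lpv:.3g}")
```

Exact integer arithmetic via `math.comb` avoids rounding problems for small tables; for large counts a log-space or library implementation would be more practical.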


Subject(s)
Abstracting and Indexing; Genes; PubMed; Animals; Humans
6.
BMC Bioinformatics ; 11 Suppl 6: S14, 2010 Oct 07.
Article in English | MEDLINE | ID: mdl-20946597

ABSTRACT

BACKGROUND: Searching the enormous amount of information available in biomedical literature to extract novel functional relationships among genes remains a challenge in the field of bioinformatics. While numerous (software) tools have been developed to extract and identify gene relationships from biological databases, few effectively deal with extracting new (or implied) gene relationships, a process which is useful in interpretation of discovery-oriented genome-wide experiments. RESULTS: In this study, we develop a Web-based bioinformatics software environment called FAUN or Feature Annotation Using Nonnegative matrix factorization (NMF) to facilitate both the discovery and classification of functional relationships among genes. Both the computational complexity and parameterization of NMF for processing gene sets are discussed. FAUN is tested on three manually constructed gene document collections. Its utility and performance as a knowledge discovery tool is demonstrated using a set of genes associated with Autism. CONCLUSIONS: FAUN not only assists researchers to use biomedical literature efficiently, but also provides utilities for knowledge discovery. This Web-based software environment may be useful for the validation and analysis of functional associations in gene subsets identified by high-throughput experiments.
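
For readers unfamiliar with nonnegative matrix factorization, here is a minimal pure-Python sketch using the standard Lee-Seung multiplicative updates for the squared-error objective. It is not FAUN's implementation: the abstract does not say which NMF algorithm, initialization, or parameters FAUN uses, and the helper names, toy matrix, and defaults below are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix v as w times h, with w and h non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random start so the multiplicative updates can move.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


# Toy term-by-document matrix that is exactly rank 1.
v = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
w, h = nmf(v, rank=1)
approx = matmul(w, h)
max_error = max(abs(approx[i][j] - v[i][j]) for i in range(3) for j in range(2))
```

Each column of `w` can be read as a feature (a weighted group of terms) and each row of `h` as the strength of that feature in each document, which is what makes the factors usable as annotations.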


Subject(s)
Genomics/methods; Molecular Sequence Annotation; Databases, Factual; Gene Expression Profiling; Genome; Knowledge Bases
7.
J Interferon Cytokine Res ; 28(9): 541-51, 2008 Sep.
Article in English | MEDLINE | ID: mdl-18715197

ABSTRACT

Interferons (IFNs) are critical to the host innate immune response by inducing the expression of a family of early response genes, denoted as IFN-stimulated genes (ISGs). The role of tyrosine phosphorylation of STAT proteins in the transcription activation of ISGs is well-documented. Recent studies have indicated that other transcription factors (TFs) are likely to play a role in regulating ISG expression. Here, we describe a novel integrative approach that combines gene expression profiling, promoter sequence analysis, and literature mining to screen candidate regulatory factors in the IFN signal transduction pathway. Application of this method identified the nuclear factor kappaB (NFkappaB) protein, cRel, as a candidate regulatory factor for a subset of ISGs in mouse embryo fibroblasts. Chromatin immunoprecipitation (ChIP) and real-time PCR assays confirmed that cRel directly binds to the promoters of several ISGs, including Cxcl10, Isg15, Gbp2, Ifit3, and Ifi203, and regulates their expression. Thus, our studies identify cRel as an important TF for ISGs, and validate the approach of using Latent Semantic Indexing (LSI)-based methods to identify regulatory factors from microarray data.


Subject(s)
Gene Expression Regulation/immunology; Interferons/immunology; Proto-Oncogene Proteins c-rel/metabolism; Transcription Factor RelA/metabolism; Animals; Cell Line; Chromatin Immunoprecipitation; Computational Biology; Fibroblasts/drug effects; Fibroblasts/immunology; Gene Expression Profiling; Interferons/pharmacology; Mice; Mice, Inbred C57BL; Promoter Regions, Genetic
8.
Comput Intell Neurosci ; : 276535, 2008.
Article in English | MEDLINE | ID: mdl-18431447

ABSTRACT

Identifying functional groups of genes is a challenging problem for biological applications. Text mining approaches can be used to build hierarchical clusters or trees from the information in the biological literature. In particular, the nonnegative matrix factorization (NMF) is examined as one approach to label hierarchical trees. A generic labeling algorithm as well as an evaluation technique is proposed, and the effects of different NMF parameters with regard to convergence and labeling accuracy are discussed. The primary goals of this study are to provide a qualitative assessment of the NMF and its various parameters and initialization, to provide an automated way to classify biomedical data, and to provide a method for evaluating labeled data assuming a static input tree. As a byproduct, a method for generating gold standard trees is proposed.

9.
Bioinformatics ; 21(1): 104-15, 2005 Jan 01.
Article in English | MEDLINE | ID: mdl-15308538

ABSTRACT

MOTIVATION: A major challenge in the interpretation of high-throughput genomic data is understanding the functional associations between genes. Previously, several approaches have been described to extract gene relationships from various biological databases using term-matching methods. However, more flexible automated methods are needed to identify functional relationships (both explicit and implicit) between genes from the biomedical literature. In this study, we explored the utility of Latent Semantic Indexing (LSI), a vector space model for information retrieval, to automatically identify conceptual gene relationships from titles and abstracts in MEDLINE citations. RESULTS: We found that LSI identified gene-to-gene and keyword-to-gene relationships with high average precision. In addition, LSI identified implicit gene relationships based on word usage patterns in the gene abstract documents. Finally, we demonstrate here that pairwise distances derived from the vector angles of gene abstract documents can be effectively used to functionally group genes by hierarchical clustering. Our results provide proof-of-principle that LSI is a robust automated method to elucidate both known (explicit) and unknown (implicit) gene relationships from the biomedical literature. These features make LSI particularly useful for the analysis of novel associations discovered in genomic experiments. AVAILABILITY: The 50-gene document collection used in this study can be interactively queried at http://shad.cs.utk.edu/sgo/sgo.html.


Subject(s)
Abstracting and Indexing/methods; Algorithms; Gene Expression Profiling/methods; MEDLINE; Natural Language Processing; Pattern Recognition, Automated/methods; Proteins/classification; Proteins/metabolism; Cluster Analysis; Documentation/methods; Information Storage and Retrieval/methods; Semantics; Signal Transduction/physiology
10.
BMC Bioinformatics ; 5: 204, 2004 Dec 17.
Article in English | MEDLINE | ID: mdl-15606920

ABSTRACT

BACKGROUND: Eukaryotic whole genome sequences are accumulating at an impressive rate. Effective methods for comparing multiple whole eukaryotic genomes on a large scale are needed. Most attempted solutions involve the production of large scale alignments, and many of these require a high stringency pre-screen for putative orthologs in order to reduce the effective size of the dataset and provide a reasonably high but unknown fraction of correctly aligned homologous sites for comparison. As an alternative, highly efficient methods that do not require the pre-alignment of operationally defined orthologs are also being explored. RESULTS: A non-alignment method based on the Singular Value Decomposition (SVD) was used to compare the predicted protein complement of nine whole eukaryotic genomes ranging from yeast to man. This analysis resulted in the simultaneous identification and definition of a large number of well conserved motifs and gene families, and produced a species tree supporting one of two conflicting hypotheses of metazoan relationships. CONCLUSIONS: Our SVD-based analysis of the entire protein complement of nine whole eukaryotic genomes suggests that highly conserved motifs and gene families can be identified and effectively compared in a single coherent definition space for the easy extraction of gene and species trees. While this occurs without the explicit definition of orthologs or homologous sites, the analysis can provide a basis for these definitions.


Subject(s)
Biological Evolution; Computational Biology/methods; Genome; Genomics/methods; Proteomics/methods; Amino Acid Motifs; Animals; Databases, Genetic; Databases, Protein; Evolution, Molecular; Humans; Peptides/chemistry; Phylogeny; Proteins/chemistry; Proteome; Software; Species Specificity
11.
J Bioinform Comput Biol ; 1(3): 475-93, 2003 Oct.
Article in English | MEDLINE | ID: mdl-15290766

ABSTRACT

As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Methods introduced to date have generally utilized incomplete and likely insufficient subsets of the available data. We have developed an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of a large sparse data matrix in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Quantitative pairwise estimates of species similarity were obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. Evolutionary trees produced using this method confirmed many accepted prokaryotic relationships. However, several unconventional relationships were also noted. In addition, we demonstrate that many of the SVD-derived right basis vectors represent particular conserved protein families, while many of the corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.
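
The vector construction in this abstract can be sketched in a simplified form: count overlapping tetrapeptides per protein, sum the protein vectors into a species vector, and compare species by the cosine of the angle between their vectors. Note the important simplification: the study sums SVD-derived protein vectors, whereas this sketch works on raw tetrapeptide counts and omits the SVD step entirely. The function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def tetrapeptide_counts(protein):
    """Count overlapping 4-residue windows in a protein sequence."""
    return Counter(protein[i:i + 4] for i in range(len(protein) - 3))


def species_vector(proteins):
    """Sum per-protein count vectors into one species vector."""
    total = Counter()
    for protein in proteins:
        total.update(tetrapeptide_counts(protein))
    return total


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "proteomes" (made-up sequences, far shorter than real proteins).
species_a = species_vector(["MKVLAT", "GAVLKK"])
species_b = species_vector(["MKVLAA", "GAVLKR"])
species_c = species_vector(["WWWWWW"])
print(cosine(species_a, species_b), cosine(species_a, species_c))
```

A pairwise matrix of such cosines (or of the corresponding angles) is the kind of input a distance-based tree-building method would take.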


Subject(s)
Bacterial Proteins/genetics; Genome, Bacterial; Genomics/statistics & numerical data; Phylogeny; Amino Acid Motifs; Amino Acid Sequence; Computational Biology; Databases, Genetic; Databases, Protein; Molecular Sequence Data; Peptides/genetics; Proteomics/statistics & numerical data; Software