ABSTRACT
BACKGROUND: Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. RESULTS: By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. CONCLUSIONS: BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations.
Our analyses show that mining only a small fraction of MEDLINE yields a large dataset of gene-disease associations, of which only a small proportion (2%) is recorded in curated resources, raising several issues for data prioritization and curation. We propose joint analysis of text-mined data with expert-curated data as a suitable approach both to assess data quality and to highlight novel and interesting information.
Subjects
Data Mining/methods; Disease/genetics; Information Storage and Retrieval; MEDLINE; Publications; Translational Research, Biomedical; Databases, Factual; Depression/genetics; Disease/classification; Humans; Knowledge Bases
ABSTRACT
We have created a 2D morphometric analysis of the developing mouse hindlimb bud. This analysis has provided two useful resources for the study of limb development. First, a temporally accurate numerical description of shape changes during normal mouse limb development. Second, a web-based morphometric staging system, which has the advantage of being easy to use, and with a reproducibility of about ±2 hours. It allows users to upload a dorsal-view photo of a limb bud, draw a spline curve and thereby stage the bud within a couple of minutes. We describe how the system is constructed, its robustness to user variation and illustrate one application: the accurate tracking of spatiotemporal dynamics of gene expression patterns.
Subjects
Embryonic Development/physiology; Limb Buds/anatomy & histology; Limb Buds/embryology; Animals; Body Weights and Measures/methods; Body Weights and Measures/standards; Embryonic Development/genetics; Gene Expression Profiling; Gene Expression Regulation, Developmental; Gestational Age; Growth Charts; Limb Buds/metabolism; Mice; Models, Biological; Organ Size/genetics; Organ Size/physiology; Reproducibility of Results; Research Design/standards; SOX9 Transcription Factor/genetics; SOX9 Transcription Factor/metabolism; Time Factors
ABSTRACT
Although the vertebrate limb bud has been studied for decades as a model system for spatial pattern formation and cell specification, the cellular basis of its distally oriented elongation has been a relatively neglected topic by comparison. The conventional view is that a gradient of isotropic proliferation exists along the limb, with high proliferation rates at the distal tip and lower rates towards the body, and that this gradient is the driving force behind outgrowth. Here we test this hypothesis by combining quantitative empirical data sets with computer modelling to assess the potential role of spatially controlled proliferation rates in the process of directional limb bud outgrowth. In particular, we generate two new empirical data sets for the mouse hind limb--a numerical description of shape change and a quantitative 3D map of cell cycle times--and combine these with a new 3D finite element model of tissue growth. By developing a parameter optimization approach (which explores spatial patterns of tissue growth) our computer simulations reveal that the observed distribution of proliferation rates plays no significant role in controlling the distally extending limb shape, and suggests that directional cell activities are likely to be the driving force behind limb bud outgrowth. This theoretical prediction prompted us to search for evidence of directional cell orientations in the limb bud mesenchyme, and we thus discovered a striking highly branched and extended cell shape composed of dynamically extending and retracting filopodia, a distally oriented bias in Golgi position, and also a bias in the orientation of cell division. We therefore provide both theoretical and empirical evidence that limb bud elongation is achieved by directional cell activities, rather than a PD gradient of proliferation rates.
Subjects
Limb Buds/cytology; Limb Buds/embryology; Morphogenesis; Animals; Cell Division; Cell Proliferation; Computer Simulation; Finite Element Analysis; Golgi Apparatus/metabolism; Limb Buds/growth & development; Mesoderm/cytology; Mice; Models, Biological; Pseudopodia/metabolism; Tomography, Optical
ABSTRACT
DisGeNET is a plugin for Cytoscape to query and analyze human gene-disease networks. DisGeNET allows user-friendly access to a new gene-disease database that we have developed by integrating data from several public sources. DisGeNET permits queries restricted to (i) the original data source, (ii) the association type, (iii) the disease class or (iv) specific gene(s)/disease(s). It represents gene-disease associations in terms of bipartite graphs and provides gene-centric and disease-centric views of the data. It assists the user in the interpretation and exploration of the genetic basis of human diseases through a variety of built-in functions. Moreover, DisGeNET permits multicolouring of nodes (genes/diseases) according to standard disease classification for expedient visualization. AVAILABILITY: DisGeNET is compatible with Cytoscape 2.6.3 and 2.7.0. Please visit http://ibi.imim.es/DisGeNET/DisGeNETweb.html for the installation guide, user tutorial and download.
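As a rough illustration of the bipartite representation described in this abstract, the sketch below builds gene-to-disease and disease-to-gene maps and derives one-mode projections from them. It is not DisGeNET code: the association list, the function names, and the reading of the gene-centric and disease-centric views as projections are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical gene-disease associations, invented for illustration only.
ASSOCIATIONS = [
    ("GENE_A", "disease_1"),
    ("GENE_A", "disease_2"),
    ("GENE_B", "disease_2"),
    ("GENE_C", "disease_3"),
]


def build_bipartite(associations):
    """Return (gene -> diseases, disease -> genes) adjacency maps."""
    gene_to_diseases = defaultdict(set)
    disease_to_genes = defaultdict(set)
    for gene, disease in associations:
        gene_to_diseases[gene].add(disease)
        disease_to_genes[disease].add(gene)
    return dict(gene_to_diseases), dict(disease_to_genes)


def project(adjacency):
    """One-mode projection: link two nodes of the same type
    when they share at least one neighbour of the other type."""
    nodes = sorted(adjacency)
    edges = set()
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if adjacency[a] & adjacency[b]:
                edges.add((a, b))
    return edges


gene_to_diseases, disease_to_genes = build_bipartite(ASSOCIATIONS)
disease_view = project(disease_to_genes)  # diseases linked by shared genes
gene_view = project(gene_to_diseases)     # genes linked by shared diseases
```

Keeping both adjacency maps makes either view a lookup rather than a scan over the association list, at the cost of storing each association twice.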
Subjects
Computational Biology/methods; Disease/genetics; Gene Regulatory Networks/genetics; Software; Databases, Genetic; Humans
ABSTRACT
BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most frequent type of sequence variation between individuals, and represent a promising tool for finding genetic determinants of complex diseases and understanding the differences in drug response. In this regard, it is of particular interest to study the effect of non-synonymous SNPs in the context of biological networks such as cell signalling pathways. UniProt provides curated information about the functional and phenotypic effects of sequence variation, including SNPs, as well as on mutations of protein sequences. However, no strategy has been developed to integrate this information with biological networks, with the ultimate goal of studying the impact of the functional effect of SNPs in the structure and dynamics of biological networks. RESULTS: First, we identified the different challenges posed by the integration of the phenotypic effect of sequence variants and mutations with biological networks. Second, we developed a strategy for the combination of data extracted from public resources, such as UniProt, NCBI dbSNP, Reactome and BioModels. We generated attribute files containing phenotypic and genotypic annotations to the nodes of biological networks, which can be imported into network visualization tools such as Cytoscape. These resources allow the mapping and visualization of mutations and natural variations of human proteins and their phenotypic effect on biological networks (e.g. signalling pathways, protein-protein interaction networks, dynamic models). Finally, an example on the use of the sequence variation data in the dynamics of a network model is presented. CONCLUSION: In this paper we present a general strategy for the integration of pathway and sequence variation data for visualization, analysis and modelling purposes, including the study of the functional impact of protein sequence variations on the dynamics of signalling pathways. 
This is of particular interest when the SNP or mutation is known to be associated with disease. We expect that this approach will help in the study of the functional impact of disease-associated SNPs on the behaviour of cell signalling pathways, which ultimately will lead to a better understanding of the mechanisms underlying complex diseases.
Subjects
Information Storage and Retrieval/methods; Models, Biological; Polymorphism, Single Nucleotide; Signal Transduction/physiology; Systems Biology/methods; Computer Simulation; ErbB Receptors; Phenotype; Proteins/chemistry; Proteins/physiology; Sequence Analysis, Protein; User-Computer Interface
ABSTRACT
BACKGROUND: Scientists have long been trying to understand the molecular mechanisms of diseases in order to design preventive and therapeutic strategies. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes; it is also necessary to uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such a catalogue is extremely difficult. PRINCIPAL FINDINGS: We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including Mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including Mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. CONCLUSIONS: For the first time, we include Mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies to all of them. We furthermore provide a functional analysis of disease-related modules, providing important new biological insights that might not be discovered when considering each of the gene-disease association repositories independently.
Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. AVAILABILITY: The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download.
Subjects
Disease/genetics; Environment; Gene Regulatory Networks/genetics; Cluster Analysis; Genetic Association Studies; Humans; Multigene Family/genetics; Phenotype
ABSTRACT
BACKGROUND: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions. RESULTS: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE in comparison to the identification of entities categorized as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants were used to generate the harmonised Silver Standard Corpus II (SSC-II), provided that the participant had not used the annotated data set from the SSC-I for training purposes. The performances of the participants' solutions were again measured against the SSC-II.
The performances of the annotation solutions again showed better results for DISO and SPE in comparison to CHED and PRGE. CONCLUSIONS: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs' annotation solutions in comparison to the SSC-I.
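For readers unfamiliar with the F-measure reported above, the following minimal sketch computes precision, recall and their harmonic mean for a set of predicted annotations against a reference set. It assumes exact matching on (document, start, end, semantic group) tuples, which may differ from the matching criteria actually used in the CALBC evaluation; the function name and the sample annotations are hypothetical.

```python
def f_measure(predicted, reference):
    """Return (precision, recall, F1) for exact-match annotation sets."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, start offset, end offset, group).
reference = {
    ("doc1", 0, 7, "DISO"),
    ("doc1", 12, 18, "PRGE"),
    ("doc2", 3, 9, "CHED"),
    ("doc2", 20, 25, "SPE"),
}
predicted = {
    ("doc1", 0, 7, "DISO"),
    ("doc1", 12, 18, "PRGE"),
    ("doc2", 3, 9, "CHED"),
    ("doc2", 30, 35, "SPE"),  # wrong span, counts as a false positive
}

precision, recall, f1 = f_measure(predicted, reference)
```

In this toy case three of four predictions match and three of four reference annotations are found, so precision, recall and F1 all come out at 0.75.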