ABSTRACT
The ever-growing availability of biomedical text sources has boosted the number of clinical studies that exploit them. Biomedical named-entity recognition (bio-NER) techniques have evolved remarkably in recent years, and their application in research is increasingly successful. Still, the disparity of tools and the limited availability of validation resources are barriers to wider adoption, especially within clinical practice. Here we propose the use of omics data and network analysis as an alternative for the assessment of bio-NER tools. Specifically, our method introduces quality criteria based on edge overlap and community detection. Applying these criteria to four bio-NER solutions yielded results comparable to those of strategies based on annotated corpora, without suffering from their limitations. Our approach can serve as a guide both for selecting the best bio-NER tool for a specific task and for creating and validating novel approaches.
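As an illustration of the edge-overlap criterion mentioned above, the following minimal sketch compares a network derived from a bio-NER tool with a reference network derived from omics data. It assumes that overlap is measured as the Jaccard index of the two undirected edge sets; the abstract does not specify the exact measure, and the function names and toy disease pairs are hypothetical.

```python
def normalize_edges(edges):
    """Treat edges as undirected and drop self-loops:
    (a, b) and (b, a) become the same frozenset."""
    return {frozenset(edge) for edge in edges if edge[0] != edge[1]}


def edge_overlap(edges_a, edges_b):
    """Jaccard index of two undirected edge sets (assumed measure).

    Returns 0.0 when both networks are empty.
    """
    a, b = normalize_edges(edges_a), normalize_edges(edges_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy networks: disease pairs linked by a bio-NER tool
# and disease pairs linked by shared omics features.
ner_network = [("asthma", "copd"), ("copd", "emphysema"), ("asthma", "eczema")]
omics_network = [("copd", "asthma"), ("emphysema", "copd"), ("asthma", "rhinitis")]

score = edge_overlap(ner_network, omics_network)
print(f"edge overlap: {score:.2f}")
```

Under this assumption, a higher score means the tool's network agrees more closely with the omics-derived reference, which is one way such a criterion could rank competing tools.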
ABSTRACT
BACKGROUND AND OBJECTIVES: The growing integration of healthcare sources is improving our understanding of diseases. Cross-mapping resources such as UMLS play a very important role in this area, but their coverage is still incomplete. To facilitate the integration and interoperability of biological, clinical and literature sources in studies of diseases, we built DisMaNET, a system that cross-maps terms from disease vocabularies by leveraging the power and interpretability of network analysis. METHODS: First, we collected and normalized data from 8 disease vocabularies and mapping sources to generate our datasets. Next, we built DisMaNET by integrating the generated datasets into a Neo4j graph database. Then we exploited the query mechanisms of Neo4j to cross-map disease terms of different vocabularies with a relevance score metric and compared the results with those of some state-of-the-art solutions. Finally, we made our system publicly available for exploitation and evaluation through both a graphical user interface and REST APIs. RESULTS: DisMaNET contains almost half a million nodes and nearly nine hundred thousand edges, including hierarchical and mapping relationships. Its query capabilities enabled the detection of connections between disease vocabularies that are not present in major mapping sources such as UMLS and the Disease Ontology, even for rare diseases. Furthermore, DisMaNET was capable of obtaining more than 80% of the mappings with UMLS reported in MonDO and DisGeNET, and it was successfully exploited to resolve the missing mappings in the DISNET project. CONCLUSIONS: DisMaNET is a powerful, intuitive and publicly available system to cross-map terms from different disease vocabularies. Our study shows that it is a competitive alternative to existing mapping systems, with the potential of network analysis and the interpretability of the results through a visual interface as its main advantages. 
Expansion with new sources, versioning and the improvement of the search and scoring algorithms are envisioned as future lines of work.
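The cross-mapping idea described in the METHODS above can be sketched in plain Python. This is not DisMaNET's implementation: it replaces the Neo4j queries with an in-memory breadth-first search, and the relevance score used here (the reciprocal of the number of hops) is only an assumed stand-in for the system's actual metric. The vocabulary names and term codes are hypothetical.

```python
from collections import deque


def build_graph(mappings):
    """Build an undirected adjacency map from (term, term) mapping pairs.

    Each term is a (vocabulary, code) tuple.
    """
    graph = {}
    for a, b in mappings:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def cross_map(graph, source, target_vocab, max_hops=3):
    """Find terms of target_vocab reachable from source within max_hops.

    Returns {term: score}, where score = 1 / hops is an assumed,
    illustrative relevance measure (direct mappings score 1.0).
    """
    results = {}
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops > 0 and node[0] == target_vocab:
            results[node] = 1.0 / hops
        if hops < max_hops:
            for neighbour in sorted(graph.get(node, ())):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
    return results


# Hypothetical mappings between three made-up vocabularies A, B and C.
mappings = [
    (("A", "a1"), ("B", "b1")),
    (("B", "b1"), ("C", "c1")),
    (("A", "a1"), ("C", "c2")),
]
graph = build_graph(mappings)
results = cross_map(graph, ("A", "a1"), "C")
for term, score in sorted(results.items(), key=lambda item: -item[1]):
    print(term, score)
```

In this toy graph the term `c2` is mapped directly from `a1`, while `c1` is reached only through the intermediate vocabulary B and therefore receives a lower score. Indirect paths of this kind are how a graph-based approach could recover mappings that no single source states explicitly.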
Subjects
Controlled Vocabulary, Vocabulary, Algorithms, Factual Databases
ABSTRACT
BACKGROUND: Within the global endeavour of improving population health, one major challenge is the identification and integration of medical knowledge spread across several information sources. Creating a comprehensive dataset of diseases and their clinical manifestations from public sources is an interesting approach: it makes it possible not only to complement and merge medical knowledge but also to extend it, and thereby to interconnect existing data and to analyse and relate diseases to each other. In this paper, we present DISNET (http://disnet.ctb.upm.es/), a web-based system designed to periodically extract knowledge about signs and symptoms from medical databases, and to enable the creation of customisable disease networks. METHODS: We present the main features of the DISNET system. We describe how information on diseases and their phenotypic manifestations is extracted from Wikipedia and PubMed; specifically, texts from these sources are processed through a combination of text mining and natural language processing techniques. RESULTS: We further present the validation of our system on Wikipedia and PubMed texts and report the accuracy obtained. The final output includes a comprehensive symptoms-disease dataset, shared (free access) through the system's API. Finally, we describe, with some simple use cases, how a user can interact with the system and extract information for subsequent analyses. DISCUSSION: DISNET allows retrieving knowledge about the signs, symptoms and diagnostic tests associated with a disease. It is not limited to a specific category of diseases (it covers all the categories offered by the selected information sources) or to specific clinical diagnosis terms. It further allows tracking the evolution of those terms through time, thus providing an opportunity to analyse and observe the progress of human knowledge on diseases. 
We further discuss the validation of the system, which suggests that it is good enough to be used to extract diseases and diagnostically relevant terms. At the same time, the evaluation also revealed that improvements could be introduced to enhance the system's reliability.
ABSTRACT
Over a decade ago, a new discipline called network medicine emerged as an approach to understanding human diseases from a network theory point of view. Disease networks proved to be an intuitive and powerful way to reveal hidden connections among apparently unconnected biomedical entities such as diseases, physiological processes, signaling pathways, and genes. One of the fields that has benefited most from this advance is the identification of new uses for old drugs, known as drug repurposing. The importance of drug repurposing lies in the high costs and the prolonged time from target selection to regulatory approval in traditional drug development. In this document we analyze the evolution of the disease network concept during the last decade and apply a data science pipeline approach to evaluate the functional units of disease networks. As a result of this analysis, we obtain a list of the most commonly used functional units and the challenges that remain to be solved. This information can be very valuable for the generation of new prediction models based on disease networks.