1.
Bioinformatics ; 39(4)2023 04 03.
Article in English | MEDLINE | ID: mdl-37018156

ABSTRACT

MOTIVATION: Relation extraction (RE) is a crucial process to deal with the amount of text published daily, e.g. to find missing associations in a database. RE is a text mining task for which the state-of-the-art approaches use bidirectional encoders, namely, BERT. However, state-of-the-art performance may be limited by the lack of efficient external knowledge injection approaches, with a larger impact in the biomedical area given the widespread usage and high quality of biomedical ontologies. This knowledge can propel these systems forward by aiding them in predicting more explainable biomedical associations. With this in mind, we developed K-RET, a novel, knowledgeable biomedical RE system that, for the first time, injects knowledge by handling different types of associations, multiple sources and where to apply it, and multi-token entities. RESULTS: We tested K-RET on three independent and open-access corpora (DDI, BC5CDR, and PGR) using four biomedical ontologies handling different entities. K-RET improved state-of-the-art results by 2.68% on average, with the DDI Corpus yielding the most significant boost in performance, from 79.30% to 87.19% in F-measure, representing a P-value of 2.91×10^-12. AVAILABILITY AND IMPLEMENTATION: https://github.com/lasigeBioTM/K-RET.


Subject(s)
Biological Ontologies , Data Mining , Data Mining/methods , Databases, Factual
2.
BMC Bioinformatics ; 24(1): 171, 2023 Apr 26.
Article in English | MEDLINE | ID: mdl-37101154

ABSTRACT

BACKGROUND: Complex diseases such as neurodevelopmental disorders (NDDs) exhibit multiple etiologies. The multi-etiological nature of complex diseases emerges from distinct but functionally similar groups of genes. Different diseases sharing genes of such groups show related clinical outcomes that further restrict our understanding of disease mechanisms, thus limiting the applications of personalized medicine approaches to complex genetic disorders. RESULTS: Here, we present an interactive and user-friendly application, called DGH-GO. DGH-GO allows biologists to dissect the genetic heterogeneity of complex diseases by stratifying the putative disease-causing genes into clusters that may contribute to distinct disease outcome development. It can also be used to study the shared etiology of complex diseases. DGH-GO creates a semantic similarity matrix for the input genes by using Gene Ontology (GO). The resultant matrix can be visualized in 2D plots using different dimension reduction methods (t-SNE, principal component analysis, UMAP and principal coordinate analysis). In the next step, clusters of functionally similar genes are identified from the genes' functional similarities assessed through GO. This is achieved by employing four different clustering methods (K-means, hierarchical, fuzzy and PAM). The user may change the clustering parameters and explore their effect on stratification immediately. DGH-GO was applied to genes disrupted by rare genetic variants in Autism Spectrum Disorder (ASD) patients. The analysis confirmed the multi-etiological nature of ASD by identifying four clusters of genes that were enriched for distinct biological mechanisms and clinical outcomes. In the second case study, the analysis of genes shared by different NDDs showed that genes causing multiple disorders tend to aggregate in similar clusters, indicating a possible shared etiology.
CONCLUSION: DGH-GO is a user-friendly application that allows biologists to study the multi-etiological nature of complex diseases by dissecting their genetic heterogeneity. In summary, functional similarities, dimension reduction and clustering methods, coupled with interactive visualization and control over analysis, allow biologists to explore and analyze their datasets without requiring expert knowledge of these methods. The source code of the proposed application is available at https://github.com/Muh-Asif/DGH-GO.


Subject(s)
Autism Spectrum Disorder , Genetic Heterogeneity , Humans , Gene Ontology , Autism Spectrum Disorder/genetics , Software
3.
J Biomed Inform ; 132: 104137, 2022 08.
Article in English | MEDLINE | ID: mdl-35811025

ABSTRACT

The existence of unlinkable (NIL) entities is a major hurdle affecting the performance of Named Entity Linking approaches, and, consequently, the performance of downstream models that depend on them. Existing approaches to deal with NIL entities focus mainly on clustering and prediction and are limited to general entities. However, other domains, such as the biomedical sciences, are also prone to the existence of NIL entities, given the growing nature of scientific literature. We propose NILINKER, a model that includes a candidate retrieval module for biomedical NIL entities and a neural network that leverages the attention mechanism to find the top-k relevant concepts from target Knowledge Bases (MEDIC, CTD-Chemicals, ChEBI, HP, CTD-Anatomy and Gene Ontology-Biological Process) that may partially represent a given NIL entity. We also make available a new evaluation dataset, named EvaNIL, suitable for training and evaluating models focusing on the NIL entity linking task. This dataset contains 846,165 documents (abstracts and full-text biomedical articles), including 1,071,776 annotations, distributed across six partitions: EvaNIL-MEDIC, EvaNIL-CTD-Chemicals, EvaNIL-ChEBI, EvaNIL-HP, EvaNIL-CTD-Anatomy and EvaNIL-Gene Ontology-Biological Process. NILINKER was integrated into a graph-based Named Entity Linking model (REEL) and the results of the experiments show that this approach is able to increase the performance of the Named Entity Linking model.


Subject(s)
Data Mining , Neural Networks, Computer , Cluster Analysis , Data Mining/methods , Gene Ontology , Knowledge Bases
4.
BMC Bioinformatics ; 20(Suppl 10): 246, 2019 May 29.
Article in English | MEDLINE | ID: mdl-31138117

ABSTRACT

BACKGROUND: Given the increasing amount of biomedical resources that are being annotated with concepts from more than one ontology and covering multiple domains of knowledge, it is important to devise mechanisms to compare these resources that take into account the various domains of annotation. For example, metabolic pathways are annotated with their enzymes and their metabolites, and thus similarity measures should compare them with respect to both of those domains simultaneously. RESULTS: In this paper, we propose two approaches to lift existing single-ontology semantic similarity measures into multi-domain measures. The aggregative approach compares domains independently and averages the various similarity values into a final score. The integrative approach integrates all the relevant ontologies into a single one, calculating similarity in the resulting multi-domain ontology using the single-ontology measure. CONCLUSIONS: We evaluated the two approaches in a multidisciplinary epidemiology dataset by evaluating the capacity of the similarity measures to predict new annotations based on the existing ones. The results show a promising increase in performance of the multi-domain measures over the single-ontology ones in the vast majority of the cases. These results show that multi-domain measures outperform single-domain ones, and should be considered by the community as a starting point to study more efficient multi-domain semantic similarity measures.
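
The aggregative approach described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the Jaccard function is a stand-in for a real single-ontology semantic similarity measure, and the names `aggregative_similarity`, `pathway_a` and `pathway_b` are hypothetical, not taken from the paper.

```python
from statistics import mean


def jaccard(a: set, b: set) -> float:
    """Stand-in single-domain measure (assumption; not an ontology-based measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def aggregative_similarity(x: dict, y: dict, measure=jaccard) -> float:
    """Compare each annotation domain independently, then average the scores."""
    domains = sorted(set(x) | set(y))
    return mean(measure(x.get(d, set()), y.get(d, set())) for d in domains)


# Toy resources annotated in two domains (invented identifiers).
pathway_a = {"enzymes": {"E1", "E2"}, "metabolites": {"M1", "M2", "M3"}}
pathway_b = {"enzymes": {"E2", "E3"}, "metabolites": {"M1", "M2", "M3"}}

score = aggregative_similarity(pathway_a, pathway_b)
print(round(score, 3))  # enzymes score 1/3, metabolites score 1, mean 2/3
```

The integrative approach would instead merge the ontologies first and apply one measure to the merged structure, which this sketch does not attempt.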


Subject(s)
Biomedical Research , Semantics , Epidemics , Humans
5.
BMC Bioinformatics ; 20(1): 534, 2019 Oct 29.
Article in English | MEDLINE | ID: mdl-31664891

ABSTRACT

BACKGROUND: Biomedical literature concerns a wide range of concepts, requiring controlled vocabularies to maintain a consistent terminology across different research groups. However, as new concepts are introduced, biomedical literature is prone to ambiguity, specifically in fields that are advancing more rapidly, for example, drug design and development. Entity linking is a text mining task that aims at linking entities mentioned in the literature to concepts in a knowledge base. For example, entity linking can help find all documents that mention the same concept and improve relation extraction methods. Existing approaches focus on the local similarity of each entity and the global coherence of all entities in a document, but do not take into account the semantics of the domain. RESULTS: We propose a method, PPR-SSM, to link entities found in documents to concepts from domain-specific ontologies. Our method is based on Personalized PageRank (PPR), using the relations of the ontology to generate a graph of candidate concepts for the mentioned entities. We demonstrate how the knowledge encoded in a domain-specific ontology can be used to calculate the coherence of a set of candidate concepts, improving the accuracy of entity linking. Furthermore, we explore weighting the edges between candidate concepts using semantic similarity measures (SSM). We show how PPR-SSM can be used to effectively link named entities to biomedical ontologies, namely chemical compounds, phenotypes, and gene-product localization and processes. CONCLUSIONS: We demonstrated that PPR-SSM outperforms state-of-the-art entity linking methods in four distinct gold standards, by taking advantage of the semantic information contained in ontologies. Moreover, PPR-SSM is a graph-based method that does not require training data. Our method improved the entity linking accuracy of chemical compounds by 0.1385 when compared to a method that does not use SSMs.
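
To make the idea of ranking candidate concepts with Personalized PageRank over weighted edges more concrete, here is a minimal generic sketch. It is not the PPR-SSM implementation: the graph, the node names (`a1`, `a2`, `b1`), the edge weights standing in for semantic similarity values, the damping factor and the iteration count are all assumptions chosen for illustration.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=100):
    """Power iteration for Personalized PageRank on an undirected weighted graph.

    edges: dict mapping (u, v) -> weight.
    personalization: dict mapping node -> restart probability (should sum to 1).
    """
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    neighbours = {n: {} for n in nodes}
    for (u, v), w in edges.items():
        neighbours[u][v] = w
        neighbours[v][u] = w

    rank = {n: personalization.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        # Restart term: a fraction of the mass jumps back to the restart set.
        new_rank = {n: (1 - damping) * personalization.get(n, 0.0) for n in nodes}
        for u in nodes:
            total = sum(neighbours[u].values())
            if total == 0:
                # Dangling node: hand its mass back to the restart distribution.
                for n in nodes:
                    new_rank[n] += damping * rank[u] * personalization.get(n, 0.0)
                continue
            for v, w in neighbours[u].items():
                new_rank[v] += damping * rank[u] * w / total
        rank = new_rank
    return rank


# Toy case: mention A has two candidates (a1, a2); mention B has one (b1).
# Weights are hypothetical similarity scores between candidate concepts.
edges = {("a1", "b1"): 0.9, ("a2", "b1"): 0.1}
restart = {"a1": 1 / 3, "a2": 1 / 3, "b1": 1 / 3}
ranks = personalized_pagerank(edges, restart)
best_for_a = max(["a1", "a2"], key=ranks.get)
print(best_for_a)
```

In this toy graph the candidate that is more strongly connected to the other mention's candidate receives the higher score, which is the coherence intuition the abstract describes.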


Subject(s)
Semantics , Biological Ontologies , Data Mining/methods , Databases, Factual , Humans , Knowledge Bases , Vocabulary, Controlled
6.
BMC Bioinformatics ; 20(1): 10, 2019 Jan 07.
Article in English | MEDLINE | ID: mdl-30616557

ABSTRACT

BACKGROUND: Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks. However, these techniques rarely take advantage of existing domain-specific resources, such as ontologies. In Life and Health Sciences there is a vast and valuable set of such resources publicly available, which are continuously being updated. Biomedical ontologies are nowadays a mainstream approach to formalize existing knowledge about entities, such as genes, chemicals, phenotypes, and disorders. These resources contain supplementary information that may not be yet encoded in training data, particularly in domains with limited labeled data. RESULTS: We propose a new model to detect and classify relations in text, BO-LSTM, that takes advantage of domain-specific ontologies, by representing each entity as the sequence of its ancestors in the ontology. We implemented BO-LSTM as a recurrent neural network with long short-term memory units and using open biomedical ontologies, specifically Chemical Entities of Biological Interest (ChEBI), Human Phenotype, and Gene Ontology. We assessed the performance of BO-LSTM with drug-drug interactions mentioned in a publicly available corpus from an international challenge, composed of 792 drug descriptions and 233 scientific abstracts. By using the domain-specific ontology in addition to word embeddings and WordNet, BO-LSTM improved the F1-score of both the detection and classification of drug-drug interactions, particularly in a document set with a limited number of annotations. We adapted an existing DDI extraction model with our ontology-based method, obtaining a higher F1 score than the original model. Furthermore, we developed and made available a corpus of 228 abstracts annotated with relations between genes and phenotypes, and demonstrated how BO-LSTM can be applied to other types of relations. 
CONCLUSIONS: Our findings demonstrate that besides the high performance of current deep learning techniques, domain-specific ontologies can still be useful to mitigate the lack of labeled data.
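
The phrase "representing each entity as the sequence of its ancestors in the ontology" can be illustrated with a toy example. This sketch assumes a tree-shaped hierarchy with one parent per term and no cycles, whereas real ontologies such as ChEBI are graphs where a term may have several parents. The term names and the helper `ancestor_sequence` are invented for illustration and are not part of BO-LSTM.

```python
def ancestor_sequence(term, parents):
    """Return the path from the root down to `term`, following one parent per step.

    parents: dict mapping child -> parent (assumed acyclic, single parent).
    """
    path = [term]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return list(reversed(path))


# Hypothetical is-a hierarchy (not real ontology content).
toy_parents = {
    "drug_x": "class_b",
    "class_b": "class_a",
    "class_a": "root",
}

sequence = ancestor_sequence("drug_x", toy_parents)
print(sequence)  # ['root', 'class_a', 'class_b', 'drug_x']
```

A sequence like this could then be fed to a recurrent layer alongside the word embeddings, which is the role the abstract assigns to the ontology information.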


Subject(s)
Biological Ontologies , Data Mining/methods , Drug Interactions , Gene Ontology , Memory, Short-Term , Neural Networks, Computer , Software , Databases, Factual , Deep Learning , Humans , Natural Language Processing
7.
J Biomed Inform ; 98: 103273, 2019 10.
Article in English | MEDLINE | ID: mdl-31454647

ABSTRACT

In recent years, the technological advances for capturing genetic variation in large populations led to the identification of large numbers of putative or disease-causing variants. However, their mechanistic understanding is lagging far behind and has posed new challenges regarding their relevance for disease phenotypes, particularly for common complex disorders. In this study, we propose a systematic pipeline to infer biological meaning from genetic variants, namely rare Copy Number Variants (CNVs). The pipeline consists of three modules that seek to (1) improve genetic data quality by excluding low-confidence CNVs, (2) identify disrupted biological processes, and (3) aggregate similar enriched biological process terms using semantic similarity. The proposed pipeline was applied to CNVs from individuals diagnosed with Autism Spectrum Disorder (ASD). We found that rare CNVs disrupting brain-expressed genes dysregulated a wide range of biological processes, such as nervous system development and protein polyubiquitination. The disrupted biological processes identified in ASD patients were in accordance with previous findings. This coherence with the literature indicates the feasibility of the proposed pipeline in interpreting the biological role of genetic variants in complex disease development. The suggested pipeline is easily adjustable at each step and its independence from any specific dataset and software makes it an effective tool in analyzing existing genetic resources. The FunVar pipeline is available at https://github.com/lasigeBioTM/FunVar and includes pre- and post-processing steps to effectively interpret biological mechanisms of putative disease-causing genetic variants.


Subject(s)
Autism Spectrum Disorder/diagnosis , Autism Spectrum Disorder/genetics , Computational Biology/methods , DNA Copy Number Variations , Polymorphism, Single Nucleotide , Algorithms , Databases, Genetic , Gene Dosage , Genetic Predisposition to Disease , Genome, Human , Genomics , Genotype , Humans , Nervous System , Phenotype , Semantics , Software
8.
Adv Exp Med Biol ; 1137: 1-8, 2019.
Article in English | MEDLINE | ID: mdl-31183816

ABSTRACT

Health and Life studies are well known for the huge amount of data they produce, such as high-throughput sequencing projects (Stephens et al., PLoS Biol 13(7):e1002195, 2015; Hey et al., The fourth paradigm: data-intensive scientific discovery, vol 1. Microsoft research Redmond, Redmond, 2009). However, the value of the data should not be measured by its amount, but instead by the possibility and ability of researchers to retrieve and process it (Leonelli, Data-centric biology: a philosophical study. University of Chicago Press, Chicago, 2016). Transparency, openness, and reproducibility are key aspects to boost the discovery of novel insights into how living systems work (Nosek et al., Science 348(6242):1422-1425, 2015).


Subject(s)
Computational Biology , Data Analysis , High-Throughput Nucleotide Sequencing , Reproducibility of Results
9.
Adv Exp Med Biol ; 1137: 17-43, 2019.
Article in English | MEDLINE | ID: mdl-31183818

ABSTRACT

This chapter starts by introducing an example of how we can retrieve text, where every step is done manually. The chapter then describes how we can automate each step of the example using shell script commands, which are introduced and explained as they are required. The goal is to equip the reader with a basic set of skills to retrieve data from any online database and follow the links to retrieve more information from other sources, such as literature.


Subject(s)
Databases, Factual , Information Storage and Retrieval , Programming Languages , Internet
10.
Adv Exp Med Biol ; 1137: 9-15, 2019.
Article in English | MEDLINE | ID: mdl-31183817

ABSTRACT

The previous chapter presented the importance of text and semantic resources for Health and Life studies. This chapter will describe what kind of text and semantic resources are available, where they can be found, and how they can be accessed and retrieved.


Subject(s)
Information Storage and Retrieval , Semantics , Data Analysis
11.
Adv Exp Med Biol ; 1137: 45-60, 2019.
Article in English | MEDLINE | ID: mdl-31183819

ABSTRACT

In the previous chapter we were able to automatically process structured data to retrieve biomedical text about any chemical compound, such as caffeine. This chapter will provide a step-by-step introduction to how we can process that text using shell script commands, specifically extract information about diseases related to caffeine. The goal is to equip the reader with an essential set of skills to extract meaningful information from any text.


Subject(s)
Data Mining/methods , Electronic Data Processing , Caffeine , Software
12.
Adv Exp Med Biol ; 1137: 61-91, 2019.
Article in English | MEDLINE | ID: mdl-31183820

ABSTRACT

In the previous chapter we were able to automatically process text by recognizing a limited set of entities. This chapter will introduce the world of semantics, and present step-by-step examples to retrieve and enhance text and data processing by using semantics. The goal is to equip the reader with the basic set of skills to explore semantic resources that are nowadays available using simple shell script commands.


Subject(s)
Electronic Data Processing , Information Storage and Retrieval , Semantics
13.
BMC Microbiol ; 18(1): 194, 2018 11 23.
Article in English | MEDLINE | ID: mdl-30470193

ABSTRACT

BACKGROUND: Theobroma cacao L. (cacao) is a perennial tropical tree, endemic to rainforests of the Amazon Basin. Large populations of bacteria live on leaf surfaces and these phylloplane microorganisms can have important effects on plant health. In recent years, the advent of high-throughput sequencing techniques has greatly facilitated studies of the phylloplane microbiome. In this study, we characterized the bacterial microbiome of the phylloplane of the Catongo genotype (susceptible to witch's broom) and CCN51 (resistant). The bacterial microbiome was determined by sequencing the V3-V4 region of the bacterial 16S rRNA gene. RESULTS: After pre-processing, a total of 1.7 million reads were considered. In total, 106 genera of bacteria were characterized. Proteobacteria was the predominant phylum in both genotypes. The genera exclusive to Catongo showed activity in protection against UV radiation and in the transport of substrates. CCN51 presented genera that act in biological control and inhibition in several taxonomic groups. Genotype CCN51 presented greater diversity of microorganisms in comparison to the Catongo genotype and the total community was different between both. Scanning electron microscopy analysis of leaves revealed that on the phylloplane, many bacteria occur in large aggregates in several regions of the surface and in isolation near the stomata. CONCLUSIONS: We describe for the first time the phylloplane bacterial communities of T. cacao. The genotype CCN51, resistant to witch's broom, has a greater diversity of bacterial microbiome in comparison to Catongo and a greater number of exclusive microorganisms in the phylloplane with antagonistic action against phytopathogens.


Subject(s)
Agaricales/physiology , Bacteria/isolation & purification , Biodiversity , Cacao/microbiology , Plant Diseases/microbiology , Plant Leaves/microbiology , Bacteria/classification , Bacteria/genetics , Bacteria/growth & development , Cacao/genetics , Cacao/immunology , Cacao/physiology , Disease Resistance , Genotype , High-Throughput Nucleotide Sequencing , Microbiota , Plant Diseases/genetics , Plant Diseases/immunology , Plant Leaves/immunology , Symbiosis
14.
J Biomed Inform ; 82: 1-12, 2018 06.
Article in English | MEDLINE | ID: mdl-29660494

ABSTRACT

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if not protected to the highest standards. In this article, we take the position that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads cannot extend to long reads and present a novel filtering approach that classifies raw genomic data (i.e., whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied equally to reads of any length, making it usable with any recent or future sequencing technologies. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome instead of 100,000 in previous works). It has far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% with 2% of mutations). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.


Subject(s)
Confidentiality , Genomics/methods , Medical Informatics/methods , Algorithms , Computer Security , False Positive Reactions , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Medical Informatics/trends , Sequence Analysis, DNA , Software
15.
Brief Bioinform ; 16(1): 89-103, 2015 Jan.
Article in English | MEDLINE | ID: mdl-24197933

ABSTRACT

Semantic web technologies offer an approach to data integration and sharing, even for resources developed independently or broadly distributed across the web. This approach is particularly suitable for scientific domains that profit from large amounts of data that reside in the public domain and that have to be exploited in combination. Translational medicine is such a domain, which in addition has to integrate private data from the clinical domain with proprietary data from the pharmaceutical domain. In this survey, we present the results of our analysis of translational medicine solutions that follow a semantic web approach. We assessed these solutions in terms of their target medical use case; the resources covered to achieve their objectives; and their use of existing semantic web resources for the purposes of data sharing, data interoperability and knowledge discovery. The semantic web technologies seem to fulfill their role in facilitating the integration and exploration of data from disparate sources, but it is also clear that simply using them is not enough. It is fundamental to reuse resources, to define mappings between resources, to share data and knowledge. All these aspects allow the instantiation of translational medicine at the semantic web-scale, thus resulting in a network of solutions that can share resources for a faster transfer of new scientific results into the clinical practice. The envisioned network of translational medicine solutions is on its way, but it still requires resolving the challenges of sharing protected data and of integrating semantic-driven technologies into the clinical practice.


Subject(s)
Information Dissemination/methods , Internet , Translational Research, Biomedical , Algorithms , Computational Biology/methods , Humans
16.
Bioinformatics ; 29(21): 2781-7, 2013 Nov 01.
Article in English | MEDLINE | ID: mdl-24002110

ABSTRACT

MOTIVATION: Representing domain knowledge in biology has traditionally been accomplished by creating simple hierarchies of classes with textual annotations. Recently, expressive ontology languages, such as Web Ontology Language, have become more widely adopted, supporting axioms that express logical relationships other than class-subclass, e.g. disjointness. This is improving the coverage and validity of the knowledge contained in biological ontologies. However, current semantic tools still need to adapt to this more expressive information. In this article, we propose a method to integrate disjointness axioms, which are being incorporated in real-world ontologies, such as the Gene Ontology and the chemical entities of biological interest ontology, into semantic similarity, the measure that estimates the closeness in meaning between classes. RESULTS: We present a modification of the measure of shared information content, which extends the base measure to allow the incorporation of disjointness information. To evaluate our approach, we applied it to several randomly selected datasets extracted from the chemical entities of biological interest ontology. In 93.8% of these datasets, our measure performed better than the base measure of shared information content. This supports the idea that semantic similarity is more accurate if it extends beyond the hierarchy of classes of the ontology. CONTACT: joao.ferreira@lasige.di.fc.ul.pt. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Vocabulary, Controlled , Data Interpretation, Statistical , Semantics
17.
Nutrients ; 16(11)2024 May 26.
Article in English | MEDLINE | ID: mdl-38892560

ABSTRACT

Blood selenium (Se) concentrations differ substantially by population and could be influenced by genetic variants, increasing Se deficiency-related diseases. We conducted a genome-wide association study (GWAS) to identify single nucleotide polymorphisms (SNPs) associated with serum Se deficiency in 382 adults with admixed ancestry. Genotyping arrays were combined to yield 90,937 SNPs. R packages were applied to quality control and imputation. We also performed the ancestral proportion analysis. The Search Tool for the Retrieval of Interacting Genes was used to interrogate known protein-protein interaction networks (PPIs). Our ancestral proportion analysis estimated 71% of the genome was from Caucasians, 22% was from Africans, and 8% was from East Asians. We identified the SNP rs1561573 in the TraB domain containing 2B (TRABD2B), rs425664 in MAF bZIP transcription factor (MAF), rs10444656 in spermatogenesis-associated 13 (SPATA13), and rs6592284 in heat shock protein nuclear import factor (HIKESHI) genes. The PPI analysis showed functional associations of Se deficiency, thyroid hormone metabolism, NRF2-ARE and the Wnt pathway, and heat stress. Our findings show evidence of a genetic association between Se deficiency and metabolic pathways indirectly linked to Se regulation, reinforcing the complex relationship between Se intake and the endogenous factors affecting the Se requirements for optimal health.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Selenium , Adult , Female , Humans , Male , Middle Aged , Brazil , Genetic Predisposition to Disease , Genotype , Protein Interaction Maps/genetics , Selenium/blood , Selenium/deficiency , White People/genetics , African People , East Asian People
18.
PLoS Comput Biol ; 8(9): e1002630, 2012.
Article in English | MEDLINE | ID: mdl-23028267

ABSTRACT

Developing and extending a biomedical ontology is a very demanding task that can never be considered complete given our ever-evolving understanding of the life sciences. Extension in particular can benefit from the automation of some of its steps, thus releasing experts to focus on harder tasks. Here we present a strategy to support the automation of change capturing within ontology extension where the need for new concepts or relations is identified. Our strategy is based on predicting areas of an ontology that will undergo extension in a future version by applying supervised learning over features of previous ontology versions. We used the Gene Ontology as our test bed and obtained encouraging results with average f-measure reaching 0.79 for a subset of biological process terms. Our strategy was also able to outperform state of the art change capturing methods. In addition we have identified several issues concerning prediction of ontology evolution, and have delineated a general framework for ontology extension prediction. Our strategy can be applied to any biomedical ontology with versioning, to help focus either manual or semi-automated extension methods on areas of the ontology that need extension.


Subject(s)
Algorithms , Computational Biology/methods , Database Management Systems , Evolution, Molecular , Information Storage and Retrieval/methods , Natural Language Processing , Vocabulary, Controlled , Artificial Intelligence , Humans
19.
IEEE J Biomed Health Inform ; 26(8): 4207-4217, 2022 08.
Article in English | MEDLINE | ID: mdl-35536818

ABSTRACT

Biomedical Relation Extraction (RE) systems identify and classify relations between biomedical entities to enhance our knowledge of biological and medical processes. Most state-of-the-art systems use deep learning approaches, mainly to target relations between entities of the same type, such as proteins or pharmacological substances. However, these systems are mostly restricted to what they directly identify in the text and ignore specialized domain knowledge bases, such as ontologies, that formalize and integrate biomedical information typically structured as directed acyclic graphs. On the other hand, Knowledge Graph (KG)-based recommendation systems have already shown the importance of integrating KGs to add features to items. Typical systems have people as users and items that can range from movies to books, which people saw or read and rated according to their satisfaction. This work proposes to integrate KGs into biomedical RE through a recommendation model to further improve their range of action. We developed a new RE system, named K-BiOnt, by integrating a baseline state-of-the-art deep biomedical RE system with an existing KG-based recommendation state-of-the-art system. Our results show that adding recommendations from KG-based recommendation improves the system's ability to identify true relations that the baseline deep RE model could not extract from the text. The code supporting this system is available at https://github.com/lasigeBioTM/K-BiOnt.


Subject(s)
Knowledge Bases , Pattern Recognition, Automated , Humans
20.
Sci Data ; 9(1): 478, 2022 08 04.
Article in English | MEDLINE | ID: mdl-35927282

ABSTRACT

The recommendation of items based on users' past sequential preferences has evolved in the last few years, mostly due to deep learning approaches, such as BERT4Rec. However, in scientific fields, recommender systems for recommending the next best item are not widely used. The main goal of this work is to improve the results for the recommendation of the next best item in scientific domains using sequence-aware datasets and algorithms. In the first part of this work, we present the adaptation of a previous method (LIBRETTI) for creating sequential recommendation datasets for scientific fields. The results were assessed in Astronomy and Chemistry. In the second part of this work, we propose a new approach to improve the datasets, not the algorithms, to obtain better recommendations. The new hybrid approach is called sequential enrichment (SeEn), which consists of adding to a sequence of items the n most similar items after each original item. The results show that the enriched sequences obtained better results than the original ones. The Chemistry dataset improved by approximately seven percentage points and the Astronomy dataset by 16 percentage points for Hit Ratio and Normalized Discounted Cumulative Gain.
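
The enrichment step, inserting the n most similar items after each original item, can be sketched as follows. This is a minimal reading of the one-sentence description in the abstract, not the authors' code. The lookup table `similar_items` (each item mapped to its neighbours, already sorted by decreasing similarity) and the function name `enrich_sequence` are assumptions.

```python
def enrich_sequence(sequence, similar_items, n=1):
    """Insert up to n most similar items after each original item.

    similar_items: dict mapping item -> list of similar items, most similar first
    (assumed to be precomputed by some similarity method).
    """
    enriched = []
    for item in sequence:
        enriched.append(item)
        enriched.extend(similar_items.get(item, [])[:n])
    return enriched


# Invented items and neighbours, for illustration only.
original = ["A", "B"]
neighbours = {"A": ["A1", "A2"], "B": ["B1"]}

enriched_one = enrich_sequence(original, neighbours, n=1)
enriched_two = enrich_sequence(original, neighbours, n=2)
print(enriched_one)  # ['A', 'A1', 'B', 'B1']
print(enriched_two)  # ['A', 'A1', 'A2', 'B', 'B1']
```

The original order of the sequence is preserved, so a sequence-aware recommender can be trained on the enriched sequences without changing the algorithm itself.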
