Search | VHL Regional Portal

1.

K-RET: knowledgeable biomedical relation extraction system.

Sousa, Diana F; Couto, Francisco M.

Bioinformatics ; 39(4), 2023 04 03.

Article in English | MEDLINE | ID: mdl-37018156

ABSTRACT

MOTIVATION: Relation extraction (RE) is a crucial process to deal with the amount of text published daily, e.g. to find missing associations in a database. RE is a text mining task for which the state-of-the-art approaches use bidirectional encoders, namely, BERT. However, state-of-the-art performance may be limited by the lack of efficient external knowledge injection approaches, with a larger impact in the biomedical area given the widespread usage and high quality of biomedical ontologies. This knowledge can propel these systems forward by aiding them in predicting more explainable biomedical associations. With this in mind, we developed K-RET, a novel, knowledgeable biomedical RE system that, for the first time, injects knowledge by handling different types of associations, multiple sources and where to apply it, and multi-token entities. RESULTS: We tested K-RET on three independent and open-access corpora (DDI, BC5CDR, and PGR) using four biomedical ontologies handling different entities. K-RET improved state-of-the-art results by 2.68% on average, with the DDI Corpus yielding the most significant boost in performance, from 79.30% to 87.19% in F-measure, representing a P-value of 2.91×10⁻¹². AVAILABILITY AND IMPLEMENTATION: https://github.com/lasigeBioTM/K-RET.

Subject(s)

Biological Ontologies , Data Mining , Data Mining/methods , Databases, Factual

2.

DGH-GO: dissecting the genetic heterogeneity of complex diseases using gene ontology.

Asif, Muhammad; Martiniano, Hugo F M C; Lamurias, Andre; Kausar, Samina; Couto, Francisco M.

BMC Bioinformatics ; 24(1): 171, 2023 Apr 26.

Article in English | MEDLINE | ID: mdl-37101154

ABSTRACT

BACKGROUND: Complex diseases such as neurodevelopmental disorders (NDDs) exhibit multiple etiologies. The multi-etiological nature of complex diseases emerges from distinct but functionally similar groups of genes. Different diseases sharing genes of such groups show related clinical outcomes that further restrict our understanding of disease mechanisms, thus limiting the applications of personalized medicine approaches to complex genetic disorders. RESULTS: Here, we present an interactive and user-friendly application, called DGH-GO. DGH-GO allows biologists to dissect the genetic heterogeneity of complex diseases by stratifying the putative disease-causing genes into clusters that may contribute to distinct disease outcome development. It can also be used to study the shared etiology of complex diseases. DGH-GO creates a semantic similarity matrix for the input genes by using Gene Ontology (GO). The resultant matrix can be visualized in 2D plots using different dimension reduction methods (t-SNE, principal component analysis, UMAP and principal coordinate analysis). In the next step, clusters of functionally similar genes are identified from the genes' functional similarities assessed through GO. This is achieved by employing four different clustering methods (K-means, Hierarchical, Fuzzy and PAM). The user may change the clustering parameters and explore their effect on stratification immediately. DGH-GO was applied to genes disrupted by rare genetic variants in Autism Spectrum Disorder (ASD) patients. The analysis confirmed the multi-etiological nature of ASD by identifying four clusters of genes that were enriched for distinct biological mechanisms and clinical outcomes. In the second case study, the analysis of genes shared by different NDDs showed that genes causing multiple disorders tend to aggregate in similar clusters, indicating a possible shared etiology.
CONCLUSION: DGH-GO is a user-friendly application that allows biologists to study the multi-etiological nature of complex diseases by dissecting their genetic heterogeneity. In summary, functional similarities, dimension reduction and clustering methods, coupled with interactive visualization and control over analysis, allow biologists to explore and analyze their datasets without requiring expert knowledge of these methods. The source code of the proposed application is available at https://github.com/Muh-Asif/DGH-GO.

Subject(s)

Autism Spectrum Disorder , Genetic Heterogeneity , Humans , Gene Ontology , Autism Spectrum Disorder/genetics , Software
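
Illustrative note on entry 2: the abstract describes clustering genes from a semantic similarity matrix, with PAM listed among the methods. The sketch below is a simplified alternating k-medoids, not the full PAM swap procedure and not DGH-GO's own code. The similarity values, the `1 - similarity` distance conversion, and the function name `k_medoids` are all assumptions made for illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Simplified k-medoids on a precomputed distance matrix (list of lists).

    Alternates between assigning items to the nearest medoid and picking,
    inside each cluster, the member with the smallest total distance to the
    other members. Starts from the first k items (a deterministic, naive choice).
    """
    n = len(dist)
    medoids = list(range(k))
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its closest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: choose the most central member of each cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster emptied
                new_medoids.append(medoids[c])
                continue
            best = min(members, key=lambda j: sum(dist[j][i] for i in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Hypothetical semantic similarity matrix for four genes (values invented).
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
# Assumption: turn similarity into distance as 1 - similarity.
distance = [[1.0 - s for s in row] for row in similarity]

labels, medoids = k_medoids(distance, k=2)
print(labels, medoids)
```

On this toy matrix the first two genes end up in one cluster and the last two in another; changing `k` mirrors the kind of parameter exploration the abstract mentions.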

3.

SeEn: Sequential enriched datasets for sequence-aware recommendations.

Barros, Marcia; Moitinho, André; Couto, Francisco M.

Sci Data ; 9(1): 478, 2022 08 04.

Article in English | MEDLINE | ID: mdl-35927282

ABSTRACT

The recommendation of items based on users' past sequential preferences has evolved in the last few years, mostly due to deep learning approaches, such as BERT4Rec. However, in scientific fields, recommender systems for recommending the next best item are not widely used. The main goal of this work is to improve the results for the recommendation of the next best item in scientific domains using sequence-aware datasets and algorithms. In the first part of this work, we present the adaptation of a previous method (LIBRETTI) for creating sequential recommendation datasets for scientific fields. The results were assessed in Astronomy and Chemistry. In the second part of this work, we propose a new approach to improve the datasets, not the algorithms, to obtain better recommendations. The new hybrid approach is called sequential enrichment (SeEn), which consists of adding to a sequence of items the n most similar items after each original item. The results show that the enriched sequences obtained better results than the original ones. The Chemistry dataset improved by approximately seven percentage points and the Astronomy dataset by 16 percentage points for Hit Ratio and Normalized Discounted Cumulative Gain.
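
Illustrative note on entry 3: the enrichment step described in the abstract (insert the n most similar items after each original item) can be sketched as follows. The abstract does not say how similarity is computed, so `most_similar` is a hypothetical callback and the toy lookup table is invented.

```python
from typing import Callable, List, Sequence


def enrich_sequence(
    sequence: Sequence[str],
    most_similar: Callable[[str], List[str]],
    n: int,
) -> List[str]:
    """Return a new sequence where each item is followed by its n most similar items.

    `most_similar(item)` is assumed to return candidates ordered from most to
    least similar; fewer than n candidates simply yields a shorter enrichment.
    """
    enriched: List[str] = []
    for item in sequence:
        enriched.append(item)                    # keep the original item in place
        enriched.extend(most_similar(item)[:n])  # then append its top-n neighbours
    return enriched


# Hypothetical similarity lookup, for illustration only.
SIMILAR = {"a": ["a1", "a2", "a3"], "b": ["b1"], "c": []}


def toy_most_similar(item: str) -> List[str]:
    return SIMILAR.get(item, [])


print(enrich_sequence(["a", "b", "c"], toy_most_similar, n=2))
# ['a', 'a1', 'a2', 'b', 'b1', 'c']
```

The original order of the sequence is preserved, so a sequence-aware recommender can be trained on the enriched data without any change to the algorithm itself.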

4.

NILINKER: Attention-based approach to NIL Entity Linking.

Ruas, Pedro; Couto, Francisco M.

J Biomed Inform ; 132: 104137, 2022 08.

Article in English | MEDLINE | ID: mdl-35811025

ABSTRACT

The existence of unlinkable (NIL) entities is a major hurdle affecting the performance of Named Entity Linking approaches, and, consequently, the performance of downstream models that depend on them. Existing approaches to deal with NIL entities focus mainly on clustering and prediction and are limited to general entities. However, other domains, such as the biomedical sciences, are also prone to the existence of NIL entities, given the growing nature of scientific literature. We propose NILINKER, a model that includes a candidate retrieval module for biomedical NIL entities and a neural network that leverages the attention mechanism to find the top-k relevant concepts from target Knowledge Bases (MEDIC, CTD-Chemicals, ChEBI, HP, CTD-Anatomy and Gene Ontology-Biological Process) that may partially represent a given NIL entity. We also make available a new evaluation dataset called EvaNIL, suitable for training and evaluating models focusing on the NIL entity linking task. This dataset contains 846,165 documents (abstracts and full-text biomedical articles), including 1,071,776 annotations, distributed across six different partitions: EvaNIL-MEDIC, EvaNIL-CTD-Chemicals, EvaNIL-ChEBI, EvaNIL-HP, EvaNIL-CTD-Anatomy and EvaNIL-Gene Ontology-Biological Process. NILINKER was integrated into a graph-based Named Entity Linking model (REEL) and the results of the experiments show that this approach is able to increase the performance of the Named Entity Linking model.

Subject(s)

Data Mining , Neural Networks, Computer , Cluster Analysis , Data Mining/methods , Gene Ontology , Knowledge Bases

5.

Biomedical Relation Extraction With Knowledge Graph-Based Recommendations.

Sousa, Diana; Couto, Francisco M.

IEEE J Biomed Health Inform ; 26(8): 4207-4217, 2022 08.

Article in English | MEDLINE | ID: mdl-35536818

ABSTRACT

Biomedical Relation Extraction (RE) systems identify and classify relations between biomedical entities to enhance our knowledge of biological and medical processes. Most state-of-the-art systems use deep learning approaches, mainly to target relations between entities of the same type, such as proteins or pharmacological substances. However, these systems are mostly restricted to what they directly identify in the text and ignore specialized domain knowledge bases, such as ontologies, that formalize and integrate biomedical information typically structured as directed acyclic graphs. On the other hand, Knowledge Graph (KG)-based recommendation systems already showed the importance of integrating KGs to add features to items. Typical systems have users as people and items that can range from movies to books, which people saw or read and classified according to their satisfaction rate. This work proposes to integrate KGs into biomedical RE through a recommendation model to further improve their range of action. We developed a new RE system, named K-BiOnt, by integrating a baseline state-of-the-art deep biomedical RE system with an existing KG-based recommendation state-of-the-art system. Our results show that adding recommendations from KG-based recommendation improves the system's ability to identify true relations that the baseline deep RE model could not extract from the text. The code supporting this system is available at https://github.com/lasigeBioTM/K-BiOnt.

Subject(s)

Knowledge Bases , Pattern Recognition, Automated , Humans

6.

Prediction of Prostate Cancer Disease Aggressiveness Using Bi-Parametric MRI Radiomics.

Rodrigues, Ana; Santinha, João; Galvão, Bernardo; Matos, Celso; Couto, Francisco M; Papanikolaou, Nickolas.

Cancers (Basel) ; 13(23), 2021 Dec 01.

Article in English | MEDLINE | ID: mdl-34885175

ABSTRACT

Prostate cancer is one of the most prevalent cancers in the male population. Its diagnosis and classification rely on nonspecific measures such as PSA levels and DRE, followed by biopsy, where an aggressiveness level is assigned in the form of a Gleason Score. Efforts have been made in the past to use radiomics coupled with machine learning to predict prostate cancer aggressiveness from clinical images, showing promising results. Thus, the main goal of this work was to develop supervised machine learning models exploiting radiomic features extracted from bpMRI examinations, to predict biological aggressiveness; 288 classifiers were developed, corresponding to different combinations of pipeline aspects, namely, type of input data, sampling strategy, feature selection method and machine learning algorithm. On a cohort of 281 lesions from 183 patients, it was found that (1) radiomic features extracted from the lesion volume of interest were less stable to segmentation than the equivalent extraction from the whole gland volume of interest; and (2) radiomic features extracted from the whole gland volume of interest produced higher performance and less overfitted classifiers than radiomic features extracted from the lesion volumes of interest. This result suggests that the areas surrounding the tumour lesions offer relevant information regarding the Gleason Score that is ultimately attributed to that lesion.

7.

COVID-19 recommender system based on an annotated multilingual corpus.

Barros, Márcia; Ruas, Pedro; Sousa, Diana; Bangash, Ali Haider; Couto, Francisco M.

Genomics Inform ; 19(3): e24, 2021 Sep.

Article in English | MEDLINE | ID: mdl-34638171

ABSTRACT

Tracking the most recent advances in Coronavirus disease 2019 (COVID-19)-related research is essential, given the disease's novelty and its impact on society. However, with the publication pace speeding up, researchers and clinicians require automatic approaches to keep up with the incoming information regarding this disease. A solution to this problem requires the development of text mining pipelines, the efficiency of which strongly depends on the availability of curated corpora. However, there is a lack of COVID-19-related corpora, even more so when considering languages other than English. This project's main contribution was the annotation of a multilingual parallel corpus and the generation of a recommendation dataset (EN-PT and EN-ES) regarding relevant entities, their relations, and recommendation, providing this resource to the community to improve the text mining research on COVID-19-related literature. This work was developed during the 7th Biomedical Linked Annotation Hackathon (BLAH7).

8.

Text Mining for Building Biomedical Networks Using Cancer as a Case Study.

Conceição, Sofia I R; Couto, Francisco M.

Biomolecules ; 11(10), 2021 09 29.

Article in English | MEDLINE | ID: mdl-34680062

ABSTRACT

In the assembly of biological networks, it is important to provide reliable interactions in an effort to have the most accurate possible representation of real-life systems. Commonly, the data used to build a network comes from diverse high-throughput assays; however, most of the interaction data is available through scientific literature. This has become a challenge with the notable increase in scientific literature being published, as it is hard for human curators to track all recent discoveries without using efficient tools to help them identify these interactions in an automatic way. This can be overcome by using text mining approaches, which are capable of extracting knowledge from scientific documents. One of the most important tasks in text mining for biological network building is relation extraction, which identifies relations between the entities of interest. Many interaction databases already use text mining systems, and the development of these tools will lead to more reliable networks, as well as the possibility to personalize the networks by selecting the desired relations. This review will focus on different approaches of automatic information extraction from biomedical text that can be used to enhance existing networks or create new ones, such as deep learning state-of-the-art approaches, focusing on cancer disease as a case study.

Subject(s)

Data Mining , Neoplasms/genetics , Computational Biology , Databases, Factual , Humans

9.

Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer.

Lamurias, Andre; Jesus, Sofia; Neveu, Vanessa; Salek, Reza M; Couto, Francisco M.

Front Res Metr Anal ; 6: 689264, 2021.

Article in English | MEDLINE | ID: mdl-34490412

ABSTRACT

Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process. Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article's relevance based on different datasets made of titles, abstracts and metadata. Results: The top performance classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database. Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarkers datasets or be adapted to assist the manual curation process of similar chemical or disease databases.
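
Illustrative note on entry 9: the F2-score reported in the abstract is the F-beta measure with beta = 2, which weights recall more heavily than precision. That fits a screening task where missing a relevant article costs more than reading an irrelevant one. The sketch below implements the standard F-beta formula; the precision and recall values in the example are invented and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention: avoid division by zero when both are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Invented values: swapping precision and recall changes F2 but not F1.
high_recall = f_beta(precision=0.5, recall=1.0, beta=2.0)
high_precision = f_beta(precision=1.0, recall=0.5, beta=2.0)
print(round(high_recall, 4), round(high_precision, 4))
```

With beta = 2 the high-recall classifier scores higher than the high-precision one, while with beta = 1 the two would tie.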

10.

Hybrid semantic recommender system for chemical compounds in large-scale datasets.

Barros, Marcia; Moitinho, Andre; Couto, Francisco M.

J Cheminform ; 13(1): 15, 2021 Feb 23.

Article in English | MEDLINE | ID: mdl-33622374

ABSTRACT

The large, and increasing, number of chemical compounds poses challenges to the exploration of such datasets. In this work, we propose the usage of recommender systems to identify compounds of interest to scientific researchers. Our approach consists of a hybrid recommender model suitable for implicit feedback datasets and focused on retrieving a ranked list according to the relevance of the items. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares and Bayesian Personalized Ranking) and a new content-based algorithm, using the semantic similarity between the chemical compounds in the ChEBI ontology. The algorithms were assessed on an implicit dataset of chemical compounds, CheRM-20, with more than 16,000 items (chemical compounds). The hybrid model was able to improve the results of the collaborative-filtering algorithms by more than ten percentage points in most of the assessed evaluation metrics.

11.

Using Neural Networks for Relation Extraction from Biomedical Literature.

Sousa, Diana; Lamurias, Andre; Couto, Francisco M.

Methods Mol Biol ; 2190: 289-305, 2021.

Article in English | MEDLINE | ID: mdl-32804372

ABSTRACT

Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results.

Subject(s)

Biomedical Research/methods , Data Mining/methods , Neural Networks, Computer , Algorithms , Biological Ontologies , Databases, Factual , Publications

12.

A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing.

Sousa, Diana; Lamurias, Andre; Couto, Francisco M.

Database (Oxford) ; 2020, 2020 12 01.

Article in English | MEDLINE | ID: mdl-33258966

ABSTRACT

Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. Researchers have little power to control who engages in crowdsourcing platforms, how, and in what context. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype-gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. Also, for Task 2, we added an extra rater on-site and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, creating a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with the original PGR dataset, as well as combinations between the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset is available at https://github.com/lasigeBioTM/PGR-crowd.

Subject(s)

Crowdsourcing , Humans , Knowledge Bases

13.

Improving accessibility and distinction between negative results in biomedical relation extraction.

Sousa, Diana; Lamurias, Andre; Couto, Francisco M.

Genomics Inform ; 18(2): e20, 2020 Jun.

Article in English | MEDLINE | ID: mdl-32634874

ABSTRACT

Accessible negative results are relevant for researchers and clinicians not only to limit their search space but also to prevent the costly re-exploration of research hypotheses. However, most biomedical relation extraction datasets do not seek to distinguish between a false and a negative relation between two biomedical entities. Furthermore, datasets created using distant supervision techniques also have some false negative relations that constitute undocumented/unknown relations (missing from a knowledge base). We propose to improve the distinction between these concepts by revising a subset of the relations marked as false on the phenotype-gene relations corpus and taking the first steps toward automatically distinguishing between the false (F), negative (N), and unknown (U) results. Our work resulted in a sample of 127 manually annotated FNU relations and a weighted-F1 of 0.5609 for their automatic distinction. This work was developed during the 6th Biomedical Linked Annotation Hackathon (BLAH6).

14.

Identification of biological mechanisms underlying a multidimensional ASD phenotype using machine learning.

Asif, Muhammad; Martiniano, Hugo F M C; Marques, Ana Rita; Santos, João Xavier; Vilela, Joana; Rasga, Celia; Oliveira, Guiomar; Couto, Francisco M; Vicente, Astrid M.

Transl Psychiatry ; 10(1): 43, 2020 01 28.

Article in English | MEDLINE | ID: mdl-32066720

ABSTRACT

The complex genetic architecture of Autism Spectrum Disorder (ASD) and its heterogeneous phenotype make molecular diagnosis and patient prognosis challenging tasks. To establish more precise genotype-phenotype correlations in ASD, we developed a novel machine-learning integrative approach, which seeks to delineate associations between patients' clinical profiles and disrupted biological processes, inferred from their copy number variants (CNVs) that span brain genes. Clustering analysis of the relevant clinical measures from 2446 ASD cases in the Autism Genome Project identified two distinct phenotypic subgroups. Patients in these clusters differed significantly in ADOS-defined severity, adaptive behavior profiles, intellectual ability, and verbal status, the latter contributing the most to cluster stability and cohesion. Functional enrichment analysis of brain genes disrupted by CNVs in these ASD cases identified 15 statistically significant biological processes, including cell adhesion, neural development, cognition, and polyubiquitination, in line with previous ASD findings. A Naive Bayes classifier, generated to predict the ASD phenotypic clusters from disrupted biological processes, achieved predictions with a high precision (0.82) but low recall (0.39), for a subset of patients with higher biological Information Content scores. This study shows that milder and more severe clinical presentations can have distinct underlying biological mechanisms. It further highlights how machine-learning approaches can reduce clinical heterogeneity by using multidimensional clinical measures, and establishes genotype-phenotype correlations in ASD. However, predictions are strongly dependent on patients' information content. Findings are therefore a first step toward the translation of genetic information into clinically useful applications, and emphasize the need for larger datasets with very complete clinical and biological information.

Subject(s)

Autism Spectrum Disorder , Autism Spectrum Disorder/genetics , Bayes Theorem , DNA Copy Number Variations , Humans , Machine Learning , Phenotype

15.

Linking chemical and disease entities to ontologies by integrating PageRank with extracted relations from literature.

Ruas, Pedro; Lamurias, Andre; Couto, Francisco M.

J Cheminform ; 12(1): 57, 2020 Sep 21.

Article in English | MEDLINE | ID: mdl-33430995

ABSTRACT

BACKGROUND: Named Entity Linking systems are a powerful aid to the manual curation of digital libraries, which is getting increasingly costly and inefficient due to the information overload. Models based on the Personalized PageRank (PPR) algorithm are one of the state-of-the-art approaches, but these have low performance when the disambiguation graphs are sparse. FINDINGS: This work proposes a Named Entity Linking framework called Relation Extraction for Entity Linking (REEL) that uses automatically extracted relations to overcome this limitation. Our method builds a disambiguation graph, where the nodes are the ontology candidates for the entities and the edges are added according to the relations established in the text, which the method extracts automatically. The PPR algorithm and the information content of each ontology are then applied to choose the candidate for each entity that maximises the coherence of the disambiguation graph. We evaluated the method on three gold standards: the subset of the CRAFT corpus with ChEBI annotations (CRAFT-ChEBI), the subset of the BC5CDR corpus with disease annotations from the MEDIC vocabulary (BC5CDR-Diseases) and the subset with chemical annotations from the CTD-Chemical vocabulary (BC5CDR-Chemicals). The F1-Score achieved by REEL was 85.8%, 80.9% and 90.3% in these gold standards, respectively, outperforming baseline approaches. CONCLUSIONS: We demonstrated that RE tools can improve Named Entity Linking by capturing semantic information expressed in text but missing in Knowledge Bases and using it to improve the disambiguation graph of Named Entity Linking models. REEL can be adapted to any text mining pipeline and potentially to any domain, as long as there is an ontology or other Knowledge Base available.

16.

DNA-SeAl: Sensitivity Levels to Optimize the Performance of Privacy-Preserving DNA Alignment.

Fernandes, Maria; Decouchant, Jeremie; Volp, Marcus; Couto, Francisco M; Esteves-Verissimo, Paulo.

IEEE J Biomed Health Inform ; 24(3): 907-915, 2020 03.

Article in English | MEDLINE | ID: mdl-31265423

ABSTRACT

The advent of next-generation sequencing (NGS) machines made DNA sequencing cheaper, but also put pressure on the genomic life-cycle, which includes aligning millions of short DNA sequences, called reads, to a reference genome. On the performance side, efficient algorithms have been developed, and parallelized on public clouds. On the privacy side, since genomic data are utterly sensitive, several cryptographic mechanisms have been proposed to align reads more securely than the former, but with a lower performance. This paper presents DNA-SeAl, a novel contribution to improving the privacy × performance product in current genomic workflows. First, building on recent works that argue that genomic data needs to be treated according to a threat-risk analysis, we introduce a multi-level sensitivity classification of genomic variations designed to prevent the amplification of possible privacy attacks. We show that the usage of sensitivity levels reduces future re-identification risks, and that their partitioning helps prevent linkage attacks. Second, after extending this classification to reads, we show how to align and store reads using different security levels. To do so, DNA-SeAl extends a recent reads filter to classify unaligned reads into sensitivity levels, and adapts existing alignment algorithms to the reads' sensitivity. We show that using DNA-SeAl allows high performance gains whilst enforcing high privacy levels in hybrid cloud environments.

Subject(s)

Computer Security , Confidentiality , Sequence Alignment/methods , Sequence Analysis, DNA/methods , DNA/analysis , DNA/genetics , Databases, Genetic , Genomics , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide/genetics

17.

PPR-SSM: personalized PageRank and semantic similarity measures for entity linking.

Lamurias, Andre; Ruas, Pedro; Couto, Francisco M.

BMC Bioinformatics ; 20(1): 534, 2019 Oct 29.

Article in English | MEDLINE | ID: mdl-31664891

ABSTRACT

BACKGROUND: Biomedical literature concerns a wide range of concepts, requiring controlled vocabularies to maintain a consistent terminology across different research groups. However, as new concepts are introduced, biomedical literature is prone to ambiguity, specifically in fields that are advancing more rapidly, for example, drug design and development. Entity linking is a text mining task that aims at linking entities mentioned in the literature to concepts in a knowledge base. For example, entity linking can help find all documents that mention the same concept and improve relation extraction methods. Existing approaches focus on the local similarity of each entity and the global coherence of all entities in a document, but do not take into account the semantics of the domain. RESULTS: We propose a method, PPR-SSM, to link entities found in documents to concepts from domain-specific ontologies. Our method is based on Personalized PageRank (PPR), using the relations of the ontology to generate a graph of candidate concepts for the mentioned entities. We demonstrate how the knowledge encoded in a domain-specific ontology can be used to calculate the coherence of a set of candidate concepts, improving the accuracy of entity linking. Furthermore, we explore weighting the edges between candidate concepts using semantic similarity measures (SSM). We show how PPR-SSM can be used to effectively link named entities to biomedical ontologies, namely chemical compounds, phenotypes, and gene-product localization and processes. CONCLUSIONS: We demonstrated that PPR-SSM outperforms state-of-the-art entity linking methods in four distinct gold standards, by taking advantage of the semantic information contained in ontologies. Moreover, PPR-SSM is a graph-based method that does not require training data. Our method improved the entity linking accuracy of chemical compounds by 0.1385 when compared to a method that does not use SSMs.

Subject(s)

Semantics , Biological Ontologies , Data Mining/methods , Databases, Factual , Humans , Knowledge Bases , Vocabulary, Controlled
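
Illustrative note on entries 15 and 17: both abstracts rely on Personalized PageRank (PPR) over a graph of candidate concepts. The sketch below is a generic, unweighted PPR computed by power iteration. It is not the PPR-SSM or REEL implementation, and it omits the semantic-similarity edge weights and information content those abstracts mention. The graph, node names and restart weights are hypothetical.

```python
def personalized_pagerank(graph, restart, damping=0.85, tol=1e-10, max_iter=1000):
    """Generic Personalized PageRank by power iteration.

    graph:   dict mapping each node to a list of nodes it links to
             (every linked node must also be a key of `graph`).
    restart: dict mapping nodes to non-negative restart weights
             (at least one weight must be positive).
    """
    nodes = list(graph)
    total = sum(restart.get(v, 0.0) for v in nodes)
    p = {v: restart.get(v, 0.0) / total for v in nodes}  # normalised restart vector
    rank = dict(p)
    for _ in range(max_iter):
        # Every node first receives its share of the restart probability.
        nxt = {v: (1.0 - damping) * p[v] for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for u in targets:
                    nxt[u] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for u in nodes:
                    nxt[u] += damping * rank[v] * p[u]
        if sum(abs(nxt[v] - rank[v]) for v in nodes) < tol:
            return nxt
        rank = nxt
    return rank


# Hypothetical candidate graph: a path a - b - c, restarting only at "a".
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = personalized_pagerank(toy_graph, restart={"a": 1.0})
print({node: round(score, 4) for node, score in scores.items()})
```

Because the walk restarts at "a", that node scores higher than "c" even though both have the same degree. In an entity linking setting, the restart vector would bias the ranking toward candidates consistent with the rest of the document.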

18.

FunVar: A systematic pipeline to unravel the convergence patterns of genetic variants in ASD, a paradigmatic complex disease.

Asif, Muhammad; Vicente, Astrid M; Couto, Francisco M.

J Biomed Inform ; 98: 103273, 2019 10.

Article in English | MEDLINE | ID: mdl-31454647

ABSTRACT

In recent years, the technological advances for capturing genetic variation in large populations led to the identification of large numbers of putative or disease-causing variants. However, their mechanistic understanding is lagging far behind and has posed new challenges regarding their relevance for disease phenotypes, particularly for common complex disorders. In this study, we propose a systematic pipeline to infer biological meaning from genetic variants, namely rare Copy Number Variants (CNVs). The pipeline consists of three modules that seek to (1) improve genetic data quality by excluding low-confidence CNVs, (2) identify disrupted biological processes, and (3) aggregate similar enriched biological process terms using semantic similarity. The proposed pipeline was applied to CNVs from individuals diagnosed with Autism Spectrum Disorder (ASD). We found that rare CNVs disrupting brain-expressed genes dysregulated a wide range of biological processes, such as nervous system development and protein polyubiquitination. The disrupted biological processes identified in ASD patients were in accordance with previous findings. This coherence with literature indicates the feasibility of the proposed pipeline in interpreting the biological role of genetic variants in complex disease development. The suggested pipeline is easily adjustable at each step and its independence from any specific dataset and software makes it an effective tool in analyzing existing genetic resources. The FunVar pipeline is available at https://github.com/lasigeBioTM/FunVar and includes pre- and post-processing steps to effectively interpret biological mechanisms of putative disease-causing genetic variants.

Subject(s)

Autism Spectrum Disorder/diagnosis , Autism Spectrum Disorder/genetics , Computational Biology/methods , DNA Copy Number Variations , Polymorphism, Single Nucleotide , Algorithms , Databases, Genetic , Gene Dosage , Genetic Predisposition to Disease , Genome, Human , Genomics , Genotype , Humans , Nervous System , Phenotype , Semantics , Software

19.

Introduction.

Couto, Francisco M.

Adv Exp Med Biol ; 1137: 1-8, 2019.

Article in English | MEDLINE | ID: mdl-31183816

ABSTRACT

Health and Life studies are well known for the huge amount of data they produce, such as high-throughput sequencing projects (Stephens et al., PLoS Biol 13(7):e1002195, 2015; Hey et al., The fourth paradigm: data-intensive scientific discovery, vol 1. Microsoft research Redmond, Redmond, 2009). However, the value of the data should not be measured by its amount, but instead by the possibility and ability of researchers to retrieve and process it (Leonelli, Data-centric biology: a philosophical study. University of Chicago Press, Chicago, 2016). Transparency, openness, and reproducibility are key aspects to boost the discovery of novel insights into how living systems work (Nosek et al., Science 348(6242):1422-1425, 2015).

Subject(s)

Computational Biology , Data Analysis , High-Throughput Nucleotide Sequencing , Reproducibility of Results

20.

Data Retrieval.

Couto, Francisco M.

Adv Exp Med Biol ; 1137: 17-43, 2019.

Article in English | MEDLINE | ID: mdl-31183818

ABSTRACT

This chapter starts by introducing an example of how we can retrieve text, where every step is done manually. The chapter will describe step-by-step how we can automate each step of the example using shell script commands, which will be introduced and explained as they are required. The goal is to equip the reader with a basic set of skills to retrieve data from any online database and follow the links to retrieve more information from other sources, such as literature.

Subject(s)

Databases, Factual , Information Storage and Retrieval , Programming Languages , Internet
