Results 1 - 20 of 78
1.
BMC Bioinformatics ; 25(1): 62, 2024 Feb 07.
Article in English | MEDLINE | ID: mdl-38326757

ABSTRACT

BACKGROUND: Recent developments in the domain of biomedical knowledge bases (KBs) open up new ways to exploit the biomedical knowledge they contain. Significant work has been done on biomedical KB creation and KB completion, specifically for KBs holding gene-disease associations and other related entities. However, the use of such biomedical KBs in combination with patients' temporal clinical data remains largely unexplored, yet has the potential to immensely benefit medical diagnostic decision support systems. RESULTS: We propose two new algorithms, LOADDx and SCADDx, which combine a patient's gene expression data with gene-disease associations and other related information available in a KB to assist personalized disease diagnosis. We have tested both algorithms on two KBs and on four real-world gene expression datasets of respiratory viral infection caused by influenza-like viruses of 19 subtypes. We also compare the performance of the proposed algorithms with that of five existing state-of-the-art machine learning algorithms (k-NN, Random Forest, XGBoost, Linear SVM, and SVM with RBF kernel) using two validation approaches: LOOCV and a single internal validation set. Both SCADDx and LOADDx outperform the existing algorithms under both validation approaches. SCADDx is able to detect infections with up to 100% accuracy in the cases of Datasets 2 and 3. Overall, SCADDx and LOADDx detect an infection within 72 h of infection with average accuracies of 91.38% and 92.66% respectively across all four datasets, whereas XGBoost, which performed best among the existing machine learning algorithms, detects the infection with only 86.43% accuracy on average.
CONCLUSIONS: We demonstrate how our novel idea of using the most and least differentially expressed genes in combination with a KB can enable identification of the diseases that a patient is most likely to have at a particular time, from a KB with thousands of diseases. Moreover, the proposed algorithms can provide a short ranked list of the most likely diseases for each patient along with their most affected genes, and other entities linked with them in the KB, which can support health care professionals in their decision-making.
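The core idea above, ranking candidate diseases from a KB by their overlap with a patient's most differentially expressed genes, can be sketched in a few lines. This is a minimal illustration, not the published LOADDx/SCADDx algorithms; the gene names and the toy KB are invented.

```python
# Hypothetical sketch of KB-assisted disease ranking: score each disease by how
# many of the patient's most differentially expressed genes it is linked to in
# the knowledge base. Gene names and the KB below are illustrative only.

def rank_diseases(expression_change, kb, top_k=3):
    """expression_change: gene -> absolute log-fold change vs. baseline.
    kb: disease -> set of associated genes (gene-disease associations)."""
    # take the top-k most differentially expressed genes
    top_genes = set(sorted(expression_change, key=expression_change.get,
                           reverse=True)[:top_k])
    # score each disease by its overlap with those genes
    scores = {d: len(genes & top_genes) for d, genes in kb.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

kb = {  # toy gene-disease associations
    "influenza": {"IFIT1", "OAS1", "MX1"},
    "asthma":    {"IL13", "IL4"},
}
patient = {"IFIT1": 4.2, "OAS1": 3.8, "MX1": 2.9, "IL13": 0.2, "IL4": 0.1}
ranking = rank_diseases(patient, kb)
```

The returned list is exactly the kind of short ranked disease list the conclusions describe, here topped by "influenza" because the patient's most perturbed genes are interferon-induced.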


Subjects
Knowledge Bases, Transcriptome, Humans, Algorithms, Machine Learning
2.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37478371

ABSTRACT

Artificial intelligence (AI) systems utilizing deep neural networks and machine learning (ML) algorithms are widely used for solving critical problems in bioinformatics, biomedical informatics and precision medicine. However, complex ML models that are often perceived as opaque and black-box methods make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. In sensitive areas such as healthcare, explainability and accountability are not only desirable properties but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable AI (XAI) aims to overcome the opaqueness of black-box models and to provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and identify factors that influence their outcomes. However, the majority of the state-of-the-art interpretable ML methods are domain-agnostic and have evolved from fields such as computer vision, automated reasoning or statistics, making direct application to bioinformatics problems challenging without customization and domain adaptation. In this paper, we discuss the importance of explainability and algorithmic transparency in the context of bioinformatics. We provide an overview of model-specific and model-agnostic interpretable ML methods and tools and outline their potential limitations. We discuss how existing interpretable ML methods can be customized and fit to bioinformatics research problems. Further, through case studies in bioimaging, cancer genomics and text mining, we demonstrate how XAI methods can improve transparency and decision fairness. 
Our review aims to provide valuable insights and to serve as a starting point for researchers wanting to enhance explainability and decision transparency while solving bioinformatics problems. GitHub: https://github.com/rezacsedu/XAI-for-bioinformatics.
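One of the model-agnostic interpretability techniques the review family covers is permutation importance: shuffle one feature at a time and measure the drop in accuracy. A minimal from-scratch sketch, with an invented toy model and data:

```python
# A from-scratch permutation-importance sketch (a common model-agnostic XAI
# technique); the tiny model and data are made up for illustration.
import random

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    rng = random.Random(seed)
    # baseline accuracy on the untouched data
    base = sum(model(x) == t for x, t in zip(X, y)) / len(y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)  # break the feature-target link for feature j
            Xp = [x[:j] + (v,) + x[j + 1:] for x, v in zip(X, col)]
            acc = sum(model(x) == t for x, t in zip(Xp, y)) / len(y)
            drops.append(base - acc)
        importances.append(sum(drops) / n_repeats)
    return importances

# toy classifier that only looks at feature 0
model = lambda x: 1 if x[0] > 0 else 0
X = [(1, 5), (-1, 5), (2, -3), (-2, -3)]
y = [1, 0, 1, 0]
imp = permutation_importance(model, X, y)
```

Because the toy model ignores feature 1 entirely, its importance comes out exactly zero, which is the kind of sanity check these methods enable.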


Subjects
Artificial Intelligence, Computational Biology, Humans, Machine Learning, Algorithms, Genomics
3.
J Biomed Semantics ; 13(1): 3, 2022 01 24.
Article in English | MEDLINE | ID: mdl-35073996

ABSTRACT

BACKGROUND: Drug repurposing can improve the return on investment, as it finds new uses for existing drugs. Literature-based analyses exploit factual knowledge on drugs and diseases, e.g. from databases, and combine it with information from scholarly publications. Here we report the use of the Open Discovery Process on scientific literature to identify non-explicit ties between a disease, namely epilepsy, and known drugs, making full use of available epilepsy-specific ontologies. RESULTS: We identified characteristics of epilepsy-specific ontologies to create subsets of documents from the literature; from these subsets we generated ranked lists of co-occurring neurological drug names with varying specificity. These ranked lists showed a high intersection with reference lists of pharmaceutical compounds recommended for the treatment of epilepsy. Furthermore, we performed a drug set enrichment analysis, i.e. a novel scoring function with an adaptive tuning parameter that compares top-k ranked lists while accounting for their varying lengths and the current position in the list. We also provide an overview of the pharmaceutical space in the context of epilepsy, including a final combined ranked list of more than 70 drug names. CONCLUSIONS: Biomedical ontologies are a rich resource that can be combined with text mining to identify drug names for drug repurposing in the domain of epilepsy. The ranking of drug names related to epilepsy benefits patients and researchers, as it enables quick evaluation of the statistical evidence hidden in the scientific literature, which is useful for validating approaches in the drug discovery process.
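Comparing top-k ranked drug-name lists of varying length can be illustrated with a simple average-overlap score (the fraction of shared items at each depth, averaged over depths). This is a generic agreement measure in the spirit of the comparison described, not the paper's exact scoring function; the drug lists are illustrative.

```python
# Sketch of a simple top-k list agreement score (average overlap at each
# depth); a generic measure, not the paper's adaptive scoring function.

def average_overlap(list_a, list_b, k=None):
    k = k or min(len(list_a), len(list_b))
    overlaps = []
    for depth in range(1, k + 1):
        a, b = set(list_a[:depth]), set(list_b[:depth])
        overlaps.append(len(a & b) / depth)  # fraction shared at this depth
    return sum(overlaps) / k

ranked_1 = ["valproate", "carbamazepine", "lamotrigine", "phenytoin"]
ranked_2 = ["valproate", "lamotrigine", "carbamazepine", "topiramate"]
score = average_overlap(ranked_1, ranked_2)
```

The score rewards agreement near the top of the lists, which matters when only the highest-ranked repurposing candidates are taken forward for validation.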


Subjects
Biological Ontologies, Epilepsy, Pharmaceutical Preparations, Data Mining, Drug Repositioning, Epilepsy/drug therapy, Humans
4.
Article in English | MEDLINE | ID: mdl-32750845

ABSTRACT

The study of genetic variants (GVs) can help find correlated population groups, identify cohorts predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs and to understand their complex phenotypic traits. Since the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs) can learn non-linear mappings that transform GV data into representations friendlier to clustering and classification than manual feature selection. In this paper, we propose convolutional embedded networks (CEN), combining two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 Genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons Genome Diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. In turn, the CAE classifier can predict the geographic ethnicity of unknown samples with F1 and Matthews correlation coefficient (MCC) scores of 0.9004 and 0.8245, respectively. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP).
Overall, our approach is transparent and faster than the baseline methods, and scales from 5 to 100 percent of the full human genome.
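Clustering accuracy (ACC), one of the metrics reported above, requires matching arbitrary cluster ids to ground-truth classes before counting hits. A brute-force sketch over label permutations (fine for small numbers of clusters; labels below are invented super-population codes):

```python
# Unsupervised clustering accuracy (ACC): find the cluster-to-class mapping
# that maximizes agreement, then report the hit rate. Brute force over
# permutations; practical only for small cluster counts.
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))  # cluster id -> class label
        hits = sum(mapping[c] == t for c, t in zip(cluster_labels, true_labels))
        best = max(best, hits)
    return best / len(true_labels)

truth = ["EUR", "EUR", "AFR", "AFR", "EAS", "EAS"]
preds = [0, 0, 1, 1, 2, 0]  # cluster ids, e.g. from k-means
acc = clustering_accuracy(truth, preds)
```

At larger scales the optimal mapping is found with the Hungarian algorithm rather than enumeration, but the definition of the metric is the same.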


Subjects
Machine Learning, Neural Networks (Computer), Algorithms, Cluster Analysis, Humans
5.
Brief Bioinform ; 22(1): 393-415, 2021 01 18.
Article in English | MEDLINE | ID: mdl-32008043

ABSTRACT

Clustering is central to much data-driven bioinformatics research and serves as a powerful computational method. In particular, clustering helps analyze unstructured and high-dimensional data in the form of sequences, expressions, texts and images. Further, clustering is used to gain insights into biological processes at the genomic level: e.g. clustering gene expression profiles reveals the natural structure inherent in the data and helps in understanding gene functions, cellular processes, cell subtypes and gene regulation. Clustering approaches, including hierarchical, centroid-based, distribution-based, density-based and self-organizing maps, have long been studied and used in classical machine learning settings. In contrast, deep learning (DL)-based representation and feature learning for clustering have not been reviewed or employed extensively. Since the quality of clustering depends not only on the distribution of data points but also on the learned representation, deep neural networks can effectively transform mappings from a high-dimensional data space into a lower-dimensional feature space, leading to improved clustering results. In this paper, we review state-of-the-art DL-based approaches for cluster analysis that are based on representation learning, which we hope will be useful, particularly for bioinformatics research. Further, we explore in detail the training procedures of DL-based clustering algorithms, point out different clustering quality metrics and evaluate several DL-based approaches on three bioinformatics use cases, including bioimaging, cancer genomics and biomedical text mining. We believe this review and the evaluation results will provide valuable insights and serve as a starting point for researchers wanting to apply DL-based unsupervised methods to solve emerging bioinformatics research problems.
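The "centroid-based" family mentioned above can be made concrete with a minimal k-means loop in plain Python. This is a didactic sketch with naive initialization and toy 2-D points; real analyses would use an optimized library implementation.

```python
# Minimal centroid-based clustering (k-means): alternate between assigning
# points to the nearest centroid and recomputing centroids as group means.

def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[i])))
            groups[j].append(p)
        # update step: recompute centroids as the mean of each group
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids, groups

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, groups = kmeans(points, k=2)
```

DL-based clustering methods replace the raw coordinates here with learned low-dimensional representations, but the assignment/update structure of the clustering step is often the same.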


Subjects
Computational Biology/methods, Deep Learning, Cluster Analysis
6.
J Med Internet Res ; 22(3): e13680, 2020 03 13.
Article in English | MEDLINE | ID: mdl-32167477

ABSTRACT

BACKGROUND: Public health surveillance is based on the continuous and systematic collection, analysis, and interpretation of data. This informs the development of early warning systems to monitor epidemics and documents the impact of intervention measures. The introduction of digital data sources, and specifically sources available on the internet, has impacted the field of public health surveillance. New opportunities enabled by the underlying availability and scale of internet-based sources (IBSs) have paved the way for novel approaches to disease surveillance, the exploration of health communities, and the study of epidemic dynamics. This field is also known as infodemiology or infoveillance. OBJECTIVE: This review aimed to assess research findings regarding the application of IBSs for public health surveillance (infodemiology or infoveillance). To achieve this, we present a comprehensive systematic literature review focusing on these sources and their limitations, the diseases targeted, and commonly applied methods. METHODS: A systematic literature review was conducted targeting publications between 2012 and 2018 that leveraged IBSs for public health surveillance, outbreak forecasting, disease characterization, diagnosis prediction, content analysis, and health-topic identification. The search results were filtered according to previously defined inclusion and exclusion criteria. RESULTS: Spanning a total of 162 publications, we determined infectious diseases to be the preferred case study (108/162, 66.7%). Of the eight categories of IBSs (search queries, social media, news, discussion forums, websites, web encyclopedias, and online obituaries), search queries and social media were applied in 95.1% (154/162) of the reviewed publications. We also identified limitations in representativeness and biased user age groups, as well as high susceptibility to media events, for search queries, social media, and web encyclopedias.
CONCLUSIONS: IBSs are a valuable proxy for studying illnesses affecting the general population; however, it is important to characterize which diseases are best suited to the available sources; the literature shows that the level of engagement on online platforms can be a potential indicator. There is a need to understand the population's online behavior; in addition, the dissemination of health information and its content remain significantly underexplored. With this information, we can understand how the population communicates about illnesses online and, in the process, benefit public health.
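A basic infoveillance check is whether an internet-based signal tracks official counts, possibly with a lead of a few time steps. A sketch using a plain Pearson correlation at a chosen lag; all numbers below are invented weekly series:

```python
# Infoveillance sketch: how well does a weekly search-query signal track
# official case counts, optionally leading by `lag` weeks? Data are invented.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def lagged_correlation(signal, cases, lag):
    # the signal is assumed to lead the cases by `lag` steps
    return pearson(signal[:len(signal) - lag], cases[lag:])

searches = [10, 12, 30, 55, 40, 22, 15]   # query volume
cases    = [1,   2,  9, 28, 50, 35, 20]   # reported cases, peaking a week later
r = lagged_correlation(searches, cases, lag=1)
```

Here the correlation at a one-week lead exceeds the unlagged correlation, which is the property that makes search queries useful for early warning.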


Subjects
Information Dissemination/methods, Public Health Surveillance/methods, Humans, Internet
7.
BMC Bioinformatics ; 20(1): 462, 2019 Sep 09.
Article in English | MEDLINE | ID: mdl-31500564

ABSTRACT

BACKGROUND: Determining associations between tumor samples and genes is demanding because genetic experiments are costly. Moreover, any discovered association between a tumor sample and a gene further requires clinical verification and validation. This entire process is time-consuming and expensive. For these reasons, predicting associations between tumor samples and genes remains a challenge in biomedicine. RESULTS: Here we present a computational model based on a heat diffusion algorithm that can predict associations between tumor samples and genes. We propose a 2-layered graph: in the first layer, tumor-sample and gene nodes are connected by a "hasGene" relationship; in the second layer, gene nodes are connected by an "interaction" relationship. We applied the heat diffusion algorithm to nine different variants of genetic interaction networks extracted from the STRING and BioGRID databases. The heat diffusion algorithm predicted the links between tumor samples and genes with a mean AUC-ROC score of 0.84. This score is obtained by using weighted genetic interactions from the fusion or co-occurrence channels of the STRING database. For the unweighted genetic interactions from the BioGRID database, the algorithm predicts the links with an AUC-ROC score of 0.74. CONCLUSIONS: We demonstrate that gene-gene interaction scores can improve the predictive power of the heat diffusion model for predicting links between tumor samples and genes. We show the efficient runtime of the heat diffusion algorithm on various genetic interaction networks, and we statistically validated the quality of the predicted links between tumor samples and genes.
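Heat diffusion on such a graph can be sketched as an iterative spread: heat is seeded at a tumor-sample node, a fraction diffuses along degree-normalized edges each step, and genes are ranked by the heat they accumulate. This is a minimal illustration with a 3-node toy graph, not the paper's exact formulation.

```python
# Minimal heat-diffusion sketch: seed heat at a tumor-sample node, diffuse over
# (gene-gene and sample-gene) edges, rank genes by received heat. Toy graph.

def heat_diffusion(adj, seeds, alpha=0.5, iters=50):
    """adj: node -> {neighbor: edge weight}; seeds: initial heat per node."""
    heat = dict(seeds)
    for _ in range(iters):
        new = {}
        for node in adj:
            # heat arriving from each neighbor, normalized by its total degree
            incoming = sum(heat.get(nb, 0.0) * w / sum(adj[nb].values())
                           for nb, w in adj[node].items())
            # retain a fraction of the seed heat, diffuse the rest
            new[node] = (1 - alpha) * seeds.get(node, 0.0) + alpha * incoming
        heat = new
    return heat

adj = {  # undirected toy network: sample S -hasGene- gene A; A interacts with B
    "S": {"A": 1.0},
    "A": {"S": 1.0, "B": 1.0},
    "B": {"A": 1.0},
}
heat = heat_diffusion(adj, seeds={"S": 1.0})
```

Gene A, directly linked to the seeded sample, ends up hotter than the indirectly connected gene B; candidate sample-gene links are then read off the gene ranking.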


Subjects
Algorithms, Neoplasm Genes, Neoplasms/genetics, Area Under Curve, DNA Methylation/genetics, Factual Databases, Diffusion, Genetic Epistasis, Gene Regulatory Networks, Humans, ROC Curve, Reproducibility of Results
8.
Article in English | MEDLINE | ID: mdl-31480718

ABSTRACT

Annual suicide figures are critical for identifying trends and guiding research, yet significant lags in reporting can delay and complicate real-time interventions. In this paper, we utilized Google Trends search volumes for behavioral forecasting of national suicide rates in Ireland between 2004 and 2015. Official suicide rates are recorded by the Central Statistics Office in Ireland. While similar investigations using Google Trends data have been carried out in other jurisdictions (e.g., the United Kingdom, the United States of America), such research had not yet been completed in Ireland. We compiled a collection of suicide- and depression-related search terms suggested by Google Trends and manually sourced from the literature. Monthly search volumes at different lags were compared with suicide occurrences to determine the degree of correlation. Following two approaches based on vector autoregression and neural network autoregression, we achieved mean absolute error values between 4.14 and 9.61 when incorporating search query data, with the highest performance for the neural network approach. Applying this process to United Kingdom suicide and search query data showed similar results, supporting the benefit of Google Trends, the neural network approach, and the applied search terms for forecasting increases in suicide risk. Overall, the combination of societal data and online behavior provides a good indication of societal risks; building on past research, our improvements led to robust models integrating search query and unemployment data for suicide risk forecasting in Ireland.
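The idea of regressing the current month's count on lagged counts plus lagged search volumes, then scoring by mean absolute error, can be sketched with a small intercept-free ordinary least squares fit. All numbers are invented and the two-predictor OLS solver below is a generic sketch, not the paper's vector-autoregression setup.

```python
# Sketch of forecasting with lagged predictors: OLS on [previous-month count,
# previous-month search volume], scored by mean absolute error (MAE).

def fit_ols(X, y):
    # solve the normal equations (X^T X) b = X^T y by Gaussian elimination
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for col in range(p):  # forward elimination
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        beta[r] = (b[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return beta

monthly_counts = [30, 32, 35, 33, 36, 38, 37, 40]   # invented official figures
search_volume  = [50, 55, 60, 58, 62, 66, 64, 70]   # invented query volumes
X = [[monthly_counts[t - 1], search_volume[t - 1]] for t in range(1, len(monthly_counts))]
y = monthly_counts[1:]
beta = fit_ols(X, y)
preds = [beta[0] * s + beta[1] * q for s, q in X]
mae = sum(abs(p - t) for p, t in zip(preds, y)) / len(y)
```

The paper's comparison then asks whether adding the search-volume column lowers the MAE relative to an autoregression on counts alone.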


Subjects
Depression, Forecasting, Search Engine/statistics & numerical data, Suicide/statistics & numerical data, Humans, Ireland/epidemiology, Neural Networks (Computer), Risk, Unemployment
9.
Genomics Inform ; 17(2): e14, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31307129

ABSTRACT

The total number of scholarly publications grows day by day, making it necessary to explore and use simple yet effective ways to expose their metadata. Schema.org supports adding structured metadata to web pages via markup, making it easier for data providers and search engines alike to provide the right search results. Bioschemas is based on the standards of schema.org, providing new types, properties and guidelines for metadata, i.e., metadata profiles tailored to the Life Sciences domain. Here we present our proposed contribution to Bioschemas (from the project "Biotea"), which supports metadata contributions for scholarly publications via profiles and web components. Biotea comprises a semantic model that represents publications together with annotated elements recognized from the scientific text; our Biotea model has been mapped to schema.org following Bioschemas standards.
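Schema.org markup for a publication is typically emitted as JSON-LD embedded in the page. A minimal sketch using standard schema.org types; the property selection is a plausible subset for illustration, not the full Bioschemas profile, and the article details are invented.

```python
# Sketch of schema.org-style JSON-LD for a scholarly article; a small,
# illustrative subset of properties, not the complete Bioschemas profile.
import json

article = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "name": "Example article title",
    "author": [{"@type": "Person", "name": "A. Researcher"}],
    "datePublished": "2019-06-01",
    "isPartOf": {"@type": "Periodical", "name": "Genomics Inform"},
    "about": ["metadata", "Bioschemas"],
}
markup = json.dumps(article, indent=2)  # embed in a <script type="application/ld+json">
```

Search engines that crawl the page parse this block to index the publication's metadata without scraping the visible text.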

10.
Sci Rep ; 9(1): 5761, 2019 04 08.
Article in English | MEDLINE | ID: mdl-30962509

ABSTRACT

Knee osteoarthritis (KOA) is a disease that impairs knee function and causes pain. A radiologist reviews knee X-ray images and grades the severity of the impairments according to the Kellgren and Lawrence grading scheme, a five-point ordinal scale (0-4). In this study, we used Elastic Net (EN) and Random Forests (RF) to build predictive models from patient assessment data (i.e. signs and symptoms of both knees and medication use), and a convolutional neural network (CNN) trained on X-ray images only. Linear mixed effect models (LMM) were used to model the within-subject correlation between the two knees. The root mean squared error for the CNN, EN, and RF models was 0.77, 0.97 and 0.94, respectively. The LMM shows similar overall prediction accuracy to the EN regression but correctly accounts for the hierarchical structure of the data, resulting in more reliable inference. Useful explanatory variables were identified that could be used for patient monitoring before X-ray imaging. Our analyses suggest that models trained for predicting KOA severity levels achieve comparable results when modeling X-ray images and patient data. The subjectivity of the KL grade remains a primary concern.
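The models above are compared by root mean squared error on the ordinal KL grade. A quick sketch of that metric on made-up continuous model outputs against integer grades:

```python
# RMSE on the 0-4 Kellgren-Lawrence scale; grades and predictions are invented.

def rmse(true_grades, predicted):
    return (sum((t - p) ** 2 for t, p in zip(true_grades, predicted))
            / len(true_grades)) ** 0.5

kl_true = [0, 1, 2, 3, 4, 2]
kl_pred = [0.3, 1.2, 1.6, 3.4, 3.5, 2.2]  # continuous regression outputs
error = rmse(kl_true, kl_pred)
```

Treating the ordinal grade as a continuous target is what makes RMSE applicable here; an error of 0.77 then means predictions are typically within about one KL grade.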


Subjects
Statistical Models, Knee Osteoarthritis/diagnostic imaging, Computer-Assisted Radiographic Image Interpretation/methods, Aged, Female, Humans, Male, Middle Aged, Neural Networks (Computer), Knee Osteoarthritis/pathology, Prognosis
11.
Brief Bioinform ; 20(4): 1477-1491, 2019 07 19.
Article in English | MEDLINE | ID: mdl-29579141

ABSTRACT

MOTIVATION: Searching for precise terms and terminological definitions in the biomedical data space is problematic, as researchers find overlapping, closely related and even equivalent concepts in a single ontology or across multiple ontologies. Search engines that retrieve ontological resources often suggest an extensive list of search results for a given input term, which leads to the tedious task of selecting the best-fit ontological resource (class or property) for the input term and reduces user confidence in the retrieval engines. A systematic evaluation of these search engines is necessary to understand their strengths and weaknesses under different search requirements. RESULTS: We have implemented seven comparable information retrieval ranking algorithms to search through ontologies and compared them against four search engines for ontologies. Free-text queries were performed, the outcomes were judged by experts, and the ranking algorithms and search engines were evaluated against the expert-based ground truth (GT). In addition, we propose a probabilistic GT, developed automatically, to provide deeper insights into and confidence in the expert-based GT, and to evaluate a broader range of search queries. CONCLUSION: The main outcome of this work is the identification of key search factors for biomedical ontologies, together with search requirements and a set of recommendations that will help biomedical experts and ontology engineers select the best-suited retrieval mechanism for their search scenarios. We expect that this evaluation will allow researchers and practitioners to apply the current search techniques more reliably and help them select the right solution for their daily work. AVAILABILITY: The source code (of the seven ranking algorithms), ground truths and experimental results are available at https://github.com/danielapoliveira/bioont-search-benchmark.
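Evaluating a ranking against an expert ground truth is commonly done with normalized discounted cumulative gain (nDCG), one standard IR metric for graded relevance. A sketch with invented relevance grades for ontology classes returned for a query:

```python
# nDCG sketch: compare an engine's ranking to expert-graded relevance.
# The ontology IDs and relevance grades below are invented for illustration.
from math import log2

def dcg(grades):
    # discounted cumulative gain: later positions contribute less
    return sum(g / log2(i + 2) for i, g in enumerate(grades))

def ndcg(ranked_ids, relevance, k=10):
    got = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(got) / dcg(ideal)

# expert-judged relevance (0-3) of ontology classes for one query
relevance = {"ONT:0001": 3, "ONT:0002": 2, "ONT:0003": 1}
engine_ranking = ["ONT:0002", "ONT:0001", "ONT:0099", "ONT:0003"]
score = ndcg(engine_ranking, relevance)
```

A score of 1.0 means the engine reproduced the expert ordering exactly; here the swap at the top and the irrelevant third hit pull the score below 1.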


Subjects
Biological Ontologies/statistics & numerical data, Algorithms, Computational Biology, Expert Systems, Humans, Information Storage and Retrieval, Statistical Models, Search Engine
12.
J Biomed Semantics ; 9(1): 18, 2018 06 12.
Article in English | MEDLINE | ID: mdl-29895320

ABSTRACT

BACKGROUND: In recent years, Twitter has been applied to monitor diseases through its ability to capture users' comments and concerns in real time. The analysis of tweets for disease mentions should reflect not only user-specific concerns but also disease outbreaks. This requires the use of standard terminological resources and can be focused on selected geographic locations. In our study, we differentiate between hospital and airport locations to better distinguish disease outbreaks from background mentions of disease concerns. RESULTS: Our analysis covers all geolocated tweets over a 6-month period, uses SNOMED CT as a standard medical terminology, and explores language patterns (as well as MetaMap) to identify mentions of diseases in reference to the geolocation of tweets. Contrary to our expectation, hospital and airport geolocations are not suitable for collecting significant portions of tweets concerned with disease outcomes. Overall, geolocated tweets exposed a large number of messages commenting on disease-related news articles. Furthermore, the geolocated messages exposed an over-representation of non-communicable diseases in contrast to infectious diseases. CONCLUSIONS: Our findings suggest that disease mentions on Twitter serve not only to share personal statements but also to share concerns about news articles. In particular, our assumption about the relevance of hospital and airport geolocations for an increased frequency of disease mentions was not borne out. To further address the linguistic cues, we propose the study of health forums to understand how a change of medium affects the language used. Finally, our research on language use may provide essential clues for distinguishing complementary trends in Twitter language when analysing health-related topics.
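The geolocation step, keeping only tweets within a radius of a hospital's or airport's coordinates, can be sketched with the haversine great-circle distance. Coordinates and tweet contents below are fabricated for illustration.

```python
# Sketch of location filtering: keep tweets within `radius_km` of a point of
# interest, using the haversine great-circle distance. Data are invented.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

def near(tweets, lat, lon, radius_km=1.0):
    return [t for t in tweets
            if haversine_km(t["lat"], t["lon"], lat, lon) <= radius_km]

hospital = (53.3465, -6.2925)  # illustrative coordinates
tweets = [
    {"text": "waiting room is packed, bad flu season", "lat": 53.3470, "lon": -6.2930},
    {"text": "lovely day at the beach", "lat": 53.2000, "lon": -6.1000},
]
nearby = near(tweets, *hospital)
```

Messages passing the radius filter would then go through terminology matching (e.g. SNOMED CT concepts) to count disease mentions per location type.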


Subjects
Airports, Data Mining, Disease, Hospitals, Social Media, Geography
13.
Brief Bioinform ; 19(5): 1035-1050, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28419324

ABSTRACT

Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure where the data are already hosted. As standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement would reduce overhead costs and accelerate progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which imposes a need for scalability and high-throughput capabilities on the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we analyze existing DWFSs with regard to their capabilities for public open data use, as well as their large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research, with particular consideration of cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations for the development of future DWFSs for the bioinformatics research community.


Subjects
Cloud Computing, Computational Biology/methods, Workflow, Big Data, Statistical Data Interpretation, Database Management Systems, Drug Discovery/statistics & numerical data, Genomics/statistics & numerical data, Humans, Information Dissemination, Knowledge Bases, Semantic Web/statistics & numerical data, User-Computer Interface
14.
J Biomed Semantics ; 8(1): 40, 2017 Sep 19.
Article in English | MEDLINE | ID: mdl-28927463

ABSTRACT

BACKGROUND: Next Generation Sequencing (NGS) is playing a key role in therapeutic decision making for cancer prognosis and treatment. NGS technologies produce massive sequencing datasets, often published by isolated and disparate sequencing facilities. Consequently, sharing and aggregating multisite sequencing datasets is hindered by issues such as the need to discover relevant data from different sources, build scalable repositories and automate data linkage, as well as by the volume of the data and the need for efficient querying mechanisms and information-rich, intuitive visualisation. RESULTS: We present an approach to link and query different sequencing datasets (TCGA, COSMIC, REACTOME, KEGG and GO) to indicate risks for four cancer types - Ovarian Serous Cystadenocarcinoma (OV), Uterine Corpus Endometrial Carcinoma (UCEC), Uterine Carcinosarcoma (UCS), and Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) - covering the 16 healthy tissue-specific genes from Illumina Human Body Map 2.0. The differentially expressed genes from Illumina Human Body Map 2.0 are analysed together with the gene expressions reported in the COSMIC and TCGA repositories, leading to the discovery of potential biomarkers for tissue-specific cancers. CONCLUSION: We analyse tissue-specific gene expression, copy number variation (CNV), somatic mutation, and promoter methylation to identify associated pathways and find novel biomarkers. We discovered twenty (20) mutated genes and three (3) potential pathways causing promoter changes in different gynaecological cancer types. We propose a data-interlinked platform called BIOOPENER that glues together heterogeneous cancer and biomedical repositories. The key approach is to find correspondences (or data links) among genetic, cellular and molecular features across isolated cancer datasets, giving insight into cancer progression from normal to diseased tissues.
The proposed BIOOPENER platform enriches mutations by filling in missing links across the TCGA, COSMIC, REACTOME, KEGG and GO datasets, and provides an interlinking mechanism for understanding cancer progression from normal to diseased tissues in terms of pathway components, which in turn helps to map mutations, associated phenotypes, pathways and mechanisms.
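The interlinking idea, joining facts about the same gene held in isolated repositories so mutation, expression and pathway information can be combined, reduces to a keyed merge. A minimal sketch with invented toy records standing in for the real TCGA/COSMIC/KEGG contents:

```python
# Sketch of cross-dataset interlinking: merge records about the same gene
# (keyed by gene symbol) from isolated sources. All records are invented.

tcga = {"TP53": {"somatic_mutation": "R175H"}, "PTEN": {"somatic_mutation": "R130G"}}
cosmic = {"TP53": {"expression": "under"}, "KRAS": {"expression": "over"}}
kegg = {"TP53": {"pathway": "p53 signaling"}}

def interlink(*datasets):
    linked = {}
    for ds in datasets:
        for gene, facts in ds.items():
            # accumulate all facts known about this gene across sources
            linked.setdefault(gene, {}).update(facts)
    return linked

linked = interlink(tcga, cosmic, kegg)
```

In the real platform the join keys are data links between RDF resources rather than bare gene symbols, but the effect is the same: one gene record aggregating mutation, expression and pathway facts.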


Subjects
Tumor Biomarkers/metabolism, Female Genital Diseases/metabolism, Medical Informatics/methods, Precision Medicine/methods, Data Mining, Factual Databases, Female, Humans
15.
J Biomed Semantics ; 8(1): 13, 2017 Mar 15.
Article in English | MEDLINE | ID: mdl-28298238

ABSTRACT

BACKGROUND: Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following open linked data principles, ideally as RDF triple data. This is a necessary step towards unified access to biological data sets, but it still requires solutions for querying multiple endpoints for their heterogeneous data to eventually retrieve all the meaningful information. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of the available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. Moreover, the reliability of data resources, in terms of access and quality, has to be monitored over time. Our solution (BioFed) federates data over 130 SPARQL endpoints in the life sciences and tailors query submission according to provenance information. BioFed has been evaluated against the state-of-the-art solution FedX and forms an important benchmark for the life science domain. METHODS: The efficient cataloguing approach of the federated query processing system BioFed, its triple-pattern-wise source selection, and its semantic source normalisation form the core of our solution. It gathers and integrates data from newly identified public endpoints for federated access. Basic provenance information is linked to the retrieved data. Finally, BioFed makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits of query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., DrugBank, SIDER). RESULTS: BioFed provides a single point of access to a large number of SPARQL endpoints offering life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data.
BioFed fully supports SPARQL 1.1 and exposes endpoint availability through the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows competitive execution performance compared with FedX, which can be attributed to the provision of provenance information for the source selection. CONCLUSION: Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could be further improved through the use of ontologies, e.g., for abstract normalisation of query terms.
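The triple-pattern-wise source selection described above can be sketched as a lookup in a predicate catalogue: each query pattern is routed only to endpoints known to serve its predicate. The endpoint URLs and predicates below are hypothetical placeholders, not BioFed's actual catalogue.

```python
# Minimal sketch of triple-pattern-wise source selection over a
# pre-built catalogue mapping each predicate to the endpoints serving it.
# All URLs and predicates are invented for illustration.

CATALOGUE = {
    "http://purl.org/dc/terms/title": {
        "http://ep1.example/sparql", "http://ep2.example/sparql"},
    "http://example.org/vocab/interactsWith": {
        "http://ep2.example/sparql"},
}

def select_sources(triple_patterns):
    """Return, per triple pattern, the endpoints that can answer it.

    A pattern with an unbound predicate (None) must be sent to every
    known endpoint; a bound predicate selects only the endpoints that
    index it, avoiding wasteful requests to irrelevant sources."""
    all_endpoints = set().union(*CATALOGUE.values())
    selection = {}
    for s, p, o in triple_patterns:
        selection[(s, p, o)] = CATALOGUE.get(p, set()) if p else all_endpoints
    return selection

patterns = [
    ("?paper", "http://purl.org/dc/terms/title", "?t"),
    ("?drug", "http://example.org/vocab/interactsWith", "?target"),
]
sel = select_sources(patterns)
```

The second pattern is sent to a single endpoint instead of all of them, which is the kind of pruning that makes federated execution competitive.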


Subjects
Biological Ontologies, Biological Science Disciplines, Data Mining/methods, Data Mining/standards, Internet, Reference Standards, User-Computer Interface
16.
J Biomed Semantics ; 8(1): 5, 2017 Feb 01.
Article in English | MEDLINE | ID: mdl-28148277

ABSTRACT

BACKGROUND: Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains resources are sensitive, and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisations, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has sought to optimise federated queries over such statistical data. RESULTS: We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances: it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change.
CONCLUSIONS: We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.


Subjects
Biological Ontologies, Information Storage and Retrieval, Algorithms, Confidentiality, Databases, Factual, Humans, Software
17.
Brief Bioinform ; 17(5): 819-30, 2016 09.
Article in English | MEDLINE | ID: mdl-26420780

ABSTRACT

Phenotypes have gained increased prominence in the clinical and biological domains owing to their application in numerous areas such as the discovery of disease genes and drug targets, phylogenetics and pharmacogenomics. Phenotypes, defined as observable characteristics of organisms, can be seen as one of the bridges that lead to a translation of experimental findings into clinical applications and thereby support 'bench to bedside' efforts. However, to build this translational bridge, a common and universal understanding of phenotypes is required that goes beyond domain-specific definitions. To achieve this ambitious goal, a digital revolution is ongoing that enables the encoding of data in computer-readable formats and its storage in specialized repositories, ready for integration, enabling translational research. While phenome research is an ongoing endeavor, the true potential hidden in the currently available data still needs to be unlocked, offering exciting opportunities for the forthcoming years. Here, we provide insights into the state of the art in digital phenotyping, by means of representing, acquiring and analyzing phenotype data. In addition, we offer visions for future research work that could enable better applications of phenotype data.


Subjects
Phenotype, Humans, Information Storage and Retrieval, Research Design, Translational Research, Biomedical
18.
Article in English | MEDLINE | ID: mdl-26507285

ABSTRACT

Scientific and clinical phenotypes reported in the experimental literature have been curated manually to build high-quality databases such as the Online Mendelian Inheritance in Man (OMIM). However, the identification and harmonization of phenotype descriptions struggle with the diversity of human expressivity. We introduce a novel automated extraction approach called PhenoMiner that exploits full parsing and conceptual analysis. Apriori association mining is then used to identify relationships to human diseases. We applied PhenoMiner to the BMC open access collection and identified 13,636 phenotype candidates. We identified 28,155 phenotype-disorder hypotheses covering 4898 phenotypes and 1659 Mendelian disorders. Analysis showed: (i) the semantic distribution of the extracted terms against linked ontologies; (ii) a comparison of term overlap with the Human Phenotype Ontology (HP); (iii) moderate support for phenotype-disorder pairs in both OMIM and the literature; (iv) strong associations of phenotype-disorder pairs to known disease-gene pairs using PhenoDigm. The full list of PhenoMiner phenotypes (S1), phenotype-disorder associations (S2), association-filtered linked data (S3) and user database documentation (S5) is available as supplementary data and can be downloaded at http://github.com/nhcollier/PhenoMiner under a Creative Commons Attribution 4.0 license. Database URL: phenominer.mml.cam.ac.uk.
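Deriving phenotype-disorder hypotheses from co-occurrences via association mining, as described above, can be illustrated with a minimal support-counting sketch. The documents, terms and threshold are invented and do not reflect PhenoMiner's actual pipeline.

```python
from collections import Counter
from itertools import product

# Toy corpus: each "document" is the set of terms mentioned in it.
# Terms and documents are invented for illustration.
docs = [
    {"hypertension", "proteinuria", "kidney disease"},
    {"hypertension", "kidney disease"},
    {"hypertension", "headache"},
]
phenotypes = {"hypertension", "proteinuria", "headache"}
disorders = {"kidney disease"}

def mine_pairs(docs, phenotypes, disorders, min_support=2):
    """Count phenotype-disorder co-mentions per document and keep
    pairs whose support (number of co-mentioning documents) reaches
    the threshold, in the spirit of Apriori support counting."""
    counts = Counter()
    for doc in docs:
        for p, d in product(doc & phenotypes, doc & disorders):
            counts[(p, d)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

With the default threshold, only ("hypertension", "kidney disease") survives, since it co-occurs in two documents while ("proteinuria", "kidney disease") appears in just one.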


Subjects
Data Mining, Databases as Topic, Databases, Genetic, Software, Animals, Animals, Congenic, Blood Pressure, Chromosomes, Mammalian/genetics, Diet, Humans, Hypertension/complications, Hypertension/physiopathology, Hypertension/urine, Kidney Diseases/complications, Kidney Diseases/physiopathology, Kidney Diseases/urine, Models, Biological, Phenotype, Rats, Rats, Mutant Strains, Sodium/urine, Systole
19.
BMC Bioinformatics ; 16 Suppl 10: S3, 2015.
Article in English | MEDLINE | ID: mdl-26202680

ABSTRACT

BACKGROUND: The third edition of the BioNLP Shared Task was held with the grand theme of knowledge base (KB) construction. The Genia Event (GE) task was re-designed and implemented in light of this theme. For its final report, the participating systems were evaluated from the perspective of annotation. To further explore the grand theme, we extended the evaluation to the perspective of KB construction. Also, the Gene Regulation Ontology (GRO) task was newly introduced in the third edition. The final evaluation of the participating systems resulted in relatively low performance, which was attributed to the large size and complex semantic representation of the ontology. To investigate potential benefits of resource exchange between the presumably similar tasks, we measured the overlap between the datasets of the two tasks, and tested whether the dataset for one task can be used to enhance performance on the other. RESULTS: We report an extended evaluation of all the participating systems in the GE task, incorporating a KB perspective. For the evaluation, the final submission of each participant was converted to RDF statements and evaluated using 8 queries formulated in SPARQL. The results suggest that the evaluation may be concluded differently from the two perspectives, annotation vs. KB. We also provide a comparison of the GE and GRO tasks by converting their datasets into each other's format. More than 90% of the GE data could be converted into the GRO task format, while only half of the GRO data could be mapped to the GE task format. The imbalance in conversion indicates that the GRO is a comprehensive extension of the GE task ontology. We further used the converted GRO data as additional training data for the GE task, which helped improve the performance of GE task participant systems. However, the converted GE data did not help GRO task participants, due to overfitting and the ontology gap.
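The idea of recasting annotations as statements and evaluating them with graph-pattern queries can be shown with a toy sketch: two stores may agree on a coarse KB-level query while disagreeing at the annotation level. The triples and pattern matcher below are invented stand-ins for the RDF/SPARQL setup used in the task.

```python
# Toy triple stores for a gold standard and a system submission.
# Entity and event identifiers are invented for illustration.
gold = {
    ("ex:evt1", "type", "Positive_regulation"),
    ("ex:evt1", "theme", "ex:geneA"),
}
system = {
    ("ex:evt1", "type", "Positive_regulation"),
    ("ex:evt1", "theme", "ex:geneB"),
}

def match(store, pattern):
    """Return the triples matching a pattern; None acts as a wildcard,
    loosely analogous to an unbound SPARQL variable."""
    return {t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))}

# A KB-level "query": which regulation events does each store assert?
q = (None, "type", "Positive_regulation")
```

Here the stores give identical answers to the type query yet differ on the theme triples, mirroring how a KB-perspective evaluation can reach a different conclusion than an annotation-perspective one.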


Subjects
Gene Expression Regulation, Gene Ontology, Gene Regulatory Networks, Genes, Information Storage and Retrieval, Knowledge Bases, Natural Language Processing, Databases, Factual, Humans, Semantics
20.
J Am Med Inform Assoc ; 22(5): 948-56, 2015 Sep.
Article in English | MEDLINE | ID: mdl-25948699

ABSTRACT

OBJECTIVE: To create a multilingual gold-standard corpus for biomedical concept recognition. MATERIALS AND METHODS: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. RESULTS: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language. DISCUSSION: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. CONCLUSION: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups covered and the diversity of text genres annotated.
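Pairwise agreement of the kind reported above can be computed as an F-score between two annotators' sets of annotations, each annotation being a (start, end, concept) tuple. The spans and concept identifiers below are invented examples, not corpus data.

```python
def f_score(a, b):
    """F1 between two annotation sets: an annotation counts as a true
    positive only if both annotators produced exactly the same
    (start, end, concept) tuple."""
    tp = len(a & b)
    if tp == 0:
        return 0.0
    precision = tp / len(a)
    recall = tp / len(b)
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: character spans paired with UMLS-style CUIs.
ann1 = {(0, 7, "C0020538"), (12, 21, "C0022658")}
ann2 = {(0, 7, "C0020538"), (25, 30, "C0027947")}
```

With one shared annotation out of two per annotator, precision and recall are both 0.5, giving an F-score of 0.5; comparing an annotator against themselves yields 1.0.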


Subjects
Information Storage and Retrieval/methods, Multilingualism, Natural Language Processing, Terminology as Topic, Semantics, Unified Medical Language System