ABSTRACT
MOTIVATION: Transcriptional regulation is performed by transcription factors (TFs) binding to DNA in context-dependent regulatory regions and determines the activation or inhibition of gene expression. Current methods for inferring transcriptional regulatory circuits, based on one or all of TF, region and gene activity measurements, require a large number of samples to rank the candidate TF-gene regulation relations and rarely predict whether they are activations or inhibitions. We hypothesize that transcriptional regulatory circuits can be inferred from fewer samples by (1) fully integrating information on TF binding, gene expression and regulatory region accessibility, (2) reducing data complexity and (3) using biology-based likelihood constraints to determine the global consistency between a candidate TF-gene relation and patterns of gene expression and region activation, as well as to qualify regulations as activations or inhibitions. RESULTS: We introduce Regulus, a method which computes TF-gene relations from gene expression, regulatory region activity and TF binding site data, together with the genomic locations of all entities. After gene expressions and region activities are aggregated into patterns, the data are integrated into an RDF (Resource Description Framework) endpoint. A dedicated SPARQL (SPARQL Protocol and RDF Query Language) query retrieves all potential relations between expressed TFs and genes involving active regulatory regions. These TF-region-gene relations are then filtered using biological likelihood constraints, which also qualify them as activations or inhibitions. Regulus provides signed relations consistent with public databases and, when applied to biological data, identifies both known and potential new regulators.
Regulus is devoted to context-specific transcriptional circuit inference in human settings where samples are scarce and cell populations are closely related, using discretization into patterns and likelihood reasoning to decipher the most robust regulatory relations.
Subjects
Gene Expression Regulation , Transcription Factors , Humans , Gene Expression Regulation/genetics , Transcription Factors/metabolism , Genomics/methods , Databases, Factual , Protein Binding , Gene Regulatory Networks/genetics
ABSTRACT
MOTIVATION: Molecular complexes play a major role in the regulation of biological pathways. The Biological Pathway Exchange format (BioPAX) facilitates the integration of data sources describing interactions, some of which involve complexes. The BioPAX specification explicitly prevents complexes from having any component that is another complex (unless this component is a black-box complex whose composition is unknown). However, we observed that the well-curated Reactome pathway database contains such recursive complexes of complexes. We propose reproducible and semantically rich SPARQL queries for identifying and fixing invalid complexes in BioPAX databases, and evaluate the consequences of fixing these nonconformities in the Reactome database. RESULTS: For the Homo sapiens version of Reactome, we identify 5833 recursively defined complexes out of the 14 987 complexes (39%). This situation is not specific to the human dataset, as all tested species of Reactome exhibit between 30% (Plasmodium falciparum) and 40% (Sus scrofa, Bos taurus, Canis familiaris, and Gallus gallus) of recursive complexes. As an additional consequence, the procedure also allows the detection of complex redundancies. Overall, this method improves the conformity and the automated analysis of the graph by repairing the topology of the complexes in the graph. This will allow further reasoning methods to be applied to more consistent data. AVAILABILITY AND IMPLEMENTATION: We provide a Jupyter notebook detailing the analysis at https://github.com/cjuigne/non_conformities_detection_biopax.
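The idea of detecting and repairing recursive complexes can be illustrated outside of SPARQL with a minimal Python sketch. It is not the authors' queries: the dictionary layout, the function names and the identifiers (`C1`, `P1`, `BB`) are hypothetical, a complex with an empty component list stands in for a black-box complex, and stoichiometry is ignored.

```python
def flatten(complex_id, components, path=()):
    """Return the leaf components of a complex, expanding nested complexes."""
    if complex_id in path:
        raise ValueError(f"cyclic complex definition at {complex_id}")
    leaves = set()
    for part in components[complex_id]:
        if components.get(part):
            # `part` is itself a complex with a known composition: expand it.
            leaves |= flatten(part, components, path + (complex_id,))
        else:
            # Protein, small molecule, or black-box complex: keep as a leaf.
            leaves.add(part)
    return leaves


def is_recursive(complex_id, components):
    """True if at least one direct component is a non-black-box complex."""
    return any(components.get(part) for part in components[complex_id])


def redundant_groups(components):
    """Group complexes that share the same flattened set of leaves."""
    groups = {}
    for cid, parts in components.items():
        if parts:  # skip black-box complexes
            groups.setdefault(frozenset(flatten(cid, components)), []).append(cid)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical toy data: C1 nests C2; BB is a black-box complex.
example = {
    "C1": ["P1", "C2"],
    "C2": ["P2", "P3"],
    "C3": ["P4", "BB"],
    "BB": [],
    "C4": ["P1", "P2", "P3"],
}
```

In this toy data only `C1` is recursive, and flattening it reveals that it duplicates `C4`, which mirrors how repairing the topology can expose redundant complexes.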
Subjects
Chickens , Semantic Web , Humans , Animals , Cattle , Dogs , Databases, Factual , Plasmodium falciparum
ABSTRACT
BACKGROUND: In life sciences, there has been a long-standing effort of standardization and integration of reference datasets and databases. Despite these efforts, data from many studies are provided in specific and non-standard formats. This hampers the capacity to reuse study data in other pipelines, the capacity to reuse pipeline results in other studies, and the capacity to enrich the data with additional information. The Regulatory Circuits project is one of the largest efforts for integrating human cell genomics data to predict tissue-specific transcription factor-gene interaction networks. In spite of its success, it exhibits the usual shortcomings limiting its update, its reuse (as a whole or partially), and its extension with new data samples. To address these limitations, the resource has previously been integrated in an RDF triplestore so that TF-gene interaction networks could be generated with two SPARQL queries. However, this triplestore did not store the computed networks and did not integrate metadata about tissues and samples, therefore limiting the reuse of this dataset. In particular, it does not make it possible to reuse only a portion of Regulatory Circuits if a study focuses on a subset of the tissues, nor to combine the samples described in the datasets with samples from other studies. Overall, these limitations call for the design of a complete, flexible and reusable representation of the Regulatory Circuits dataset based on Semantic Web technologies. RESULTS: We provide a modular RDF representation of the Regulatory Circuits, called Linked Extended Regulatory Circuits (LERC). It consists of (i) descriptions of biological and experimental context mapped to the reference databases, (ii) annotations about TF-gene interactions at the sample level for 808 samples, (iii) annotations about TF-gene interactions at the tissue level for 394 tissues, and (iv) metadata connecting the knowledge graphs cited above.
LERC is based on a modular organisation into 1,205 RDF named graphs for representing the biological data, the sample-specific and the tissue-specific networks, and the corresponding metadata. In total it contains 3,910,794,050 triples and is available as a SPARQL endpoint. CONCLUSION: The flexible and modular architecture of LERC supports biologically-relevant SPARQL queries. It allows an easy and fast querying of the resources related to the initial Regulatory Circuits datasets and facilitates its reuse in other studies. ASSOCIATED WEBSITE: https://regulatorycircuits-lod.genouest.org.
Subjects
Biological Science Disciplines , Animals , Databases, Factual , Humans , Life Cycle Stages , Metadata
ABSTRACT
MOTIVATION: Information on protein-protein interactions is collected in numerous primary databases, each with its own curation process. Several meta-databases aggregate primary databases to provide more exhaustive datasets. In addition to exhaustivity, aggregation contributes to reliability by providing an overview of the various studies and detection methods supporting an interaction. However, interactions listed in different primary databases are partly redundant because some publications reporting protein-protein interactions have been curated by multiple primary databases. Mere aggregation can thus introduce a bias if these redundancies are not identified and eliminated. To overcome this bias, meta-databases rely on the Molecular Interaction ontology that describes interaction detection methods, but they do not fully take advantage of the ontology's rich semantics, which leads to systematically overestimating interaction reproducibility. RESULTS: We propose a precise definition of explicit and implicit redundancy and show that both can be easily detected using Semantic Web technologies. We apply this process to a dataset from the Agile Protein Interactomes DataServer (APID) meta-database and show that while explicit redundancies were detected by the APID aggregation process, about 15% of APID entries are implicitly redundant and should not be taken into account when presenting confidence-related metrics. More than 90% of implicit redundancies result from the aggregation of distinct primary databases, whereas the remainder occur between entries of a single database. Finally, we build a 'reproducible interactome' with interactions that have been reproduced by multiple methods or publications. The size of the reproducible interactome is drastically reduced by removing redundancies for both yeast (-59%) and human (-56%), and we show that this is largely due to implicit redundancies.
AVAILABILITY AND IMPLEMENTATION: Software, data and results are available at https://gitlab.com/nnet56/reproducible-interactome, https://reproducible-interactome.genouest.org/, Zenodo (https://doi.org/10.5281/zenodo.5595037) and NDEx (https://doi.org/10.18119/N94302 and https://doi.org/10.18119/N97S4D). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
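The abstract does not spell out its definitions, so the following Python sketch rests on an assumed reading: two entries for the same protein pair and the same publication are treated as explicitly redundant when they cite the same detection method, and as implicitly redundant when one method is an ancestor of the other in the method hierarchy. The hierarchy, method labels and entry layout are hypothetical and are not taken from the Molecular Interaction ontology or from APID.

```python
# Hypothetical child -> parent method hierarchy (illustrative labels only).
PARENT = {
    "specific assay": "general assay",
    "general assay": "any detection method",
}


def ancestors(method):
    """All ancestors of a method in the toy hierarchy."""
    found = set()
    while method in PARENT:
        method = PARENT[method]
        found.add(method)
    return found


def classify(entry_1, entry_2):
    """Compare two entries of the form (protein_a, protein_b, method, publication)."""
    same_pair = frozenset(entry_1[:2]) == frozenset(entry_2[:2])
    same_publication = entry_1[3] == entry_2[3]
    if not (same_pair and same_publication):
        return "independent"
    method_1, method_2 = entry_1[2], entry_2[2]
    if method_1 == method_2:
        return "explicit"
    if method_1 in ancestors(method_2) or method_2 in ancestors(method_1):
        return "implicit"  # assumed definition: same evidence, coarser label
    return "independent"
```

Under this reading, counting only the "independent" pairs of entries avoids crediting an interaction with two supporting experiments when a single experiment was curated twice at different levels of detail.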
Subjects
Protein Interaction Mapping , Semantics , Software , Humans , Databases, Protein , Reproducibility of Results , Protein Interaction Mapping/methods
ABSTRACT
Omics technologies hold great promise for improving our understanding of diseases. The integration and interpretation of such data pose major challenges, calling for adequate knowledge models. Disease maps provide curated knowledge about disorders' pathophysiology at the molecular level adapted to omics measurements. However, the expressiveness of disease maps could be increased to help avoid ambiguities and misinterpretations and to reinforce their interoperability with other knowledge resources. Ontologies are an adequate framework to overcome this limitation, through their axiomatic definitions and logical reasoning properties. We introduce the Disease Map Ontology (DMO), an ontological upper model based on systems biology terms. We then propose to apply DMO to Alzheimer's disease (AD). Specifically, we use it to drive the conversion of AlzPathway, a disease map devoted to AD, into a formal ontology: Alzheimer DMO (ADMO). We demonstrate that it allows one to deal with issues related to redundancy, naming, consistency, process classification and pathway relationships. Furthermore, we show that it can store and manage multi-omics data. Finally, we expand the model using elements from other resources, such as clinical features contained in the AD Ontology, resulting in an enriched model called ADMO-plus. The current versions of DMO, ADMO and ADMO-plus are freely available at http://bioportal.bioontology.org/ontologies/ADMO.
Subjects
Alzheimer Disease , Biological Ontologies , Alzheimer Disease/genetics , Humans , Knowledge , Systems Biology
ABSTRACT
BACKGROUND & AIMS: Activation of hepatic stellate cells (HSC) is a critical process involved in liver fibrosis. Several miRNAs are implicated in gene regulation during this process but their exact and respective contributions are still incompletely understood. Here we propose an integrative approach to miRNA regulatory networks to predict new targets. METHODS: miRNA regulatory networks in activated HSCs were built using lists of validated miRNAs and the CyTargetLinker tool. The resulting graphs were filtered according to public transcriptomic data and the reduced graphs were analysed through GO annotation. A miRNA network regulating the expression of TIMP3 was further studied in human liver samples, isolated hepatic cells and a mouse model of liver fibrosis. RESULTS: Within the up-regulated miRNAs, we identified a subnetwork of five miRNAs (miR-21-5p, miR-222-3p, miR-221-3p, miR-181b-5p and miR-17-5p) that target TIMP3. We demonstrated that TIMP3 expression is inversely associated with inflammatory activity and IL1-β expression in vivo. We further showed that IL1-β inhibits TIMP3 expression in HSC-derived LX-2 cells. Using data from The Cancer Genome Atlas (TCGA), we showed that, in hepatocellular carcinoma (HCC), TIMP3 expression is associated with survival (P < .001), while miR-221 (P < .05), miR-222 (P < .01) and miR-181b (P < .01) are markers for a poor prognosis. CONCLUSIONS: Several miRNAs targeting TIMP3 are up-regulated in activated HSCs and down-regulation of TIMP3 expression is associated with inflammatory activity in liver fibrosis and poor prognosis in HCC. The regulatory network including specific miRNAs and TIMP3 is therefore central to the evolution of chronic liver disease.
Subjects
Carcinoma, Hepatocellular , Liver Neoplasms , MicroRNAs , Carcinoma, Hepatocellular/genetics , Hepatic Stellate Cells , Humans , Liver Cirrhosis/genetics , Liver Neoplasms/genetics , MicroRNAs/genetics , Tissue Inhibitor of Metalloproteinase-3/genetics
ABSTRACT
In order to predict the behavior of a biological system, one common approach is to perform a simulation on a dynamic model. Boolean networks make it possible to analyze the qualitative aspects of the model by identifying its steady states and attractors. Each of them, when possible, is associated with a phenotype which conveys a biological interpretation. Phenotypes are characterized by their signatures, provided by domain experts. The number of steady states tends to increase with the network size and the number of simulation conditions, which makes the biological interpretation difficult. As a first step, we explore the use of Formal Concept Analysis (FCA) as a symbolic bi-clustering technique to classify and sort the steady states of a Boolean network according to biological signatures based on the hierarchy of the roles the network components play in the phenotypes. FCA generates a lattice structure describing the dependencies between proteins in the signature and steady states of the Boolean network. We use this lattice (i) to enrich the biological signatures according to the dependencies carried by the network dynamics, (ii) to identify variants of the phenotypes and (iii) to characterize hybrid phenotypes. We applied our approach to a T helper lymphocyte (Th) differentiation network with a set of signatures corresponding to the sub-types of Th. Our method generated the same classification as a manual analysis performed by experts in the field, and was also able to work under extended simulation conditions. This led to the identification and prediction of a new hybrid sub-type later confirmed by the literature.
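To make the lattice construction concrete, here is a minimal Python sketch of Formal Concept Analysis on a tiny binary context in which objects play the role of steady states and attributes the role of active components. The context (`s1`, `A`, ...) is hypothetical, and the enumeration is a naive one that only suits small examples; it is not the implementation used in the study.

```python
from itertools import combinations


def shared_attributes(objects, context, all_attributes):
    """Attributes common to every object in `objects` (all of them if empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_with(attributes, context):
    """Objects that carry every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset."""
    all_attributes = set().union(*context.values())
    concepts = set()
    objects = sorted(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = shared_attributes(subset, context, all_attributes)
            extent = objects_with(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical steady states described by the components active in each.
toy_context = {
    "s1": {"A", "B"},
    "s2": {"A", "C"},
    "s3": {"A", "B", "C"},
}
lattice = formal_concepts(toy_context)
```

In this toy context `s3` carries both the `{A, B}` and the `{A, C}` profiles, so it sits below both concepts in the lattice, which is the kind of position a hybrid phenotype would occupy.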
Subjects
Gene Regulatory Networks , Phenotype , Animals , Cell Differentiation , Computer Simulation , Humans , Models, Biological , Models, Genetic , T-Lymphocytes, Helper-Inducer/classification
ABSTRACT
Genome-scale metabolic models have become the tool of choice for the global analysis of microorganism metabolism, and their reconstruction has attained high standards of quality and reliability. Improvements in this area have been accompanied by the development of some major platforms and databases, and an explosion of individual bioinformatics methods. Consequently, many recent models result from "à la carte" pipelines, combining the use of platforms, individual tools and biological expertise to enhance the quality of the reconstruction. Although very useful, introducing heterogeneous tools that hardly interact with each other causes a loss of traceability and reproducibility in the reconstruction process. This represents a real obstacle, especially when considering less studied species whose metabolic reconstruction can greatly benefit from comparison with good-quality models of related organisms. This work proposes an adaptable workspace, AuReMe, for sustainable reconstructions or improvements of genome-scale metabolic models involving personalized pipelines. At each step, relevant information related to the modifications brought to the model by a method is stored. This ensures that the process is reproducible and documented regardless of the combination of tools used. Additionally, the workspace establishes a way to browse metabolic models and their metadata through the automatic generation of ad hoc local wikis dedicated to monitoring and facilitating the process of reconstruction. AuReMe supports exploration and semantic query based on RDF databases. We illustrate how this workspace allowed handling, in an integrated way, the metabolic reconstructions of non-model organisms such as an extremophile bacterium or eukaryotic algae. Among relevant applications, the latter reconstruction led to putative evolutionary insights into a metabolic pathway.
Subjects
Databases, Factual , Genomics , Information Storage and Retrieval , Internet , Metabolic Networks and Pathways/genetics , Antioxidants/metabolism , Genomics/methods , Genomics/standards , Information Storage and Retrieval/methods , Information Storage and Retrieval/standards , Microalgae/genetics , Microalgae/metabolism , Models, Theoretical , Reproducibility of Results
ABSTRACT
The number of patients with complications associated with chronic diseases increases with the ageing population. In particular, complex chronic wounds raise the re-admission rate in hospitals. In this context, the implementation of a telemedicine application in Basse-Normandie, France, contributes to reducing hospital stays and transport. This application requires a new collaboration among general practitioners, private duty nurses and the hospital staff. However, the main constraint mentioned by the users of this system is the lack of interoperability between the information system of this application and the various partners' information systems. To improve medical data exchanges, the authors propose a new implementation based on the introduction of interoperable clinical documents and a digital document repository for managing the sharing of the documents between the telemedicine application users. They then show that this technical solution is suitable for any telemedicine application and any document sharing system in a healthcare facility or network.
ABSTRACT
The number of patients that benefit from remote monitoring of cardiac implantable electronic devices, such as pacemakers and defibrillators, is growing rapidly. Consequently, the huge number of alerts that are generated and transmitted to the physicians represents a challenge to handle. We have developed a system based on a formal ontology that integrates the alert information and the patient data extracted from the electronic health record in order to better classify the importance of alerts. A pilot study was conducted on atrial fibrillation alerts. We show some examples of alert processing. The results suggest that this approach has the potential to significantly reduce the alert burden in telecardiology. The methods may be extended to other types of connected devices.
Subjects
Atrial Fibrillation/diagnosis , Clinical Alarms , Decision Support Systems, Clinical/organization & administration , Electrocardiography, Ambulatory/methods , Electronic Health Records/organization & administration , Telemedicine/methods , Atrial Fibrillation/prevention & control , Biological Ontologies , Defibrillators, Implantable , Diagnosis, Computer-Assisted/methods , Humans , Natural Language Processing , Pacemaker, Artificial , Pilot Projects , Reproducibility of Results , Sensitivity and Specificity , Therapy, Computer-Assisted/methods
ABSTRACT
AIMS: Remote monitoring of cardiac implantable electronic devices is a growing standard; yet, remote follow-up and management of alerts represent a time-consuming task for physicians or trained staff. This study evaluates an automatic mechanism based on artificial intelligence tools to filter atrial fibrillation (AF) alerts based on their medical significance. METHODS AND RESULTS: We evaluated this method on alerts for AF episodes that occurred in 60 pacemaker recipients. The AKENATON prototype workflow includes two steps: natural language-processing algorithms abstract the patient health record to a digital version, then a knowledge-based algorithm based on an applied formal ontology calculates the CHA2DS2-VASc score and evaluates the anticoagulation status of the patient. Each alert is then automatically classified by importance from low to critical, by mimicking medical reasoning. The final classification was compared with human expert analysis by two physicians. A total of 1783 alerts about AF episodes >5 min in 60 patients were processed. Of these, 1749 alerts (98%) were adequately classified and there was no underestimation of alert importance in the remaining 34 misclassified alerts. CONCLUSION: This work demonstrates the ability of a pilot system to classify alerts and improve personalized remote monitoring of patients. In particular, our method allows integration of patient medical history with device alert notifications, which is useful from both medical and resource-management perspectives. The system was able to automatically classify the importance of 1783 AF alerts in 60 patients, which resulted in an 84% reduction in notification workload, while preserving patient safety.
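For readers unfamiliar with the CHA2DS2-VASc score mentioned above, the following Python sketch applies the standard published point values. It is only an illustration of the score itself: the function name and parameters are hypothetical, and the AKENATON prototype derives the score through ontology-based reasoning over the health record rather than through a function like this one.

```python
def cha2ds2_vasc(age, female, heart_failure=False, hypertension=False,
                 diabetes=False, stroke_or_tia=False, vascular_disease=False):
    """Return the CHA2DS2-VASc stroke risk score (0 to 9)."""
    score = 0
    score += 1 if heart_failure else 0      # C: congestive heart failure
    score += 1 if hypertension else 0       # H: hypertension
    if age >= 75:                           # A2: age 75 or older
        score += 2
    elif age >= 65:                         # A: age 65 to 74
        score += 1
    score += 1 if diabetes else 0           # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0      # S2: prior stroke, TIA or thromboembolism
    score += 1 if vascular_disease else 0   # VA: vascular disease
    score += 1 if female else 0             # Sc: sex category (female)
    return score
```

For instance, an 80-year-old woman with hypertension scores 4 (2 for age, 1 for hypertension, 1 for sex category).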
Subjects
Atrial Fibrillation/diagnosis , Electrocardiography/instrumentation , Heart Conduction System/physiopathology , Heart Rate , Pacemaker, Artificial , Telemetry/instrumentation , Action Potentials , Algorithms , Anticoagulants/therapeutic use , Artificial Intelligence , Atrial Fibrillation/physiopathology , Atrial Fibrillation/therapy , Automation , Decision Support Techniques , France , Humans , Pilot Projects , Predictive Value of Tests , Reproducibility of Results , Retrospective Studies , Risk Assessment , Signal Processing, Computer-Assisted , Workflow , Workload
ABSTRACT
BACKGROUND: The analysis of gene annotations referencing back to Gene Ontology plays an important role in the interpretation of the results of high-throughput experiments. This analysis typically involves semantic similarity and particularity measures that quantify the importance of the Gene Ontology annotations. However, there is currently no sound method supporting the interpretation of the similarity and particularity values in order to determine whether two genes are similar or whether one gene has some significant particular function. Interpretation is frequently based either on an implicit threshold, or an arbitrary one (typically 0.5). Here we investigate a method for determining thresholds supporting the interpretation of the results of a semantic comparison. RESULTS: We propose a method for determining the optimal similarity threshold by minimizing the proportions of false-positive and false-negative similarity matches. We compared the distributions of the similarity values of pairs of similar genes and pairs of non-similar genes. These comparisons were performed separately for all three branches of the Gene Ontology. In all situations, we found overlap between the similar and the non-similar distributions, indicating that some similar genes had a similarity value lower than the similarity value of some non-similar genes. We then extended this method to the semantic particularity measure and to a similarity measure applied to the ChEBI ontology. Thresholds were evaluated over the whole HomoloGene database. For each group of homologous genes, we computed all the similarity and particularity values between pairs of genes. Finally, we focused on the PPAR multigene family to show that the similarity and particularity patterns obtained with our thresholds were better at discriminating orthologs and paralogs than those obtained using default thresholds. CONCLUSION: We developed a method for determining optimal semantic similarity and particularity thresholds.
We applied this method to the GO and ChEBI ontologies. Qualitative analysis using the thresholds on the PPAR multigene family yielded biologically relevant patterns.
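The threshold search can be sketched in a few lines of Python. The abstract does not state how the two error proportions are combined, so this sketch assumes the objective is their plain sum; the function name and the similarity values are hypothetical.

```python
def optimal_threshold(similar, non_similar):
    """Pick the threshold minimizing FN rate + FP rate (assumed objective).

    A pair is called "similar" when its value is >= the threshold.
    Only observed values are tried as candidate thresholds.
    """
    best_threshold, best_cost = None, float("inf")
    for threshold in sorted(set(similar) | set(non_similar)):
        # Similar pairs falling below the threshold are false negatives.
        fn_rate = sum(v < threshold for v in similar) / len(similar)
        # Non-similar pairs reaching the threshold are false positives.
        fp_rate = sum(v >= threshold for v in non_similar) / len(non_similar)
        cost = fn_rate + fp_rate
        if cost < best_cost:  # strict: keep the lowest threshold on ties
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical similarity values with overlapping distributions.
similar_pairs = [0.9, 0.8, 0.7, 0.4]
non_similar_pairs = [0.1, 0.2, 0.3, 0.5]
threshold, cost = optimal_threshold(similar_pairs, non_similar_pairs)
```

Because the two toy distributions overlap (0.4 is a similar pair, 0.5 a non-similar one), no threshold reaches zero error, which is the situation the abstract reports for all three GO branches.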
Subjects
Metabolic Networks and Pathways/genetics , Algorithms , Computational Biology/methods , Gene Ontology , Humans , Molecular Sequence Annotation/methods , Multigene Family/genetics , Peroxisome Proliferator-Activated Receptors/genetics , Semantics
ABSTRACT
OBJECTIVE: New technologies improve modern medicine, but may result in unwanted consequences. Some occur due to inadequate human-computer interactions (HCI). To assess these consequences, an investigation model was developed to facilitate the planning, implementation and documentation of studies of HCI in surgery. METHODS AND MATERIAL: The investigation model was formalized in Unified Modeling Language and implemented as an ontology. Four different top-level ontologies were compared: Object-Centered High-level Reference, Basic Formal Ontology, General Formal Ontology (GFO) and Descriptive Ontology for Linguistic and Cognitive Engineering, according to the three major requirements of the investigation model: the domain-specific view, the experimental scenario and the representation of fundamental relations. Furthermore, this article emphasizes the distinction between "information model" and "model of meaning" and shows the advantages of implementing the model in an ontology rather than in a database. RESULTS: The results of the comparison show that GFO fits the defined requirements adequately: the domain-specific view and the fundamental relations can be implemented directly, and only the representation of the experimental scenario requires minor extensions. The other candidates require wide-ranging extensions concerning at least one of the major implementation requirements. Therefore, the GFO was selected to realize an appropriate implementation of the developed investigation model. The ensuing development considered the concrete implementation of further model aspects and entities: sub-domains, space and time, processes, properties, relations and functions. CONCLUSIONS: The investigation model and its ontological implementation provide a modular guideline for study planning, implementation and documentation within the area of HCI research in surgery.
This guideline helps to navigate through the whole study process in the form of a kind of standard or good clinical practice, based on the foundational frameworks involved. Furthermore, it makes it possible to acquire a structured description of the assessment methods applied within a certain surgical domain and to consider this information for one's own study design or to compare different studies. The investigation model and the corresponding ontology can be used further to create new knowledge bases of HCI assessment in surgery.
Subjects
Research Design , Surgery, Computer-Assisted , User-Computer Interface , Automation , Humans , Knowledge Bases , Models, Theoretical
ABSTRACT
Brown algae (stramenopiles) are key players in intertidal ecosystems, and represent a source of biomass with several industrial applications. Ectocarpus siliculosus is a model to study the biology of these organisms. Its genome has been sequenced and a number of post-genomic tools have been implemented. Based on this knowledge, we report the reconstruction and analysis of a genome-scale metabolic network for E. siliculosus, EctoGEM (http://ectogem.irisa.fr). This atlas of metabolic pathways consists of 1866 reactions and 2020 metabolites, and its construction was performed by means of an integrative computational approach for identifying metabolic pathways, gap filling and manual refinement. The capability of the network to produce biomass was validated by flux balance analysis. EctoGEM enabled the reannotation of 56 genes within the E. siliculosus genome, and shed light on the evolution of metabolic processes. For example, E. siliculosus has the potential to produce phenylalanine and tyrosine from prephenate and arogenate, but does not possess a phenylalanine hydroxylase, as is found in other stramenopiles. It also possesses the complete eukaryote molybdenum co-factor biosynthesis pathway, as well as a second molybdopterin synthase that was most likely acquired via horizontal gene transfer from cyanobacteria by a common ancestor of stramenopiles. EctoGEM represents an evolving community resource to gain deeper understanding of the biology of brown algae and the diversification of physiological processes. The integrative computational method applied for its reconstruction will be valuable to set up similar approaches for other organisms distant from biological benchmark models.
Subjects
Genome, Plant , Phaeophyceae/physiology , Molecular Sequence Data , Phaeophyceae/genetics , Phaeophyceae/metabolism
ABSTRACT
BACKGROUND: Meat quality depends on skeletal muscle structure and metabolic properties. While most studies carried out on pigs focus on the Longissimus muscle (LM) for fresh meat consumption, Semimembranosus (SM) is also of interest because of its importance for cooked ham production. Even though both muscles are classified as glycolytic muscles, they exhibit dissimilar myofiber composition and metabolic characteristics. The comparison of LM and SM transcriptome profiles undertaken in this study may thus clarify the biological events underlying their phenotypic differences, which might influence several meat quality traits. METHODOLOGY/PRINCIPAL FINDINGS: Muscular transcriptome analyses were performed using a custom pig muscle microarray: the 15 K Genmascqchip. A total of 3823 genes were differentially expressed between the two muscles (Benjamini-Hochberg adjusted P value ≤0.05), of which 1690 and 2133 were overrepresented in LM and SM respectively. The microarray data were validated using the expression level of seven differentially expressed genes quantified by real-time RT-PCR. A set of 1047 differentially expressed genes with a muscle fold change ratio above 1.5 was used for functional characterization. Functional annotation emphasized five main clusters associated with transcriptome muscle differences. These five clusters were related to energy metabolism, cell cycle, gene expression, anatomical structure development and signal transduction/immune response. CONCLUSIONS/SIGNIFICANCE: This study revealed strong transcriptome differences between LM and SM. These results suggest that skeletal muscle discrepancies might arise essentially from different post-natal myogenic activities.
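The Benjamini-Hochberg adjustment used for the differential expression cut-off can be sketched as follows. This is a plain Python illustration of the standard procedure with hypothetical p-values, not the analysis pipeline of the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adjusted) if p <= 0.05]
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values non-decreasing with rank, so a gene is retained when its adjusted value is at most 0.05.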
Subjects
Muscle, Skeletal/metabolism , Sus scrofa/genetics , Animals , Gene Expression Profiling , Meat , Sus scrofa/metabolism , Swine , Tissue Array Analysis , Transcriptome
ABSTRACT
BACKGROUND: Ensuring that all cancer patients have access to the appropriate treatment within an appropriate time is a strategic priority in many countries. There is in particular a need to describe and analyse cancer care trajectories and to produce waiting time indicators. We developed an algorithm for extracting temporally represented care trajectories from coded information collected routinely by the general cancer Registry in Poitou-Charentes region, France. The present work aimed to assess the performance of this algorithm on real-life patient data in the setting of non-metastatic breast cancer, using measures of similarity. METHODS: Care trajectories were modeled as ordered dated events aggregated into states, the granularity of which was defined from standard care guidelines. The algorithm generates each state from the aggregation over a period of tracer events characterised on the basis of diagnoses and medical procedures. The sequences are presented in simple form showing presence and order of the states, and in an extended form that integrates the duration of the states. The similarity of the sequences, which are represented in the form of chains of characters, was calculated using a generalised Levenshtein distance. RESULTS: The evaluation was performed on a sample of 159 female patients whose itineraries were also calculated manually from medical records using the same aggregation rules and dating system as the algorithm. Ninety-eight per cent of the trajectories were correctly reconstructed with respect to the ordering of states. When the duration of states was taken into account, 94% of the trajectories matched reality within three days. Dissimilarities between sequences were mainly due to the absence of certain pathology reports and to coding anomalies in hospitalisation data. 
CONCLUSIONS: These results show the ability of an integrated regional information system to formalise care trajectories and automatically produce indicators of time to treatment initiation, which are of interest for planning cancer care. The next step will consist in evaluating this approach and extending it to more complex trajectories (metastasis, relapse) and to other cancer sites.
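The sequence comparison described in METHODS can be illustrated with a minimal sketch of a Levenshtein distance with configurable edit costs. This is not the authors' implementation: the function name, the cost parameters and the single-letter state codes (D = diagnosis, S = surgery, C = chemotherapy, R = radiotherapy) are assumptions made for illustration only.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance between two state sequences with configurable costs.

    a, b: strings in which each character encodes one care state
    (hypothetical coding, not taken from the abstract).
    """
    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        # curr[0] = cost of deleting the first i states of a
        curr = [i * del_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                          # delete x
                curr[j - 1] + ins_cost,                      # insert y
                prev[j - 1] + (0.0 if x == y else sub_cost), # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical trajectories: algorithm output vs. manual reconstruction.
automatic = "DSCR"
manual = "DSR"
print(weighted_levenshtein(automatic, manual))
```

A distance of zero means the two trajectories list the same states in the same order. One possible way to handle the extended form would be to repeat each state code once per time unit before comparing, but that encoding is also an assumption.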
Subjects
Algorithms , Breast Neoplasms/therapy , Electronic Health Records , Health Information Systems/standards , Randomized Controlled Trials as Topic/standards , Registries , Adult , Aged , Aged, 80 and over , Breast Neoplasms/epidemiology , Female , France , Humans , Middle Aged , Time Factors
ABSTRACT
BACKGROUND: Genetic and genomic data analyses output large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing the molecular functions, biological processes and cellular components of genes. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. RESULTS: We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. The informativeness of a GO term can be computed using either its information content, based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. CONCLUSION: Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies.
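As a rough illustration of frequency-based informativeness, the sketch below computes the information content IC(t) = -log p(t) of each term from annotation counts, then uses it in a toy particularity score. The abstract does not give the authors' formula, so `toy_particularity` is a simplified stand-in that ignores the ontology hierarchy; the term names and counts are hypothetical.

```python
import math


def information_content(annotation_counts, root):
    """IC(t) = -log(count(t) / count(root)).

    annotation_counts: term -> number of annotations to the term or any of
    its descendants (assumed to be propagated already), so the root has
    probability 1 and an information content of 0.
    """
    total = annotation_counts[root]
    return {t: -math.log(c / total) for t, c in annotation_counts.items()}


def toy_particularity(sg1, sg2, ic):
    """Share of the information in sg1 carried by terms absent from sg2.

    Illustrative stand-in only, NOT the measure defined in the paper.
    """
    total = sum(ic[t] for t in sg1)
    if total == 0:
        return 0.0
    return sum(ic[t] for t in sg1 - sg2) / total


# Hypothetical corpus counts for four placeholder terms.
counts = {"root": 100, "A": 50, "B": 10, "C": 1}
ic = information_content(counts, "root")
print(toy_particularity({"A", "B"}, {"A"}, ic))
```

Rare terms weigh more than frequent ones, so a set whose exclusive terms are specific scores close to 1, while a set fully contained in the other scores 0.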
Subjects
Databases, Genetic , Gene Ontology , Genes , Semantics , Animals , Aquaporins/metabolism , Biological Transport , Genes, Fungal , Humans , Karyopherins/genetics , Rats , Saccharomyces cerevisiae/genetics , Sequence Homology, Nucleic Acid , Tryptophan/metabolism
ABSTRACT
Ontologies support automatic sharing, combination and analysis of life sciences data. They undergo regular curation and enrichment. We studied the impact of an ontology's evolution on its structural complexity. As a case study we used the sixty monthly releases between January 2008 and December 2012 of the Gene Ontology and its three independent branches, i.e. biological processes (BP), cellular components (CC) and molecular functions (MF). For each case, we measured complexity by computing metrics related to size, node connectivity and hierarchical structure. The number of classes and relations increased monotonically for each branch, with different growth rates. BP and CC had similar connectivity, superior to that of MF. Connectivity increased monotonically for BP, decreased for CC and remained stable for MF, with a marked increase for the three branches in November and December 2012. Hierarchy-related measures showed that CC and MF had similar proportions of leaves, average depths and average heights. BP had a lower proportion of leaves, and a higher average depth and average height. For BP and MF, the late 2012 increase of connectivity resulted in an increase of the average depth and average height and a decrease of the proportion of leaves, indicating that a major enrichment effort of the intermediate-level hierarchy occurred. The variation of the number of classes and relations in an ontology does not provide enough information about the evolution of its complexity. However, connectivity and hierarchy-related metrics revealed different patterns of values as well as of evolution for the three branches of the Gene Ontology. CC was similar to BP in terms of connectivity, and similar to MF in terms of hierarchy. Overall, BP complexity increased, CC was refined with the addition of leaves providing a finer level of annotations but slightly decreasing its complexity, and MF complexity remained stable.
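The kinds of metrics mentioned above can be sketched on a toy hierarchy. The definitions used here are assumptions, since the abstract does not state them: connectivity is taken as relations per class, depth as the longest path to a root, and height as the longest path to a leaf. The function name and the five-class example are hypothetical.

```python
from functools import lru_cache


def structure_metrics(parents):
    """Size, connectivity and hierarchy metrics for a small class DAG.

    parents: class -> list of direct parent classes (assumed acyclic).
    """
    classes = set(parents) | {p for ps in parents.values() for p in ps}
    children = {c: [] for c in classes}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)

    @lru_cache(maxsize=None)
    def depth(c):
        # Longest path from c up to a class without parents.
        ps = parents.get(c, [])
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    @lru_cache(maxsize=None)
    def height(c):
        # Longest path from c down to a class without children.
        cs = children[c]
        return 0 if not cs else 1 + max(height(k) for k in cs)

    n = len(classes)
    n_rel = sum(len(ps) for ps in parents.values())
    return {
        "classes": n,
        "relations": n_rel,
        "connectivity": n_rel / n,
        "leaf_proportion": sum(1 for c in classes if not children[c]) / n,
        "avg_depth": sum(depth(c) for c in classes) / n,
        "avg_height": sum(height(c) for c in classes) / n,
    }


# Hypothetical hierarchy: c has two parents, d is the only leaf.
toy = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"], "d": ["c"]}
print(structure_metrics(toy))
```

Computing these values for each release would give one time series per metric, which is the kind of comparison the study describes. The recursive helpers suit small examples; a full ontology would call for an iterative traversal.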
Subjects
Computational Biology/history , Gene Ontology/trends , Vocabulary, Controlled/history , Gene Ontology/statistics & numerical data , History, 21st Century , Humans , Time Factors
ABSTRACT
BACKGROUND: Clinical trials are important for patients, for researchers and for companies. One of the major bottlenecks is patient recruitment. This task requires matching a large volume of information about the patient with numerous eligibility criteria, in a logically complex combination. Moreover, some of the patient information necessary to determine the status of the eligibility criteria may not be available at the time of pre-screening. RESULTS: We showed that the classic approach based on negation as failure over-estimates rejection when confronted with partially known information about the eligibility criteria, because it ignores the distinction between trials for which patient eligibility should be rejected and trials for which patient eligibility cannot be asserted. We also showed that 58.64% of the values were unknown in the 286 prostate cancer cases examined during the weekly urology multidisciplinary meetings at Rennes' university hospital between October 2008 and March 2009. We propose an OWL design pattern for modeling eligibility criteria based on the open world assumption to address the missing information problem. We validated our model on a fictitious clinical trial and evaluated it on two real clinical trials. Our approach successfully distinguished clinical trials for which the patient is eligible, clinical trials for which we know that the patient is not eligible and clinical trials for which the patient may be eligible provided that further pieces of information (which we can identify) can be obtained. CONCLUSIONS: OWL-based reasoning under the open world assumption provides an adequate framework for distinguishing those patients who can confidently be rejected from those whose status cannot be determined. The expected benefits are a reduction of the workload of the physicians and higher efficiency, by allowing them to focus on the patients whose eligibility actually requires expertise.
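The difference between the two reasoning styles can be illustrated with a small three-valued check. This is a plain Python stand-in for the distinction, not the OWL design pattern proposed in the paper; the function names and criterion names are hypothetical.

```python
def prescreen_open_world(patient, criteria):
    """Three-way pre-screening that keeps unknown values unknown.

    criteria: criterion name -> required truth value.
    patient:  criterion name -> True/False; a missing key means "unknown".
    Returns (status, list of missing criteria).
    """
    missing = []
    for name, required in criteria.items():
        value = patient.get(name)
        if value is None:
            missing.append(name)        # cannot be asserted either way
        elif value != required:
            return "not eligible", []   # known violation: safe to reject
    if missing:
        return "potentially eligible", missing
    return "eligible", []


def prescreen_negation_as_failure(patient, criteria):
    """Classic approach: anything not known to be true is taken as false."""
    for name, required in criteria.items():
        if (patient.get(name) is True) != required:
            return "not eligible"
    return "eligible"


# Hypothetical trial and patient record with one missing value.
criteria = {"age_over_18": True, "prior_chemotherapy": False, "psa_below_threshold": True}
patient = {"age_over_18": True, "prior_chemotherapy": False}

print(prescreen_open_world(patient, criteria))
print(prescreen_negation_as_failure(patient, criteria))
```

With the missing value, the open-world check reports which piece of information is still needed, whereas the negation-as-failure check rejects the patient outright.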
ABSTRACT
OBJECTIVE: Biomedical research increasingly relies on the integration of information from multiple heterogeneous data sources. Although the structural and terminological aspects of interoperability are interdependent and rely on a common set of requirements, current efforts typically address them in isolation. We propose a unified ontology-based knowledge framework to facilitate interoperability between heterogeneous sources, and investigate whether using the LexEVS terminology server is a viable implementation method. MATERIALS AND METHODS: We developed a framework based on an ontology, the general information model (GIM), to unify structural models and terminologies, together with relevant mapping sets. This allowed uniform access to these resources within LexEVS to facilitate interoperability by various components and data sources from implementing architectures. RESULTS: Our unified framework has been tested in the context of the EU Framework Program 7 TRANSFoRm project, where it was used to achieve data integration in a retrospective diabetes cohort study. The GIM was successfully instantiated in TRANSFoRm as the clinical data integration model, and the necessary mappings were created to support effective information retrieval for software tools in the project. CONCLUSIONS: We present a novel, unifying approach to address interoperability challenges in heterogeneous data sources, by representing structural and semantic models in one framework. Systems using this architecture can rely solely on the GIM, which abstracts over both the structure and the coding. Information models, terminologies and mappings are all stored in LexEVS and can be accessed in a uniform manner (implementing the HL7 CTS2 service functional model). The system is flexible and should reduce the effort needed from data source personnel for implementing and managing the integration.