ABSTRACT
The objective of the INTEGRATE project (http://www.fp7-integrate.eu/), which recently concluded successfully, was the development of innovative biomedical applications focused on streamlining the execution of clinical research, enabling multidisciplinary collaboration, managing and sharing multi-level heterogeneous datasets at large scale, and developing new methodologies and predictive multi-scale models in cancer. In this paper, we present how the INTEGRATE consortium has approached important challenges such as the integration of multi-scale biomedical data in the context of post-genomic clinical trials, the development of predictive models, and the implementation of tools to facilitate the efficient execution of post-genomic multi-centric clinical trials in breast cancer. Furthermore, we provide a number of key "lessons learned" during the process and give directions for future research and development.
Subject(s)
Biomedical Research , Database Management Systems , Genomics , Breast Neoplasms/genetics , Clinical Trials as Topic , Computational Biology , Databases, Factual , Humans
ABSTRACT
Laboratory data must be interoperable so that the results of a lab test can be accurately compared between healthcare organizations. To achieve this, terminologies such as LOINC (Logical Observation Identifiers, Names and Codes) provide unique identification codes for laboratory tests. Once standardized, the numeric results of laboratory tests can be aggregated and represented in histograms. Due to the characteristics of Real World Data (RWD), outliers and abnormal values are common, but such cases should be treated as exceptions and excluded from analysis. The proposed work analyses two methods capable of automating the selection of histogram limits to sanitize the generated lab test result distributions, Tukey's box-plot method and a "Distance to Density" approach, within the TriNetX Real World Data Network. The limits generated from clinical RWD are generally wider for Tukey's method and narrower for the second method, and both depend heavily on the values chosen for the algorithms' parameters.
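Tukey's box-plot method mentioned in the abstract has a standard definition; a minimal sketch of using it to pick histogram limits follows (the multiplier `k=1.5` is the conventional default, and the abstract notes results depend strongly on such parameters; the "Distance to Density" approach is not specified in enough detail to sketch here):

```python
import statistics

def tukey_limits(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional default; the abstract notes the resulting
    limits depend strongly on this parameter choice.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def sanitize(values, k=1.5):
    """Keep only results inside the Tukey fences before histogramming."""
    lo, hi = tukey_limits(values, k)
    return [v for v in values if lo <= v <= hi]
```

For example, a single grossly abnormal lab value of 1000 among results spread over 1-100 falls outside the fences and is dropped before the histogram is built.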
Subject(s)
Laboratories , Logical Observation Identifiers Names and Codes
ABSTRACT
Objective: This article describes a scalable, performant, sustainable global network of electronic health record data for biomedical and clinical research. Materials and Methods: TriNetX has created a technology platform characterized by a conservative security and governance model that facilitates collaboration and cooperation between industry participants, such as pharmaceutical companies and contract research organizations, and academic and community-based healthcare organizations (HCOs). HCOs participate on the network in return for access to a suite of analytics capabilities, large networks of de-identified data, and more sponsored trial opportunities. Industry participants provide the financial resources to support, expand, and improve the technology platform in return for access to network data, which provides increased efficiencies in clinical trial design and deployment. Results: TriNetX is a growing global network, expanding from 55 HCOs in 7 countries in 2017 to over 220 HCOs in 30 countries in 2022. Over 19 000 sponsored clinical trial opportunities have been initiated through the TriNetX network. There have been over 350 peer-reviewed scientific publications based on the network's data. Conclusions: The continued growth of the TriNetX network and its yield of clinical trial collaborations and published studies indicate that this academic-industry structure is a safe, proven, sustainable path for building and maintaining research-centric data networks.
ABSTRACT
BACKGROUND: Over the last few decades, the ever-increasing output of scientific publications has led to new challenges to keep up to date with the literature. In the biomedical area, this growth has introduced new requirements for professionals, e.g., physicians, who have to locate the exact papers that they need for their clinical and research work amongst a huge number of publications. Against this backdrop, novel information retrieval methods are even more necessary. While web search engines are widespread in many areas, facilitating access to all kinds of information, additional tools are required to automatically link information retrieved from these engines to specific biomedical applications. In the case of clinical environments, this also means considering aspects such as patient data security and confidentiality or structured contents, e.g., electronic health records (EHRs). In this scenario, we have developed a new tool to facilitate query building to retrieve scientific literature related to EHRs. RESULTS: We have developed CDAPubMed, an open-source web browser extension to integrate EHR features in biomedical literature retrieval approaches. Clinical users can use CDAPubMed to: (i) load patient clinical documents, i.e., EHRs based on the Health Level 7-Clinical Document Architecture Standard (HL7-CDA), (ii) identify relevant terms for scientific literature search in these documents, i.e., Medical Subject Headings (MeSH), automatically driven by the CDAPubMed configuration, which advanced users can optimize to adapt to each specific situation, and (iii) generate and launch literature search queries to a major search engine, i.e., PubMed, to retrieve citations related to the EHR under examination. CONCLUSIONS: CDAPubMed is a platform-independent tool designed to facilitate literature searching using keywords contained in specific EHRs. CDAPubMed is visually integrated, as an extension of a widespread web browser, within the standard PubMed interface. 
It has been tested on a public dataset of HL7-CDA documents, returning significantly fewer citations because queries are focused on characteristics identified within the EHR. For instance, compared with the more than 200,000 citations retrieved by the query "breast neoplasm", fewer than ten citations were retrieved when ten patient features were added using CDAPubMed. CDAPubMed is an open-source tool that can be freely used for non-profit purposes and integrated with other existing systems.
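The narrowing effect described above comes from AND-ing the MeSH terms extracted from the EHR into a single query. A minimal sketch of such query building (the function name and the `[MeSH Terms]` field tag usage are illustrative assumptions, not CDAPubMed's actual implementation):

```python
def build_pubmed_query(mesh_terms):
    """Join MeSH terms extracted from an EHR into a single AND-ed
    PubMed query string, tagging each term with the [MeSH Terms]
    search field. Illustrative sketch only: CDAPubMed's real query
    construction and configuration options are richer than this.
    """
    return " AND ".join(f'"{t}"[MeSH Terms]' for t in mesh_terms)
```

Each additional patient feature adds one more AND clause, which is why ten features can narrow 200,000 hits down to a handful.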
Subject(s)
Electronic Health Records , Information Storage and Retrieval/methods , Internet , Periodicals as Topic , PubMed , Documentation/standards , Medical Subject Headings , Software Design , Systems Integration
ABSTRACT
Reuse of Electronic Health Records (EHRs) for specific diseases such as COVID-19 requires data to be recorded and persisted according to international standards. Since the beginning of the COVID-19 pandemic, Hospital Universitario 12 de Octubre (H12O) evolved its EHRs: it identified, modeled and standardized the concepts related to this new disease in an agile, flexible and staged way. Thus, data from more than 200,000 COVID-19 cases were extracted, transformed, and loaded into an i2b2 repository. This effort allowed H12O to share data with worldwide networks such as the TriNetX platform and the 4CE Consortium.
Subject(s)
COVID-19 , COVID-19/epidemiology , Electronic Health Records , Humans , Pandemics
ABSTRACT
SUMMARY: PubDNA Finder is an online repository that we have created to link PubMed Central manuscripts to the sequences of nucleic acids appearing in them. It extends the search capabilities provided by PubMed Central by enabling researchers to perform advanced searches involving sequences of nucleic acids. This includes, among other features, (i) searching for papers mentioning one or more specific sequences of nucleic acids and (ii) retrieving the genetic sequences appearing in different articles. These additional query capabilities are provided by a searchable index that we created using the full text of the 176 672 papers available at PubMed Central at the time of writing and the sequences of nucleic acids appearing in them. To automatically extract the genetic sequences occurring in each paper, we used a method of our own design. The database is updated monthly by automatically connecting to the PubMed Central FTP site to retrieve and index new manuscripts. Users can query the database via the web interface provided. AVAILABILITY: PubDNA Finder can be freely accessed at http://servet.dia.fi.upm.es:8080/pubdnafinder
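The abstract does not describe the authors' extraction method, but the core idea of finding candidate nucleic acid sequences in free text can be sketched with a simple pattern match over the IUPAC nucleotide alphabet (the 15-character minimum length is an assumption to avoid matching ordinary words, not a value from the paper):

```python
import re

# Runs of at least 15 IUPAC nucleotide letters (A, C, G, T, U plus
# ambiguity codes) are treated as candidate sequences; the length
# threshold is an assumed heuristic, since short runs of these
# letters occur in normal English words.
SEQ_PATTERN = re.compile(r"\b[ACGTURYKMSWBDHVN]{15,}\b", re.IGNORECASE)

def extract_sequences(text):
    """Return candidate nucleic acid sequences found in free text,
    uppercased for normalization."""
    return [m.group(0).upper() for m in SEQ_PATTERN.finditer(text)]
```

A real extractor would need further refinement (handling spacing, 5'/3' labels, and false positives), which is presumably where the authors' original method goes beyond this sketch.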
Subject(s)
Base Sequence , Computational Biology/methods , Databases, Genetic , Internet , Nucleic Acids/chemistry , Software , PubMed
ABSTRACT
BACKGROUND: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and treatment of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. It is therefore becoming increasingly important for researchers to be able to navigate this information efficiently. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. RESULTS: We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. CONCLUSIONS: We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and treat different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch.
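Phase (3), refining problem sequences with rules, can be illustrated with a toy normalization step; the specific rules below (stripping 5'/3' end labels, whitespace, and position numbers) are assumptions about the kinds of artifacts primer notation introduces, not the paper's actual rule set:

```python
import re

def refine_sequence(raw):
    """Normalize a candidate primer sequence extracted from text:
    drop 5'/3' end labels, then strip whitespace, digits (position
    numbering) and dashes. Illustrative of the kind of rule phase (3)
    might apply; the project's actual expert-system rules are not
    described at this level of detail in the abstract."""
    s = raw.upper()
    s = re.sub(r"5'|3'", "", s)        # strip end labels
    s = re.sub(r"[\s\d\-]+", "", s)    # strip spacing, numbering, dashes
    return s
```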
Subject(s)
DNA Primers/genetics , DNA Probes/genetics , Data Mining , Databases, Genetic , Base Sequence , DNA Primers/chemistry , DNA Probes/chemistry , Periodicals as Topic
ABSTRACT
The present work provides a real-world case of the connection of a hospital, 12 de Octubre University Hospital in Spain, to the TriNetX research network, transforming a compilation of disparate sources into a single harmonized repository that is automatically refreshed every day. It describes the different integration phases: terminology core datasets, specialized sources, and finally automated refresh. It also explains the work performed on semantic normalization of the clinical terminologies involved, as well as the resulting benefits the InSite platform services have enabled in the form of research opportunities for the hospital.
Subject(s)
Semantics , Systematized Nomenclature of Medicine , Spain
ABSTRACT
INTRODUCTION: The introduction of omics data and advances in technologies involved in clinical treatment have led to a broad range of approaches to represent clinical information. Within this context, patient stratification by omics profile across health institutions creates a complex scenario for carrying out multi-center clinical trials. METHODS: This paper presents a standards-based approach to ensure the semantic integration required to facilitate the analysis of clinico-genomic clinical trials. To ensure interoperability across different institutions, we have developed a Semantic Interoperability Layer (SIL) to facilitate homogeneous access to clinical and genetic information, based on different well-established biomedical standards and following Integrating the Healthcare Enterprise (IHE) recommendations. RESULTS: The SIL has shown suitability for integrating biomedical knowledge and technologies to match the latest clinical advances in healthcare and the use of genomic information. This genomic data integration in the SIL has been tested with a diagnostic classifier tool that takes advantage of harmonized multi-center clinico-genomic data for training statistical predictive models. CONCLUSIONS: The SIL has been adopted in national and international research initiatives, such as the EURECA-EU research project and the CIMED collaborative Spanish project, where the proposed solution has been applied and evaluated by clinical experts focused on clinico-genomic studies.
Subject(s)
Breast Neoplasms/genetics , Gene Expression , Semantics , Female , Humans
ABSTRACT
This paper describes a new Cohort Selection application implemented to support streamlining the definition phase of multi-centric clinical research in oncology. Our approach aims at both ease of use and precision in defining the selection filters expressing the characteristics of the desired population. The application leverages our standards-based Semantic Interoperability Solution and a Groovy DSL to provide high expressiveness in the definition of filters and flexibility in their composition into complex selection graphs, including splits and merges. Widely adopted ontologies such as SNOMED CT are used to represent the semantics of the data and to express concepts in the application filters, facilitating data sharing and collaboration on joint research questions in large communities of clinical users. The application supports patient data exploration and efficient collaboration in multi-site, heterogeneous and distributed data environments.
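The composition of filters into selection graphs with splits and merges can be sketched as combinators over predicate functions (this is an illustrative model in Python, not the application's actual Groovy DSL; the field names are invented for the example):

```python
# Filters as predicates over a patient record, modeled here as a
# plain dict. Field names ("diagnosis", "er") are illustrative only.
def has_value(field, value):
    """Atomic filter: the record's field equals the given value."""
    return lambda p: p.get(field) == value

def all_of(*filters):
    """Merge node: a patient passes only if every branch matches."""
    return lambda p: all(f(p) for f in filters)

def any_of(*filters):
    """Split re-merged as a union: any matching branch suffices."""
    return lambda p: any(f(p) for f in filters)

def select(cohort, flt):
    """Apply a composed filter graph to a list of patient records."""
    return [p for p in cohort if flt(p)]
```

Complex selection graphs then arise naturally from nesting `all_of` and `any_of`, which mirrors how splits and merges combine simpler filters.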
ABSTRACT
Advances in the use of omic data and other biomarkers are increasing the number of variables in clinical research. Additional data have stratified the patient population and require that current studies be performed across multiple institutions. Semantic interoperability and standardized data representation are crucial tasks in the management of modern clinical trials. In recent years, different efforts have focused on integrating biomedical information. Due to the complexity of this domain and the specific requirements of clinical research, the majority of data integration tasks are still performed manually. This paper presents a semantic normalization process and a query abstraction mechanism to facilitate data integration and retrieval. A process based on well-established standards from the biomedical domain and the latest semantic web technologies has been developed. The methods proposed in this paper have been tested within the EURECA EU research project, where clinical scenarios require the extraction of semantic knowledge from biomedical vocabularies. The aim of this paper is to provide a novel method to abstract from the data model and query syntax. The proposed approach has been compared with other initiatives in the field by storing the same dataset with each of those solutions. Results show extended functionality and query capabilities at the cost of slightly worse query execution performance. Implementations in real settings have shown that, following this approach, usable interfaces can be developed to exploit clinical trial data outcomes.
Subject(s)
Abstracting and Indexing/standards , Clinical Trials as Topic , Electronic Health Records , Systematized Nomenclature of Medicine , Humans
ABSTRACT
BACKGROUND AND OBJECTIVES: Post-genomic clinical trials require the participation of multiple institutions and the collection of data from several hospitals, laboratories and research facilities. This paper presents a standards-based solution to provide a uniform access endpoint to patient data involved in current clinical research. METHODS: The proposed approach exploits well-established standards such as HL7 v3 or SPARQL and medical vocabularies such as SNOMED CT, LOINC and HGNC. A novel mechanism to exploit semantic normalization between HL7-based data models and biomedical ontologies has been created using Semantic Web technologies. RESULTS: Different types of queries have been used to test the semantic interoperability solution described in this paper. The execution times obtained in the tests enable the development of end-user tools within a framework that requires efficient retrieval of integrated data. CONCLUSIONS: The proposed approach has been successfully tested by applications within the INTEGRATE and EURECA EU projects. These applications have been deployed and tested for: (i) patient screening, (ii) trial recruitment, and (iii) retrospective analysis; exploiting semantically interoperable access to clinical patient data from heterogeneous data sources.
Subject(s)
Breast Neoplasms/therapy , Clinical Trials as Topic/statistics & numerical data , Computational Biology , Database Management Systems/statistics & numerical data , Databases, Factual/statistics & numerical data , Female , Humans , Information Storage and Retrieval/statistics & numerical data , Internet , Multicenter Studies as Topic/statistics & numerical data , Terminology as Topic
ABSTRACT
This paper describes the data transformation pipeline defined to support the integration of a new clinical site in a standards-based semantic interoperability environment. The available datasets combined structured and free-text patient data in Dutch, collected in the context of radiation therapy in several cancer types. Our approach aims at both efficiency and data quality. We combine custom-developed scripts, standard tools and manual validation by clinical and knowledge experts. We identified key challenges emerging from the several sources of heterogeneity in our case study (systems, language, data structure, clinical domain) and implemented solutions that we will further generalize for the integration of new sites. We conclude that the required effort for data transformation is manageable, which supports the feasibility of our semantic interoperability solution. The achieved semantic interoperability will be leveraged for the deployment and evaluation at the clinical site of applications enabling secondary use of care data for research. This work has been funded by the European Commission through the INTEGRATE (FP7-ICT-2009-6-270253) and EURECA (FP7-ICT-2011-288048) projects.
ABSTRACT
To support the efficient execution of post-genomic multi-centric clinical trials in breast cancer, we propose a solution that streamlines the assessment of the eligibility of patients for available trials. Assessing a patient's eligibility for a trial requires evaluating whether each eligibility criterion is satisfied, and it is often a time-consuming, manual task. The main focus in the literature has been on proposing different methods for modelling and formalizing the eligibility criteria. However, the current adoption of these approaches in clinical care is limited. Less effort has been dedicated to the automatic matching of criteria to the patient data managed in clinical care. We address both aspects and propose a scalable, efficient and pragmatic patient screening solution enabling automatic evaluation of the eligibility of patients for a relevant set of trials. This covers the flexible formalization of criteria and of other relevant trial metadata, as well as the efficient management of these representations.
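The per-criterion evaluation described above can be modeled as running a list of named predicates over structured patient data and reporting both the overall verdict and which criteria failed (a deliberate simplification: the paper's formalized criteria and trial metadata are far richer than plain Python predicates, and the field names below are invented for the example):

```python
def assess_eligibility(patient, criteria):
    """Evaluate each eligibility criterion against structured patient
    data. `criteria` is a list of (name, predicate) pairs; returns
    (eligible, per_criterion_results) so failing criteria can be
    reported to the user, mirroring the per-criterion assessment the
    abstract describes. Sketch only, not the project's formalism."""
    results = {name: bool(pred(patient)) for name, pred in criteria}
    return all(results.values()), results
```

In a screening setting, the same patient record would be matched against the criteria sets of many trials, which is where scalable management of the criterion representations becomes important.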
Subject(s)
Breast Neoplasms/therapy , Clinical Trials as Topic/methods , Data Mining/methods , Eligibility Determination/methods , Medical Records Systems, Computerized/organization & administration , Natural Language Processing , Patient Selection , Breast Neoplasms/diagnosis , Europe , Female , Humans , Medical Records Systems, Computerized/classification , Semantics , Vocabulary, Controlled
ABSTRACT
Nanoinformatics is an emerging research field that uses informatics techniques to collect, process, store, and retrieve data, information, and knowledge on nanoparticles, nanomaterials, and nanodevices and their potential applications in health care. In this paper, we have focused on the solutions that nanoinformatics can provide to facilitate nanotoxicology research. For this, we have taken a computational approach to automatically recognize and extract nanotoxicology-related entities from the scientific literature. The desired entities belong to four different categories: nanoparticles, routes of exposure, toxic effects, and targets. The entity recognizer was trained using a corpus that we created specifically for this purpose and that was validated by two nanomedicine/nanotoxicology experts. We evaluated the performance of our entity recognizer using 10-fold cross-validation. Precision values range from 87.6% (targets) to 93.0% (routes of exposure), while recall values range from 82.6% (routes of exposure) to 87.4% (toxic effects). These results demonstrate the feasibility of using computational approaches to reliably perform different named entity recognition (NER)-dependent tasks, such as augmented reading or semantic search. This research is a "proof of concept" that can be expanded to stimulate further developments that could assist researchers in managing data, information, and knowledge at the nanolevel, thus accelerating research in nanomedicine.
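The precision and recall figures quoted above follow the standard definitions over true positives, false positives and false negatives; a minimal sketch (the example counts are invented, not taken from the paper's evaluation):

```python
def precision_recall(tp, fp, fn):
    """Standard NER evaluation metrics:
    precision = TP / (TP + FP)  -- fraction of extracted entities
                                   that are correct
    recall    = TP / (TP + FN)  -- fraction of true entities that
                                   were extracted
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall
```

In 10-fold cross-validation these counts are accumulated per fold for each entity category, which is how the per-category ranges in the abstract arise.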
Subject(s)
Computational Biology/methods , Data Mining , Nanomedicine/methods , Algorithms , Databases, Bibliographic , Humans , Reproducibility of Results , Software
ABSTRACT
Breast cancer clinical trial researchers have to handle heterogeneous data coming from different data sources, which overloads biomedical researchers when they need to query data for retrospective analysis. This paper presents the Common Data Model (CDM) proposed within the INTEGRATE EU project to homogenize data coming from different clinical partners. This CDM is based on the Reference Information Model (RIM) from Health Level 7 (HL7) version 3. Semantic capabilities through a SPARQL endpoint were also required to ensure the sustainability of the platform. For the SPARQL endpoint implementation, a relational SQL database with D2R was compared against an RDF database. The results show that the first option can store all clinical data received from the institutions participating in the project with better performance. The platform has also been evaluated by the EU Commission within a patient recruitment demonstrator.
Subject(s)
Breast Neoplasms/classification , Clinical Trials as Topic/standards , Health Level Seven , Information Storage and Retrieval/standards , Medical Record Linkage/standards , Semantics , Vocabulary, Controlled , Data Mining/standards , European Union , Female , Humans , Practice Guidelines as Topic , Systems Integration
ABSTRACT
Current post-genomic clinical trials in cancer involve the collaboration of several institutions. Multi-centric retrospective analysis requires advanced methods to ensure semantic interoperability. In this scenario, the objective of the EU-funded INTEGRATE project is to provide an infrastructure to share knowledge and data in post-genomic breast cancer clinical trials. This paper presents the process carried out in this project to bind domain terminologies in the area, such as SNOMED CT, to the HL7 v3 Reference Information Model (RIM). The proposed terminology binding follows the HL7 recommendations, but must also address important issues such as overlapping concepts and domain terminology coverage. Although there are limitations due to the large heterogeneity of the data in the area, the proposed process has been successfully applied within the context of the INTEGRATE project. Improving the semantic interoperability of patient data from modern breast cancer clinical trials ultimately aims to enhance clinical practice in oncology.
Subject(s)
Breast Neoplasms/classification , Clinical Trials as Topic/standards , Electronic Health Records/standards , Health Level Seven/standards , Natural Language Processing , Systematized Nomenclature of Medicine , Terminology as Topic , Breast Neoplasms/genetics , Breast Neoplasms/therapy , Female , Genomics/standards , Humans , Information Storage and Retrieval/standards , Medical Record Linkage/standards
ABSTRACT
Knowledge discovery approaches in modern biomedical research usually require access to heterogeneous and remote data sources in a distributed environment. Traditional knowledge discovery in databases (KDD) models assumed a central repository and lacked mechanisms to access decentralized databases. In such a distributed environment, ontologies can be used in all KDD phases. Here we present a new ontology-based KDD model to improve data preprocessing from heterogeneous sources.
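One concrete way ontologies help in the preprocessing phase is code normalization: mapping each site's local codes onto shared ontology concepts before records are pooled for mining. A toy sketch (the mapping table, site names, local codes and the `ONTO:` concept identifiers are all invented placeholders, not from the paper or any real terminology):

```python
# Toy mapping table standing in for an ontology-backed terminology
# service. All keys and concept IDs here are invented placeholders.
LOCAL_TO_ONTOLOGY = {
    ("site_a", "ER+"): "ONTO:ER_POSITIVE",
    ("site_b", "estrogen_receptor_positive"): "ONTO:ER_POSITIVE",
}

def normalize(site, code):
    """Map a site-specific code to a shared ontology concept so that
    records from heterogeneous sources can be pooled before mining.
    Returns None for unmapped codes, which a real pipeline would
    flag for manual curation."""
    return LOCAL_TO_ONTOLOGY.get((site, code))
```

After this step, two sites that record the same clinical fact under different local codes contribute comparable rows to the mining dataset.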
Subject(s)
Artificial Intelligence , Information Management , Vocabulary, Controlled , Biomedical Research , Data Collection/methods , Information Storage and Retrieval
ABSTRACT
Over the last three years, several initiatives have been deployed within INFOBIOMED, the European Network of Excellence (NoE) in Biomedical Informatics, to promote research and education. In the context of genomic medicine, four research pilots were designed. New educational approaches are needed to address the informational complexities of such research problems.