Search | VHL Regional Portal

1.

Automatic Outlier Detection in Laboratory Result Distributions Within a Real World Data Network.

Muñoz Monjas, Aída; Rubio Ruiz, David; Pérez-Rey, David; Palchuk, Matvey.

Stud Health Technol Inform ; 302: 88-92, 2023 May 18.

Article in English | MEDLINE | ID: mdl-37203615

ABSTRACT

Laboratory data must be interoperable to be able to accurately compare the results of a lab test between healthcare organizations. To achieve this, terminologies like LOINC (Logical Observation Identifiers, Names and Codes) provide unique identification codes for laboratory tests. Once standardized, the numeric results of laboratory tests can be aggregated and represented in histograms. Due to the characteristics of Real World Data (RWD), outliers and abnormal values are common, but these cases should be treated as exceptions, excluding them from possible analysis. The proposed work analyses two methods capable of automating the selection of histogram limits to sanitize the generated lab test result distributions, Tukey's box-plot method and a "Distance to Density" approach, within the TriNetX Real World Data Network. The generated limits using clinical RWD are generally wider for Tukey's method and narrower for the second method, both greatly dependent on the values used for the algorithm's parameters.

Subject(s)

Laboratories , Logical Observation Identifiers Names and Codes

2.

A global federated real-world data and analytics platform for research.

Palchuk, Matvey B; London, Jack W; Perez-Rey, David; Drebert, Zuzanna J; Winer-Jones, Jessamine P; Thompson, Courtney N; Esposito, John; Claerhout, Brecht.

JAMIA Open ; 6(2): ooad035, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37193038

ABSTRACT

Objective: This article describes a scalable, performant, sustainable global network of electronic health record data for biomedical and clinical research. Materials and Methods: TriNetX has created a technology platform characterized by a conservative security and governance model that facilitates collaboration and cooperation between industry participants, such as pharmaceutical companies and contract research organizations, and academic and community-based healthcare organizations (HCOs). HCOs participate on the network in return for access to a suite of analytics capabilities, large networks of de-identified data, and more sponsored trial opportunities. Industry participants provide the financial resources to support, expand, and improve the technology platform in return for access to network data, which provides increased efficiencies in clinical trial design and deployment. Results: TriNetX is a growing global network, expanding from 55 HCOs and 7 countries in 2017 to over 220 HCOs and 30 countries in 2022. Over 19 000 sponsored clinical trial opportunities have been initiated through the TriNetX network. There have been over 350 peer-reviewed scientific publications based on the network's data. Conclusions: The continued growth of the TriNetX network and its yield of clinical trial collaborations and published studies indicates that this academic-industry structure is a safe, proven, sustainable path for building and maintaining research-centric data networks.

3.

Building an i2b2-Based Population Repository for COVID-19 Research.

Pedrera-Jimenez, Miguel; Garcia-Barrio, Noelia; Hernandez-Ibarburu, Gema; Baselga, Blanca; Blanco, Alvar; Calvo-Boyero, Fernando; Gutierrez-Sacristan, Alba; Quiros, Víctor; Cruz-Bermudez, Juan Luis; Bernal, José Luis; Meloni, Laura; Perez-Rey, David; Palchuk, Matvey; Kohane, Isaac; Serrano, Pablo.

Stud Health Technol Inform ; 294: 287-291, 2022 May 25.

Article in English | MEDLINE | ID: mdl-35612078

ABSTRACT

Reuse of Electronic Health Records (EHRs) for specific diseases such as COVID-19 requires data to be recorded and persisted according to international standards. Since the beginning of the COVID-19 pandemic, Hospital Universitario 12 de Octubre (H12O) evolved its EHRs: it identified, modeled and standardized the concepts related to this new disease in an agile, flexible and staged way. Thus, data from more than 200,000 COVID-19 cases were extracted, transformed, and loaded into an i2b2 repository. This effort allowed H12O to share data with worldwide networks such as the TriNetX platform and the 4CE Consortium.

Subject(s)

COVID-19 , COVID-19/epidemiology , Electronic Health Records , Humans , Pandemics

4.

Building an I2B2-Based Population Repository for Clinical Research.

González, Lydia; Pérez-Rey, David; Alonso, Enrique; Hernández, Gema; Serrano, Pablo; Pedrera, Miguel; Gómez, Agustín; De Schepper, Kristof; Crepain, Titus; Claerhout, Brecht.

Stud Health Technol Inform ; 270: 78-82, 2020 Jun 16.

Article in English | MEDLINE | ID: mdl-32570350

ABSTRACT

The present work provides a real-world case of the connection process of a hospital, 12 de Octubre University Hospital in Spain, to the TriNetX research network, transforming a compilation of disparate sources into a single harmonized repository which is automatically refreshed every day. It describes the different integration phases: terminology core datasets, specialized sources and eventually automatic refreshment. It also explains the work performed on semantic normalization of the involved clinical terminologies; as well as the resulting benefits the InSite platform services have enabled in the form of research opportunities for the hospital.

Subject(s)

Semantics , Systematized Nomenclature of Medicine , Spain

5.

A semantic interoperability approach to support integration of gene expression and clinical data in breast cancer.

Alonso-Calvo, Raul; Paraiso-Medina, Sergio; Perez-Rey, David; Alonso-Oset, Enrique; van Stiphout, Ruud; Yu, Sheng; Taylor, Marian; Buffa, Francesca; Fernandez-Lozano, Carlos; Pazos, Alejandro; Maojo, Victor.

Comput Biol Med ; 87: 179-186, 2017 08 01.

Article in English | MEDLINE | ID: mdl-28601027

ABSTRACT

INTRODUCTION: The introduction of omics data and advances in technologies involved in clinical treatment has led to a broad range of approaches to represent clinical information. Within this context, patient stratification across health institutions due to omic profiling presents a complex scenario to carry out multi-center clinical trials. METHODS: This paper presents a standards-based approach to ensure semantic integration required to facilitate the analysis of clinico-genomic clinical trials. To ensure interoperability across different institutions, we have developed a Semantic Interoperability Layer (SIL) to facilitate homogeneous access to clinical and genetic information, based on different well-established biomedical standards and following International Health (IHE) recommendations. RESULTS: The SIL has shown suitability for integrating biomedical knowledge and technologies to match the latest clinical advances in healthcare and the use of genomic information. This genomic data integration in the SIL has been tested with a diagnostic classifier tool that takes advantage of harmonized multi-center clinico-genomic data for training statistical predictive models. CONCLUSIONS: The SIL has been adopted in national and international research initiatives, such as the EURECA-EU research project and the CIMED collaborative Spanish project, where the proposed solution has been applied and evaluated by clinical experts focused on clinico-genomic studies.

Subject(s)

Breast Neoplasms/genetics , Gene Expression , Semantics , Female , Humans

6.

Cohort Selection and Management Application Leveraging Standards-based Semantic Interoperability and a Groovy DSL.

Bucur, Anca; van Leeuwen, Jasper; Chen, Njin-Zu; Claerhout, Brecht; de Schepper, Kristof; Perez-Rey, David; Paraiso-Medina, Sergio; Alonso-Calvo, Raul; Mehta, Keyur; Krykwinski, Cyril.

AMIA Jt Summits Transl Sci Proc ; 2016: 25-32, 2016.

Article in English | MEDLINE | ID: mdl-27570644

ABSTRACT

This paper describes a new Cohort Selection application implemented to support streamlining the definition phase of multi-centric clinical research in oncology. Our approach aims at both ease of use and precision in defining the selection filters expressing the characteristics of the desired population. The application leverages our standards-based Semantic Interoperability Solution and a Groovy DSL to provide high expressiveness in the definition of filters and flexibility in their composition into complex selection graphs including splits and merges. Widely-adopted ontologies such as SNOMED-CT are used to represent the semantics of the data and to express concepts in the application filters, facilitating data sharing and collaboration on joint research questions in large communities of clinical users. The application supports patient data exploration and efficient collaboration in multi-site, heterogeneous and distributed data environments.

7.

The INTEGRATE project: Delivering solutions for efficient multi-centric clinical research and trials.

Kondylakis, Haridimos; Claerhout, Brecht; Keyur, Mehta; Koumakis, Lefteris; van Leeuwen, Jasper; Marias, Kostas; Perez-Rey, David; De Schepper, Kristof; Tsiknakis, Manolis; Bucur, Anca.

J Biomed Inform ; 62: 32-47, 2016 08.

Article in English | MEDLINE | ID: mdl-27224847

ABSTRACT

The objective of the INTEGRATE project (http://www.fp7-integrate.eu/) that has recently concluded successfully was the development of innovative biomedical applications focused on streamlining the execution of clinical research, on enabling multidisciplinary collaboration, on management and large-scale sharing of multi-level heterogeneous datasets, and on the development of new methodologies and of predictive multi-scale models in cancer. In this paper, we present the way the INTEGRATE consortium has approached important challenges such as the integration of multi-scale biomedical data in the context of post-genomic clinical trials, the development of predictive models and the implementation of tools to facilitate the efficient execution of postgenomic multi-centric clinical trials in breast cancer. Furthermore, we provide a number of key "lessons learned" during the process and give directions for further future research and development.

Subject(s)

Biomedical Research , Database Management Systems , Genomics , Breast Neoplasms/genetics , Clinical Trials as Topic , Computational Biology , Databases, Factual , Humans

8.

Case Study for Integration of an Oncology Clinical Site in a Semantic Interoperability Solution based on HL7 v3 and SNOMED-CT: Data Transformation Needs.

Ibrahim, Ahmed; Bucur, Anca; Perez-Rey, David; Alonso, Enrique; de Hoog, Matthy; Dekker, Andre; Marshall, M Scott.

AMIA Jt Summits Transl Sci Proc ; 2015: 71, 2015.

Article in English | MEDLINE | ID: mdl-26306242

ABSTRACT

This paper describes the data transformation pipeline defined to support the integration of a new clinical site in a standards-based semantic interoperability environment. The available datasets combined structured and free-text patient data in Dutch, collected in the context of radiation therapy in several cancer types. Our approach aims at both efficiency and data quality. We combine custom-developed scripts, standard tools and manual validation by clinical and knowledge experts. We identified key challenges emerging from the several sources of heterogeneity in our case study (systems, language, data structure, clinical domain) and implemented solutions that we will further generalize for the integration of new sites. We conclude that the required effort for data transformation is manageable which supports the feasibility of our semantic interoperability solution. The achieved semantic interoperability will be leveraged for the deployment and evaluation at the clinical site of applications enabling secondary use of care data for research. This work has been funded by the European Commission through the INTEGRATE (FP7-ICT-2009-6-270253) and EURECA (FP7-ICT-2011-288048) projects.

9.

Enabling semantic interoperability in multi-centric clinical trials on breast cancer.

Alonso-Calvo, Raul; Perez-Rey, David; Paraiso-Medina, Sergio; Claerhout, Brecht; Hennebert, Philippe; Bucur, Anca.

Comput Methods Programs Biomed ; 118(3): 322-9, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25682737

ABSTRACT

BACKGROUND AND OBJECTIVES: Post-genomic clinical trials require the participation of multiple institutions, and collecting data from several hospitals, laboratories and research facilities. This paper presents a standard-based solution to provide a uniform access endpoint to patient data involved in current clinical research. METHODS: The proposed approach exploits well-established standards such as HL7 v3 or SPARQL and medical vocabularies such as SNOMED CT, LOINC and HGNC. A novel mechanism to exploit semantic normalization among HL7-based data models and biomedical ontologies has been created by using Semantic Web technologies. RESULTS: Different types of queries have been used for testing the semantic interoperability solution described in this paper. The execution times obtained in the tests enable the development of end user tools within a framework that requires efficient retrieval of integrated data. CONCLUSIONS: The proposed approach has been successfully tested by applications within the INTEGRATE and EURECA EU projects. These applications have been deployed and tested for: (i) patient screening, (ii) trial recruitment, and (iii) retrospective analysis; exploiting semantically interoperable access to clinical patient data from heterogeneous data sources.

Subject(s)

Breast Neoplasms/therapy , Clinical Trials as Topic/statistics & numerical data , Computational Biology , Database Management Systems/statistics & numerical data , Databases, Factual/statistics & numerical data , Female , Humans , Information Storage and Retrieval/statistics & numerical data , Internet , Multicenter Studies as Topic/statistics & numerical data , Terminology as Topic

10.

Semantic Normalization and Query Abstraction Based on SNOMED-CT and HL7: Supporting Multicentric Clinical Trials.

Paraiso-Medina, Sergio; Perez-Rey, David; Bucur, Anca; Claerhout, Brecht; Alonso-Calvo, Raul.

IEEE J Biomed Health Inform ; 19(3): 1061-7, 2015 May.

Article in English | MEDLINE | ID: mdl-25248204

ABSTRACT

Advances in the use of omic data and other biomarkers are increasing the number of variables in clinical research. Additional data have stratified the population of patients and require that current studies be performed among multiple institutions. Semantic interoperability and standardized data representation are a crucial task in the management of modern clinical trials. In the past few years, different efforts have focused on integrating biomedical information. Due to the complexity of this domain and the specific requirements of clinical research, the majority of data integration tasks are still performed manually. This paper presents a semantic normalization process and a query abstraction mechanism to facilitate data integration and retrieval. A process based on well-established standards from the biomedical domain and the latest semantic web technologies has been developed. Methods proposed in this paper have been tested within the EURECA EU research project, where clinical scenarios require the extraction of semantic knowledge from biomedical vocabularies. The aim of this paper is to provide a novel method to abstract from the data model and query syntax. The proposed approach has been compared with other initiatives in the field by storing the same dataset with each of those solutions. Results show an extended functionality and query capabilities at the cost of slightly worse performance in query execution. Implementations in real settings have shown that following this approach, usable interfaces can be developed to exploit clinical trial data outcomes.

Subject(s)

Abstracting and Indexing/standards , Clinical Trials as Topic , Electronic Health Records , Systematized Nomenclature of Medicine , Humans

11.

Supporting patient screening to identify suitable clinical trials.

Bucur, Anca; Van Leeuwen, Jasper; Chen, Njin-Zu; Claerhout, Brecht; De Schepper, Kristof; Perez-Rey, David; Alonso-Calvo, Raul; Pugliano, Lina; Saini, Kamal.

Stud Health Technol Inform ; 205: 823-7, 2014.

Article in English | MEDLINE | ID: mdl-25160302

ABSTRACT

To support the efficient execution of post-genomic multi-centric clinical trials in breast cancer we propose a solution that streamlines the assessment of the eligibility of patients for available trials. The assessment of the eligibility of a patient for a trial requires evaluating whether each eligibility criterion is satisfied and is often a time consuming and manual task. The main focus in the literature has been on proposing different methods for modelling and formalizing the eligibility criteria. However the current adoption of these approaches in clinical care is limited. Less effort has been dedicated to the automatic matching of criteria to the patient data managed in clinical care. We address both aspects and propose a scalable, efficient and pragmatic patient screening solution enabling automatic evaluation of eligibility of patients for a relevant set of trials. This covers the flexible formalization of criteria and of other relevant trial metadata and the efficient management of these representations.

Subject(s)

Breast Neoplasms/therapy , Clinical Trials as Topic/methods , Data Mining/methods , Eligibility Determination/methods , Medical Records Systems, Computerized/organization & administration , Natural Language Processing , Patient Selection , Breast Neoplasms/diagnosis , Europe , Female , Humans , Medical Records Systems, Computerized/classification , Semantics , Vocabulary, Controlled

12.

A data model based on semantically enhanced HL7 RIM for sharing patient data of breast cancer clinical trials.

Moratilla, Juan M; Alonso-Calvo, Raul; Molina-Vaquero, Gema; Paraiso-Medina, Sergio; Perez-Rey, David; Maojo, Victor.

Stud Health Technol Inform ; 192: 971, 2013.

Article in English | MEDLINE | ID: mdl-23920745

ABSTRACT

Breast cancer clinical trial researchers have to handle heterogeneous data coming from different data sources, overloading biomedical researchers when they need to query data for retrospective analysis. This paper presents the Common Data Model (CDM) proposed within the INTEGRATE EU project to homogenize data coming from different clinical partners. This CDM is based on the Reference Information Model (RIM) from the Health Level 7 (HL7) version 3. Semantic capabilities through an SPARQL endpoint were also required to ensure the sustainability of the platform. For the SPARQL endpoint implementation, a comparison has been carried out between a Relational SQL database + D2R and a RDF database. The results show that the first option can store all clinical data received from institutions participating in the project with a better performance. It has been also evaluated by the EU Commission within a patient recruitment demonstrator.

Subject(s)

Breast Neoplasms/classification , Clinical Trials as Topic/standards , Health Level Seven , Information Storage and Retrieval/standards , Medical Record Linkage/standards , Semantics , Vocabulary, Controlled , Data Mining/standards , European Union , Female , Humans , Practice Guidelines as Topic , Systems Integration

13.

Analyzing SNOMED CT and HL7 terminology binding for semantic interoperability on post-genomic clinical trials.

Aso, Santiago; Perez-Rey, David; Alonso-Calvo, Raul; Rico-Diez, Antonio; Bucur, Anca; Claerhout, Brecht; Maojo, Victor.

Stud Health Technol Inform ; 192: 980, 2013.

Article in English | MEDLINE | ID: mdl-23920754

ABSTRACT

Current post-genomic clinical trials in cancer involve the collaboration of several institutions. Multi-centric retrospective analysis requires advanced methods to ensure semantic interoperability. In this scenario, the objective of the EU funded INTEGRATE project, is to provide an infrastructure to share knowledge and data in post-genomic breast cancer clinical trials. This paper presents the process carried out in this project, to bind domain terminologies in the area, such as SNOMED CT, with the HL7 v3 Reference Information Model (RIM). The proposed terminology binding follow the HL7 recommendations, but should also consider important issues such as overlapping concepts and domain terminology coverage. Although there are limitations due to the large heterogeneity of the data in the area, the proposed process has been successfully applied within the context of the INTEGRATE project. An improvement in semantic interoperability of patient data from modern breast cancer clinical trials, aims to enhance the clinical practice in oncology.

Subject(s)

Breast Neoplasms/classification , Clinical Trials as Topic/standards , Electronic Health Records/standards , Health Level Seven/standards , Natural Language Processing , Systematized Nomenclature of Medicine , Terminology as Topic , Breast Neoplasms/genetics , Breast Neoplasms/therapy , Female , Genomics/standards , Humans , Information Storage and Retrieval/standards , Medical Record Linkage/standards

14.

Using nanoinformatics methods for automatically identifying relevant nanotoxicology entities from the literature.

García-Remesal, Miguel; García-Ruiz, Alejandro; Pérez-Rey, David; de la Iglesia, Diana; Maojo, Víctor.

Biomed Res Int ; 2013: 410294, 2013.

Article in English | MEDLINE | ID: mdl-23509721

ABSTRACT

Nanoinformatics is an emerging research field that uses informatics techniques to collect, process, store, and retrieve data, information, and knowledge on nanoparticles, nanomaterials, and nanodevices and their potential applications in health care. In this paper, we have focused on the solutions that nanoinformatics can provide to facilitate nanotoxicology research. For this, we have taken a computational approach to automatically recognize and extract nanotoxicology-related entities from the scientific literature. The desired entities belong to four different categories: nanoparticles, routes of exposure, toxic effects, and targets. The entity recognizer was trained using a corpus that we specifically created for this purpose and was validated by two nanomedicine/nanotoxicology experts. We evaluated the performance of our entity recognizer using 10-fold cross-validation. The precisions range from 87.6% (targets) to 93.0% (routes of exposure), while recall values range from 82.6% (routes of exposure) to 87.4% (toxic effects). These results prove the feasibility of using computational approaches to reliably perform different named entity recognition (NER)-dependent tasks, such as for instance augmented reading or semantic searches. This research is a "proof of concept" that can be expanded to stimulate further developments that could assist researchers in managing data, information, and knowledge at the nanolevel, thus accelerating research in nanomedicine.

Subject(s)

Computational Biology/methods , Data Mining , Nanomedicine/methods , Algorithms , Databases, Bibliographic , Humans , Reproducibility of Results , Software

15.

CDAPubMed: a browser extension to retrieve EHR-based biomedical literature.

Perez-Rey, David; Jimenez-Castellanos, Ana; Garcia-Remesal, Miguel; Crespo, Jose; Maojo, Victor.

BMC Med Inform Decis Mak ; 12: 29, 2012 Apr 05.

Article in English | MEDLINE | ID: mdl-22480327

ABSTRACT

BACKGROUND: Over the last few decades, the ever-increasing output of scientific publications has led to new challenges to keep up to date with the literature. In the biomedical area, this growth has introduced new requirements for professionals, e.g., physicians, who have to locate the exact papers that they need for their clinical and research work amongst a huge number of publications. Against this backdrop, novel information retrieval methods are even more necessary. While web search engines are widespread in many areas, facilitating access to all kinds of information, additional tools are required to automatically link information retrieved from these engines to specific biomedical applications. In the case of clinical environments, this also means considering aspects such as patient data security and confidentiality or structured contents, e.g., electronic health records (EHRs). In this scenario, we have developed a new tool to facilitate query building to retrieve scientific literature related to EHRs. RESULTS: We have developed CDAPubMed, an open-source web browser extension to integrate EHR features in biomedical literature retrieval approaches. Clinical users can use CDAPubMed to: (i) load patient clinical documents, i.e., EHRs based on the Health Level 7-Clinical Document Architecture Standard (HL7-CDA), (ii) identify relevant terms for scientific literature search in these documents, i.e., Medical Subject Headings (MeSH), automatically driven by the CDAPubMed configuration, which advanced users can optimize to adapt to each specific situation, and (iii) generate and launch literature search queries to a major search engine, i.e., PubMed, to retrieve citations related to the EHR under examination. CONCLUSIONS: CDAPubMed is a platform-independent tool designed to facilitate literature searching using keywords contained in specific EHRs. CDAPubMed is visually integrated, as an extension of a widespread web browser, within the standard PubMed interface. It has been tested on a public dataset of HL7-CDA documents, returning significantly fewer citations since queries are focused on characteristics identified within the EHR. For instance, compared with more than 200,000 citations retrieved by breast neoplasm, fewer than ten citations were retrieved when ten patient features were added using CDAPubMed. This is an open source tool that can be freely used for non-profit purposes and integrated with other existing systems.

Subject(s)

Electronic Health Records , Information Storage and Retrieval/methods , Internet , Periodicals as Topic , PubMed , Documentation/standards , Medical Subject Headings , Software Design , Systems Integration

16.

PubDNA Finder: a web database linking full-text articles to sequences of nucleic acids.

García-Remesal, Miguel; Cuevas, Alejandro; Pérez-Rey, David; Martín, Luis; Anguita, Alberto; de la Iglesia, Diana; de la Calle, Guillermo; Crespo, José; Maojo, Víctor.

Bioinformatics ; 26(21): 2801-2, 2010 Nov 01.

Article in English | MEDLINE | ID: mdl-20829445

ABSTRACT

SUMMARY: PubDNA Finder is an online repository that we have created to link PubMed Central manuscripts to the sequences of nucleic acids appearing in them. It extends the search capabilities provided by PubMed Central by enabling researchers to perform advanced searches involving sequences of nucleic acids. This includes, among other features (i) searching for papers mentioning one or more specific sequences of nucleic acids and (ii) retrieving the genetic sequences appearing in different articles. These additional query capabilities are provided by a searchable index that we created by using the full text of the 176 672 papers available at PubMed Central at the time of writing and the sequences of nucleic acids appearing in them. To automatically extract the genetic sequences occurring in each paper, we used an original method we have developed. The database is updated monthly by automatically connecting to the PubMed Central FTP site to retrieve and index new manuscripts. Users can query the database via the web interface provided. AVAILABILITY: PubDNA Finder can be freely accessed at http://servet.dia.fi.upm.es:8080/pubdnafinder

Subject(s)

Base Sequence , Computational Biology/methods , Databases, Genetic , Internet , Nucleic Acids/chemistry , Software , PubMed

17.

A method for automatically extracting infectious disease-related primers and probes from the literature.

García-Remesal, Miguel; Cuevas, Alejandro; López-Alonso, Victoria; López-Campos, Guillermo; de la Calle, Guillermo; de la Iglesia, Diana; Pérez-Rey, David; Crespo, José; Martín-Sánchez, Fernando; Maojo, Víctor.

BMC Bioinformatics ; 11: 410, 2010 Aug 03.

Article in English | MEDLINE | ID: mdl-20682041

ABSTRACT

BACKGROUND: Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for different tasks, some related to the diagnosis and prescription of infectious diseases. The biological literature is the main information source for empirically validated primer and probe sequences. Therefore, it is becoming increasingly important for researchers to navigate this important information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. These phases are: (1) convert each document into a tree of paper sections, (2) detect the candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information. RESULTS: We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name. CONCLUSIONS: We believe that the proposed method can facilitate routine tasks for biomedical researchers using molecular methods to diagnose and prescribe different infectious diseases. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch.

Subject(s)

DNA Primers/genetics , DNA Probes/genetics , Data Mining , Databases, Genetic , Base Sequence , DNA Primers/chemistry , DNA Probes/chemistry , Periodicals as Topic

18.

Education and research in INFOBIOMED, the European Network of Excellence in Biomedical Informatics.

de la Calle, Guillermo; van Mulligen, Erik M; Molero, Eva; Perez-Rey, David; Martín, Luis; Crespo, José; Maojo, Victor.

AMIA Annu Symp Proc ; : 927, 2007 Oct 11.

Article in English | MEDLINE | ID: mdl-18694027

ABSTRACT

During the last three years several initiatives have been deployed within INFOBIOMED, the European Network of Excellence (NoE) in Biomedical Informatics, for promoting research and education. In the context of genomic medicine, four research pilots were designed. To address the informational complexities of such research problems, new educational approaches are needed.

Subject(s)

Medical Informatics/education , European Union , Genomics , International Educational Exchange

19.

An ontology-based and distributed KDD model for biomedical sources.

Perez-Rey, David; Anguita, Alberto; Crespo, Jose; Maojo, Victor.

AMIA Annu Symp Proc ; : 1074, 2007 Oct 11.

Article in English | MEDLINE | ID: mdl-18694172

ABSTRACT

Knowledge discovery approaches in modern biomedical research usually require to access heterogeneous and remote data sources in a distributed environment. Traditional KDD models assumed a central repository, lacking mechanisms to access decentralized databases. In such distributed environment, ontologies can be used in all the KDD phases. We present here a new model of ontology-based KDD approach to improve data preprocessing from heterogeneous sources.

Subject(s)

Artificial Intelligence , Information Management , Vocabulary, Controlled , Biomedical Research , Data Collection/methods , Information Storage and Retrieval

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL