Results 1 - 20 of 36
1.
BMJ Open ; 14(7): e079760, 2024 Jul 11.
Article in English | MEDLINE | ID: mdl-38991678

ABSTRACT

OBJECTIVES: In the midst of the pandemic, face-to-face data collection for national censuses and surveys was suspended due to limitations on mobility and social distancing, limiting the collection of already scarce disability data. Responses to these constraints were met with a surge of high-frequency phone surveys (HFPSs) that aimed to provide timely data for understanding the socioeconomic impacts of and responses to the pandemic. This paper provides an assessment of HFPS datasets and their inclusion of disability questions to evaluate the visibility of persons with disabilities during the COVID-19 pandemic. DESIGN: We collected HFPS questionnaires conducted globally from the onset of the pandemic emergency in March 2020 until December 2022 from various online survey repositories. Each HFPS questionnaire was searched using a set of keywords for inclusion of different types of disability questions. Results were recorded in an Excel review log, which was manually reviewed by two researchers. METHODS: The review of HFPS datasets involved two stages: (1) a main review of 294 HFPS dataset-waves and (2) a semiautomated review of the same dataset-waves using a search engine-powered questionnaire review tool developed by our team. The results from the main review were compared with those of a sensitivity analysis using and testing the tool as an alternative to manual search. RESULTS: Roughly half of HFPS datasets reviewed and 60% of the countries included in this study had some type of question on disability. While disability questions were not widely absent from HFPS datasets, only 3% of HFPS datasets included functional difficulty questions that meet international standards. The search engine-powered questionnaire review tool proved to be able to streamline the search process for future research on inclusive data. 
CONCLUSIONS: The dearth of functional difficulty questions, and of the Washington Group Short Set in particular, in HFPSs has contributed to the relative invisibility of persons with disabilities during the pandemic emergency, the lingering effects of which could impede policy-making, monitoring and advocacy on behalf of persons with disabilities.


Subject(s)
COVID-19 , Disabled Persons , SARS-CoV-2 , Humans , COVID-19/epidemiology , Disabled Persons/statistics & numerical data , Surveys and Questionnaires , Pandemics , Telephone
2.
JMIR AI ; 3: e42630, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38875551

ABSTRACT

BACKGROUND: Widespread misinformation in web resources can lead to serious implications for individuals seeking health advice. Despite this, information retrieval models are often focused only on the query-document relevance dimension to rank results. OBJECTIVE: We investigate a multidimensional information quality retrieval model based on deep learning to enhance the effectiveness of online health care information search results. METHODS: In this study, we simulated online health information search scenarios with a topic set of 32 different health-related inquiries and a corpus containing 1 billion web documents from the April 2019 snapshot of Common Crawl. Using state-of-the-art pretrained language models, we assessed the quality of the retrieved documents according to their usefulness, supportiveness, and credibility dimensions for a given search query on 6030 human-annotated query-document pairs. We evaluated this approach using transfer learning and more specific domain adaptation techniques. RESULTS: In the transfer learning setting, the usefulness model provided the largest distinction between help- and harm-compatible documents, with a difference of +5.6%, leading to a majority of helpful documents in the top 10 retrieved. The supportiveness model achieved the best harm compatibility (+2.4%), while the combination of usefulness, supportiveness, and credibility models achieved the largest distinction between help- and harm-compatibility on helpful topics (+16.9%). In the domain adaptation setting, the linear combination of different models showed robust performance, with help-harm compatibility above +4.4% for all dimensions and going as high as +6.8%. CONCLUSIONS: These results suggest that integrating automatic ranking models created for specific information quality dimensions can increase the effectiveness of health-related information retrieval. Thus, our approach could be used to enhance searches made by individuals seeking online health information.
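The abstract above describes linearly combining models trained for separate quality dimensions. A minimal sketch of that fusion idea, with invented document IDs, scores, and weights (the paper's actual trained models and weights are not given here):

```python
# Rerank retrieved documents by a weighted sum of per-dimension quality
# scores (usefulness, supportiveness, credibility). All values below are
# illustrative placeholders, not the authors' model outputs.

def combined_score(scores, weights):
    """Weighted sum of quality-dimension scores for one document."""
    return sum(weights[dim] * s for dim, s in scores.items())

def rerank(docs, weights):
    """Sort documents by descending combined quality score."""
    return sorted(docs, key=lambda d: combined_score(d["scores"], weights),
                  reverse=True)

docs = [
    {"id": "doc_a", "scores": {"usefulness": 0.2, "supportiveness": 0.9, "credibility": 0.5}},
    {"id": "doc_b", "scores": {"usefulness": 0.8, "supportiveness": 0.6, "credibility": 0.7}},
]
weights = {"usefulness": 0.5, "supportiveness": 0.3, "credibility": 0.2}
ranking = [d["id"] for d in rerank(docs, weights)]  # doc_b outranks doc_a
```

The weights would in practice be tuned on held-out annotated query-document pairs.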

3.
Sci Data ; 11(1): 455, 2024 May 04.
Article in English | MEDLINE | ID: mdl-38704422

ABSTRACT

Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20,156 instances, covering over 7,400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.


Subject(s)
Natural Language Processing , Semantics , Language
4.
J Biomed Inform ; 150: 104583, 2024 02.
Article in English | MEDLINE | ID: mdl-38191010

ABSTRACT

OBJECTIVE: The primary objective of our study is to address the challenge of confidentially sharing medical images across different centers. This is often a critical necessity in both clinical and research environments, yet restrictions typically exist due to privacy concerns. Our aim is to design a privacy-preserving data-sharing mechanism that allows medical images to be stored as encoded and obfuscated representations in the public domain without revealing any useful or recoverable content from the images. In tandem, we aim to provide authorized users with compact private keys that could be used to reconstruct the corresponding images. METHOD: Our approach involves utilizing a neural auto-encoder. The convolutional filter outputs are passed through sparsifying transformations to produce multiple compact codes. Each code is responsible for reconstructing different attributes of the image. The key privacy-preserving element in this process is obfuscation through the use of specific pseudo-random noise. When applied to the codes, it becomes computationally infeasible for an attacker to guess the correct representation for all the codes, thereby preserving the privacy of the images. RESULTS: The proposed framework was implemented and evaluated using chest X-ray images for different medical image analysis tasks, including classification, segmentation, and texture analysis. Additionally, we thoroughly assessed the robustness of our method against various attacks using both supervised and unsupervised algorithms. CONCLUSION: This study provides a novel, optimized, and privacy-assured data-sharing mechanism for medical images, enabling multi-party sharing in a secure manner. While we have demonstrated its effectiveness with chest X-ray images, the mechanism can be utilized in other medical image modalities as well.
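The obfuscation step described above (pseudo-random noise keyed by a private seed) can be sketched in a toy form. This is an illustration of the general principle, not the paper's actual network or noise scheme:

```python
import random

# Toy sketch: obfuscate a latent code by adding pseudo-random noise seeded
# by a private key. Only a holder of the key can regenerate the identical
# noise stream and recover the code. Code values and key are invented.

def obfuscate(code, key):
    rng = random.Random(key)                 # key seeds the noise stream
    return [c + rng.gauss(0.0, 1.0) for c in code]

def recover(blob, key):
    rng = random.Random(key)                 # same key -> same noise
    return [b - rng.gauss(0.0, 1.0) for b in blob]

code = [0.3, -1.2, 0.7]
public_blob = obfuscate(code, key=42)        # safe to store publicly
restored = recover(public_blob, key=42)      # matches the original code
```

In the paper the noise is applied to multiple sparse codes, which is what makes exhaustive guessing computationally infeasible for an attacker.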


Subject(s)
Algorithms , Privacy , Information Dissemination
5.
JMIR Res Protoc ; 13: e53138, 2024 Jan 17.
Article in English | MEDLINE | ID: mdl-38231561

ABSTRACT

BACKGROUND: A medical student's career choice directly influences the physician workforce shortage and the misdistribution of resources. First, individual and contextual factors related to career choice have been evaluated separately, but their interaction over time is unclear. Second, actual career choice, reasons for this choice, and the influence of national political strategies are currently unknown in Switzerland. OBJECTIVE: The overall objective of this study is to better understand the process of Swiss medical students' career choice and to predict this choice. Our specific aims will be to examine the predominantly static (ie, sociodemographic and personality traits) and predominantly dynamic (ie, learning context perceptions, anxiety state, motivation, and motives for career choice) variables that predict the career choice of Swiss medical school students, as well as their interaction, and to examine the evolution of Swiss medical students' career choice and their ultimate career path, including an international comparison with French medical students. METHODS: The Swiss Medical Career Choice study is a national, multi-institution, and longitudinal study in which all medical students at all medical schools in Switzerland are eligible to participate. Data will be collected over 4 years for 4 cohorts of medical students using questionnaires in years 4 and 6. We will perform a follow-up during postgraduate training year 2 for medical graduates between 2018 and 2022. We will compare the different Swiss medical schools and a French medical school (the University of Strasbourg Faculty of Medicine). We will also examine the effect of new medical master's programs in terms of career choice and location of practice. 
For aim 2, in collaboration with the Swiss Institute for Medical Education, we will implement a national career choice tracking system and identify the final career choice of 2 cohorts of medical students who graduated from 4 Swiss medical schools from 2010 to 2012. We will also develop a model to predict their final career choice. Data analysis will be conducted using inferential statistics, and machine learning approaches will be used to refine the predictive model. RESULTS: This study was funded by the Swiss National Science Foundation in January 2023. Recruitment began in May 2023. Data analysis will begin after the completion of the first cohort data collection. CONCLUSIONS: Our research will inform national stakeholders and medical schools on the prediction of students' future career choice and on key aspects of physician workforce planning. We will identify targeted actions that may be implemented during medical school and may ultimately influence career choice and encourage the correct number of physicians in the right specialties to fulfill the needs of currently underserved regions. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/53138.

6.
Syst Rev ; 12(1): 94, 2023 06 05.
Article in English | MEDLINE | ID: mdl-37277872

ABSTRACT

BACKGROUND: The COVID-19 pandemic has led to an unprecedented amount of scientific publications, growing at a pace never seen before. Multiple living systematic reviews have been developed to assist professionals with up-to-date and trustworthy health information, but it is increasingly challenging for systematic reviewers to keep up with the evidence in electronic databases. We aimed to investigate deep learning-based machine learning algorithms to classify COVID-19-related publications to help scale up the epidemiological curation process. METHODS: In this retrospective study, five different pre-trained deep learning-based language models were fine-tuned on a dataset of 6365 publications manually classified into two classes, three subclasses, and 22 sub-subclasses relevant for epidemiological triage purposes. In a k-fold cross-validation setting, each standalone model was assessed on a classification task and compared against an ensemble, which takes the standalone model predictions as input and uses different strategies to infer the optimal article class. A ranking task was also considered, in which the model outputs a ranked list of sub-subclasses associated with the article. RESULTS: The ensemble model significantly outperformed the standalone classifiers, achieving an F1-score of 89.2 at the class level of the classification task. The difference between the standalone and ensemble models increases at the sub-subclass level, where the ensemble reaches a micro F1-score of 70% against 67% for the best-performing standalone model. For the ranking task, the ensemble obtained the highest recall@3, with a performance of 89%. Using a unanimity voting rule, the ensemble can provide predictions with higher confidence on a subset of the data, achieving detection of original papers with an F1-score of up to 97% on a subset of 80% of the collection instead of 93% on the whole dataset. 
CONCLUSION: This study shows the potential of using deep learning language models to perform triage of COVID-19 references efficiently and support epidemiological curation and review. The ensemble consistently and significantly outperforms any standalone model. Fine-tuning the voting strategy thresholds is an interesting option for annotating a subset of the collection with higher predictive confidence.
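The voting strategies described above can be sketched in a few lines: a majority vote infers the class from the standalone predictions, and a stricter unanimity rule keeps only articles on which all models agree, trading coverage for confidence. The predictions below are made-up placeholders:

```python
from collections import Counter

def majority_vote(predictions):
    """Most common label among the standalone model predictions."""
    return Counter(predictions).most_common(1)[0][0]

def unanimous_subset(articles):
    """Keep only articles on which every standalone model agrees."""
    return [a for a in articles if len(set(a["preds"])) == 1]

articles = [
    {"id": 1, "preds": ["original", "original", "original"]},  # unanimous
    {"id": 2, "preds": ["original", "review", "original"]},    # disagreement
]
confident = [a["id"] for a in unanimous_subset(articles)]  # only article 1
label = majority_vote(articles[1]["preds"])                # still classifiable
```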


Subject(s)
COVID-19 , Deep Learning , Humans , Pandemics , Retrospective Studies , Language
7.
Patterns (N Y) ; 4(3): 100689, 2023 Mar 10.
Article in English | MEDLINE | ID: mdl-36960445

ABSTRACT

Success rate of clinical trials (CTs) is low, with the protocol design itself being considered a major risk factor. We aimed to investigate the use of deep learning methods to predict the risk of CTs based on their protocols. Considering protocol changes and their final status, a retrospective risk assignment method was proposed to label CTs according to low, medium, and high risk levels. Then, transformer and graph neural networks were designed and combined in an ensemble model to learn to infer the ternary risk categories. The ensemble model achieved robust performance (area under the receiver operating characteristic curve [AUROC] of 0.8453 [95% confidence interval: 0.8409-0.8495]), similar to the individual architectures but significantly outperforming a baseline based on bag-of-words features (0.7548 [0.7493-0.7603] AUROC). We demonstrate the potential of deep learning in predicting the risk of CTs from their protocols, paving the way for customized risk mitigation strategies during protocol design.

8.
J Chem Inf Model ; 63(7): 1914-1924, 2023 04 10.
Article in English | MEDLINE | ID: mdl-36952584

ABSTRACT

The prediction of chemical reaction pathways has been accelerated by the development of novel machine learning architectures based on the deep learning paradigm. In this context, deep neural networks initially designed for language translation have been used to accurately predict a wide range of chemical reactions. Among models suited for the task of language translation, the recently introduced molecular transformer reached impressive performance in terms of forward-synthesis and retrosynthesis predictions. In this study, we first present an analysis of the performance of transformer models for product, reactant, and reagent prediction tasks under different scenarios of data availability and data augmentation. We find that the impact of data augmentation depends on the prediction task and on the metric used to evaluate the model performance. Second, we probe the contribution of different combinations of input formats, tokenization schemes, and embedding strategies to model performance. We find that less stable input settings generally lead to better performance. Lastly, we validate the superiority of round-trip accuracy over simpler evaluation metrics, such as top-k accuracy, using a committee of human experts and show a strong agreement for predictions that pass the round-trip test. This demonstrates the usefulness of more elaborate metrics in complex predictive scenarios and highlights the limitations of direct comparisons to a predefined database, which may include a limited number of chemical reaction pathways.
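The round-trip metric discussed above can be illustrated with toy lookup tables standing in for the forward and retrosynthesis models: a prediction passes the round-trip test if the retrosynthesis model maps the predicted product back to the original reactants. The reaction strings are invented placeholders, not real chemistry:

```python
# Placeholder "models": forward maps reactants to a product, retro maps the
# product back. A round trip succeeds when the recovered reactants match.
forward = {"A.B": "C", "D.E": "F"}
retro = {"C": "A.B", "F": "D.X"}   # second entry is an inconsistent prediction

def round_trip_ok(reactants):
    product = forward.get(reactants)
    return retro.get(product) == reactants

results = {r: round_trip_ok(r) for r in forward}  # {"A.B": True, "D.E": False}
```

Unlike top-k accuracy against a fixed database, this check also credits chemically valid predictions that simply are not in the reference set.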


Subject(s)
Benchmarking , Electric Power Supplies , Humans , Databases, Factual , Machine Learning , Neural Networks, Computer
9.
Health Data Sci ; 3: 0099, 2023.
Article in English | MEDLINE | ID: mdl-38487204

ABSTRACT

Background: While Enterobacteriaceae bacteria are commonly found in the healthy human gut, their colonization of other body parts can potentially evolve into serious infections and health threats. We investigate a graph-based machine learning model to predict risks of inpatient colonization by multidrug-resistant (MDR) Enterobacteriaceae. Methods: Colonization prediction was defined as a binary task, where the goal is to predict whether a patient is colonized by MDR Enterobacteriaceae in an undesirable body part during their hospital stay. To capture topological features, interactions among patients and healthcare workers were modeled using a graph structure, where patients are described by nodes and their interactions are described by edges. Then, a graph neural network (GNN) model was trained to learn colonization patterns from the patient network enriched with clinical and spatiotemporal features. Results: The GNN model achieves performance between 0.91 and 0.96 area under the receiver operating characteristic curve (AUROC) when trained in inductive and transductive settings, respectively, up to 8% above a logistic regression baseline (0.88). Comparing network topologies, the configuration considering ward-related edges (0.91 inductive, 0.96 transductive) outperforms the configurations considering caregiver-related edges (0.88, 0.89) and both types of edges (0.90, 0.94). For the top 3 most prevalent MDR Enterobacteriaceae, the AUROC varies from 0.94 for Citrobacter freundii up to 0.98 for Enterobacter cloacae using the best-performing GNN model. Conclusion: Topological features via graph modeling improve the performance of machine learning models for Enterobacteriaceae colonization prediction. GNNs could be used to support infection prevention and control programs to detect patients at risk of colonization by MDR Enterobacteriaceae and other bacteria families.
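The ward-related network construction described above can be sketched simply: patients become nodes, and an edge links two patients who occupy the same ward on the same day. The stay records below are invented examples, and a real pipeline would enrich these edges with clinical and spatiotemporal features before training the GNN:

```python
from itertools import combinations

stays = [
    ("p1", "ward_A", "2021-03-01"),
    ("p2", "ward_A", "2021-03-01"),   # co-located with p1
    ("p3", "ward_B", "2021-03-01"),
]

def ward_edges(stays):
    """Edges between patients sharing a ward on the same day."""
    by_slot = {}
    for patient, ward, day in stays:
        by_slot.setdefault((ward, day), set()).add(patient)
    edges = set()
    for patients in by_slot.values():
        edges.update(combinations(sorted(patients), 2))
    return edges
```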

10.
Syst Rev ; 11(1): 172, 2022 08 17.
Article in English | MEDLINE | ID: mdl-35978441

ABSTRACT

BACKGROUND: Identifying and removing reference duplicates when conducting systematic reviews (SRs) remain a major, time-consuming issue for authors who manually check for duplicates using built-in features in citation managers. To address issues related to manual deduplication, we developed an automated, efficient, and rapid artificial intelligence-based algorithm named Deduklick. Deduklick combines natural language processing algorithms with a set of rules created by expert information specialists. METHODS: Deduklick's deduplication uses a multistep algorithm of data normalization, calculates a similarity score, and identifies unique and duplicate references based on metadata fields, such as title, authors, journal, DOI, year, issue, volume, and page number range. We measured and compared Deduklick's capacity to accurately detect duplicates with the information specialists' standard, manual duplicate removal process using EndNote on eight existing heterogeneous datasets. Using a sensitivity analysis, we manually cross-compared the efficiency and noise of both methods. DISCUSSION: Deduklick achieved average recall of 99.51%, average precision of 100.00%, and average F1 score of 99.75%. In contrast, the manual deduplication process achieved average recall of 88.65%, average precision of 99.95%, and average F1 score of 91.98%. Deduklick thus matched or exceeded expert-level performance on duplicate removal. It also preserved high metadata quality and drastically reduced time spent on analysis. Deduklick represents an efficient, transparent, ergonomic, and time-saving solution for identifying and removing duplicates in SR searches. Deduklick could therefore simplify SR production and represent important advantages for scientists, including saving time, increasing accuracy, reducing costs, and contributing to quality SRs.
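A hedged sketch of the normalize-then-score pipeline described above: metadata fields are normalized, a title similarity score is computed, and pairs above a threshold are flagged as duplicates. The normalization rules and the 0.95 threshold are assumptions for illustration, not Deduklick's actual parameters:

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and strip punctuation and extra whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def is_duplicate(ref_a, ref_b, threshold=0.95):
    """Flag a pair as duplicate if years match and titles are near-identical."""
    same_year = ref_a["year"] == ref_b["year"]
    title_sim = SequenceMatcher(
        None, normalize(ref_a["title"]), normalize(ref_b["title"])
    ).ratio()
    return same_year and title_sim >= threshold

a = {"title": "Deep Learning for Triage.", "year": 2021}
b = {"title": "deep learning for triage", "year": 2021}
# is_duplicate(a, b) flags this pair despite casing/punctuation differences
```

A production system would combine several such field-level scores (authors, DOI, volume, pages) rather than rely on the title alone.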


Subject(s)
Algorithms , Artificial Intelligence , Systematic Reviews as Topic , Biomedical Research , Humans , Natural Language Processing
11.
Stud Health Technol Inform ; 294: 839-843, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612222

ABSTRACT

The importance of genomic data for health is rapidly growing, but accessing and gathering information about variants from different sources is hindered by highly heterogeneous representations of variants, as outlined by clinical associations (AMP/ASCO/CAP) in their recommendations. To enable a smooth and effective retrieval of variant-containing documents from different resources, we developed a tool (https://goldorak.hesge.ch/synvar/) that generates for any given SNP - including variants not present in existing databases - its corresponding description at the genome, transcript and protein levels. It provides variant descriptions in the HGVS format as well as in many non-standard formats found in the literature, along with database identifiers. We present the SynVar service and evaluate its impact on the recall of a genomic variant curation-support service. Using SynVar to search variants in the literature increases recall by 133.8% without a strong impact on precision (93%).
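Generating textual synonyms for a variant, in the spirit of the service described above, can be sketched as follows. This is an illustrative toy, not SynVar's actual algorithm, and only a few amino-acid codes are included:

```python
# Map one-letter amino-acid codes to three-letter codes (truncated table).
THREE_LETTER = {"V": "Val", "E": "Glu", "K": "Lys"}

def protein_synonyms(ref, pos, alt):
    """Return common textual renderings of a protein substitution."""
    short = f"{ref}{pos}{alt}"                              # e.g. V600E
    long = f"{THREE_LETTER[ref]}{pos}{THREE_LETTER[alt]}"   # e.g. Val600Glu
    return {short, long, f"p.{short}", f"p.{long}"}

synonyms = protein_synonyms("V", 600, "E")
# {"V600E", "Val600Glu", "p.V600E", "p.Val600Glu"}
```

Searching the literature with all such renderings, rather than the canonical HGVS form alone, is what drives the recall gain reported above.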


Subject(s)
Genomics , Databases, Factual
12.
Stud Health Technol Inform ; 294: 876-877, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612233

ABSTRACT

We present an analysis of supplementary materials of PubMed Central (PMC) articles and show their importance in indexing and searching biomedical literature, in particular for the emerging genomic medicine field. On a subset of articles from PubMed Central, we use text mining methods to extract MeSH terms from abstracts, full texts, and text-based supplementary materials. We find that the recall of MeSH annotations increases by about 5.9 percentage points (+20% in relative terms) when considering supplementary materials compared to using only abstracts. We further compare the supplementary material annotations with full-text annotations and find that the recall of MeSH terms increases by 1.5 percentage points (+3% in relative terms). Additionally, we analyze genetic variant mentions in abstracts and full texts and compare them with mentions found in supplementary text-based files. We find that the majority (about 99%) of variants are found in text-based supplementary files. In conclusion, we suggest that supplementary data should receive more attention from the information retrieval community, in particular in life and health sciences.
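The abstract above mixes absolute percentage points with relative percentages; the distinction is easy to get wrong. A small helper makes it explicit. The baseline value below is an assumption chosen only so the two reported figures (+5.9 points, +20% relative) are mutually consistent, not a number taken from the paper:

```python
def recall_gain(baseline, improved):
    """Return (absolute gain in points, relative gain in percent)."""
    points = improved - baseline             # absolute, in percentage points
    relative = 100.0 * points / baseline     # relative, in percent
    return round(points, 1), round(relative, 1)

# Assumed baseline recall of 29.5% rising to 35.4%:
gain = recall_gain(29.5, 35.4)  # (5.9, 20.0): +5.9 points is +20% relative
```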


Subject(s)
Medical Subject Headings , Text Messaging , Data Mining/methods , PubMed , Records
13.
Front Res Metr Anal ; 6: 689803, 2021.
Article in English | MEDLINE | ID: mdl-34870074

ABSTRACT

The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and their ensembles perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.

14.
Front Digit Health ; 3: 745674, 2021.
Article in English | MEDLINE | ID: mdl-34796360

ABSTRACT

The 2019 coronavirus (COVID-19) pandemic revealed the urgent need for the acceleration of vaccine development worldwide. Rapid vaccine development poses numerous risks for each category of vaccine technology. By using the Risklick artificial intelligence (AI), we estimated the risks associated with all types of COVID-19 vaccine during the early phase of vaccine development. We then performed a postmortem analysis of the probability and the impact matrix calculations by comparing the 2020 prognosis to the contemporary situation. We used the Risklick AI to evaluate the risks and their incidence associated with vaccine development in the early stage of the COVID-19 pandemic. Our analysis revealed the diversity of risks among vaccine technologies currently used by pharmaceutical companies providing vaccines. This analysis highlighted the current and future potential pitfalls connected to vaccine production during the COVID-19 pandemic. Hence, the Risklick AI appears to be an essential tool in vaccine development for the treatment of COVID-19, allowing risks to be formally anticipated and increasing overall performance from the production to the distribution of the vaccines. The Risklick AI could therefore be extended to other fields of research and development and represent a novel opportunity in the calculation of production-associated risks.

15.
J Med Internet Res ; 23(9): e30161, 2021 09 17.
Article in English | MEDLINE | ID: mdl-34375298

ABSTRACT

BACKGROUND: The COVID-19 global health crisis has led to an exponential surge in published scientific literature. In an attempt to tackle the pandemic, extremely large COVID-19-related corpora are being created, sometimes with inaccurate information, at a scale no longer amenable to human analysis. OBJECTIVE: In the context of searching for scientific evidence in the deluge of COVID-19-related literature, we present an information retrieval methodology for effective identification of relevant sources to answer biomedical queries posed using natural language. METHODS: Our multistage retrieval methodology combines probabilistic weighting models and reranking algorithms based on deep neural architectures to boost the ranking of relevant documents. Similarity of COVID-19 queries is compared to documents, and a series of postprocessing methods is applied to the initial ranking list to improve the match between the query and the biomedical information source and boost the position of relevant documents. RESULTS: The methodology was evaluated in the context of the TREC-COVID challenge, achieving competitive results with the top-ranking teams participating in the competition. Particularly, the combination of bag-of-words and deep neural language models significantly outperformed an Okapi Best Match 25-based baseline, retrieving, on average, 83% of relevant documents in the top 20. CONCLUSIONS: These results indicate that multistage retrieval supported by deep learning could enhance identification of literature for COVID-19-related questions posed using natural language.
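The multistage idea above (a fast lexical first stage followed by neural reranking, with the two scores fused) can be sketched with stand-in scoring functions. The term-overlap scorer and the hand-set neural scores below are placeholders, not the actual BM25 or transformer models used in the study:

```python
def first_stage(query, corpus, k=3):
    """Toy lexical ranker: term-overlap count between query and document."""
    scored = [(sum(t in doc.split() for t in query.split()), doc_id)
              for doc_id, doc in corpus.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

def rerank(doc_ids, neural_score, alpha=0.7):
    """Fuse a (placeholder) neural relevance score with the first-stage rank."""
    prior = {d: 1.0 / (rank + 1) for rank, d in enumerate(doc_ids)}
    fused = {d: alpha * neural_score[d] + (1 - alpha) * prior[d] for d in doc_ids}
    return sorted(fused, key=fused.get, reverse=True)

corpus = {"d1": "covid vaccine trial",
          "d2": "covid transmission masks",
          "d3": "flu season"}
candidates = first_stage("covid vaccine", corpus)
final = rerank(candidates, {"d1": 0.9, "d2": 0.4, "d3": 0.1})
```

Running the expensive neural model only on the first-stage candidates is what keeps this architecture tractable at corpus scale.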


Subject(s)
COVID-19 , Algorithms , Humans , Information Storage and Retrieval , Language , SARS-CoV-2
16.
Pharmacology ; 106(5-6): 244-253, 2021.
Article in English | MEDLINE | ID: mdl-33910199

ABSTRACT

INTRODUCTION: The SARS-CoV-2 pandemic has led to one of the most critical and boundless waves of publications in the history of modern science. The necessity to find and pursue relevant information and quantify its quality is broadly acknowledged. Modern information retrieval techniques combined with artificial intelligence (AI) appear as one of the key strategies for COVID-19 living evidence management. Nevertheless, most AI projects that retrieve COVID-19 literature still require manual tasks. METHODS: In this context, we present a novel, automated search platform, called Risklick AI, which aims to automatically gather COVID-19 scientific evidence and enables scientists, policy makers, and healthcare professionals to find the most relevant information tailored to their question of interest in real time. RESULTS: Here, we compare the capacity of Risklick AI to find COVID-19-related clinical trials and scientific publications in comparison with clinicaltrials.gov and PubMed in the field of pharmacology and clinical intervention. DISCUSSION: The results demonstrate that Risklick AI is able to find COVID-19 references more effectively, both in terms of precision and recall, compared to the baseline platforms. Hence, Risklick AI could become a useful alternative assistant to scientists fighting the COVID-19 pandemic.


Subject(s)
Artificial Intelligence/trends , COVID-19/therapy , Data Interpretation, Statistical , Drug Development/trends , Evidence-Based Medicine/trends , Pharmacology/trends , Artificial Intelligence/statistics & numerical data , COVID-19/diagnosis , COVID-19/epidemiology , Clinical Trials as Topic/statistics & numerical data , Drug Development/statistics & numerical data , Evidence-Based Medicine/statistics & numerical data , Humans , Pharmacology/statistics & numerical data , Registries
17.
Database (Oxford) ; 20202020 01 01.
Article in English | MEDLINE | ID: mdl-32367111

ABSTRACT

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.


Subject(s)
Deep Learning , Databases, Protein , Knowledge Bases , Molecular Sequence Annotation , Proteins/genetics
18.
PLoS One ; 13(1): e0190028, 2018.
Article in English | MEDLINE | ID: mdl-29293556

ABSTRACT

The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high-complexity procedure information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating insert throughput and query latency performance of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.


Subject(s)
Benchmarking , Database Management Systems , Datasets as Topic , Electronic Health Records , Brazil
20.
PLoS One ; 11(3): e0150069, 2016.
Article in English | MEDLINE | ID: mdl-26958859

ABSTRACT

This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query, and the indexing time increases with the size of the datasets. The clusters with 2, 4, 8 and 12 nodes performed no better than the single-node cluster in terms of query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. 
Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest.


Subject(s)
Database Management Systems , Electronic Health Records , Programming Languages , Databases as Topic , Humans , Multilevel Analysis , Search Engine , Time Factors