ABSTRACT
BACKGROUND: Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility as well as quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research.
OBJECTIVE: The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities as well as the design of the provenance technologies used; and identifying gaps in the literature, which could provide opportunities for future research on technologies that could receive more widespread adoption.
METHODS: Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along the following five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures.
RESULTS: We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. An important gap that we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV.
CONCLUSIONS: The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
Subject(s)
Biomedical Research, Humans, Metadata, PubMed, Reproducibility of Results, Software
ABSTRACT
The identification of vulnerable records (targets) is an important step for many privacy attacks on protected health data. We implemented and evaluated three outlier metrics for detecting potential targets. Next, we assessed differences and similarities between the top-k targets suggested by the different methods and studied how susceptible those targets are to membership inference attacks on synthetic data. Our results suggest that there is no one-size-fits-all approach and that target selection methods should be chosen based on the type of attack that is to be performed.
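The abstract does not name the three outlier metrics, so the following is only an illustrative sketch of the general idea: score each record with an outlier metric, take the top-k records as candidate targets, and compare the target sets proposed by different metrics. The two metrics used here (mean distance to the k nearest neighbours and distance to the centroid), the Jaccard overlap, and the toy records are all assumptions chosen for illustration, not the authors' method or data.

```python
import math


def knn_outlier_scores(records, k=2):
    """Score each record by its mean Euclidean distance to its k nearest neighbours.

    Hypothetical metric for illustration; larger scores mean more isolated records.
    """
    scores = []
    for i, record in enumerate(records):
        dists = sorted(
            math.dist(record, other)
            for j, other in enumerate(records)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


def centroid_outlier_scores(records):
    """Score each record by its Euclidean distance to the data set centroid.

    Second hypothetical metric, used only to have something to compare against.
    """
    n = len(records)
    dim = len(records[0])
    centroid = [sum(r[d] for r in records) / n for d in range(dim)]
    return [math.dist(record, centroid) for record in records]


def top_k(scores, k):
    """Return the indices of the k highest-scoring records as a set."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity between two sets of target indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy numeric records (assumed data): a tight cluster plus two isolated points.
records = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (-4.0, 6.0),
]

targets_knn = top_k(knn_outlier_scores(records, k=2), 2)
targets_centroid = top_k(centroid_outlier_scores(records), 2)
overlap = jaccard(targets_knn, targets_centroid)
```

On real health data the two target sets would not necessarily coincide, which is the kind of difference between selection methods that the abstract reports assessing.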
Subject(s)
Computer Security, Confidentiality, Electronic Health Records, Humans
ABSTRACT
Sharing biomedical data for research can help to improve disease understanding and support the development of preventive, diagnostic, and therapeutic methods. However, it is vital to balance the amount of data shared and the sharing mechanism chosen with the privacy protection provided. This requires a detailed understanding of potential adversaries who might attempt to re-identify data and the consequences of their actions. The aim of this paper is to present a comprehensive list of potential types of adversaries, motivations, and harms to targeted individuals. A group of 13 researchers performed a three-step process in a one-day workshop, involving the identification of adversaries, their categorization by motivation, and the deduction of potential harms. The group collected 28 suggestions and categorized them into six types, each associated with several of six distinct harms. The findings align with previous efforts in structuring threat actors and outcomes, and we believe that they provide a robust foundation for evaluating re-identification risks and developing protection measures in health data sharing scenarios.