Enabling ad-hoc reuse of private data repositories through schema extraction.

Gleim, Lars Christoph; Karim, Md Rezaul; Zimmermann, Lukas; Kohlbacher, Oliver; Stenzhorn, Holger; Decker, Stefan; Beyan, Oya

Affiliation

Gleim LC; Informatik 5, RWTH Aachen University, Ahornstr. 55, Aachen, 52062, Germany. gleim@cs.rwth-aachen.de.
Karim MR; Informatik 5, RWTH Aachen University, Ahornstr. 55, Aachen, 52062, Germany.
Zimmermann L; Fraunhofer FIT, Schloss Birlinghoven, Sankt Augustin, 53754, Germany.
Kohlbacher O; Institute for Translational Bioinformatics, University Hospital Tübingen, Sand 14, Tübingen, 72076, Germany.
Stenzhorn H; Institute for Translational Bioinformatics, University Hospital Tübingen, Sand 14, Tübingen, 72076, Germany.
Decker S; Applied Bioinformatics, Department of Computer Science, University of Tübingen, Sand 14, Tübingen, 72076, Germany.
Beyan O; Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, Tübingen, 72076, Germany.

J Biomed Semantics; 11(1): 6, 2020 07 08.

Article in English | MEDLINE | ID: mdl-32641124

ABSTRACT

BACKGROUND:

Sharing sensitive data across organizational boundaries is often significantly limited by legal and ethical restrictions. Regulations such as the EU General Data Protection Regulation (GDPR) impose strict requirements concerning the protection of personal and privacy-sensitive data. Therefore, new approaches, such as the Personal Health Train initiative, are emerging to utilize data directly in their original repositories, circumventing the need to transfer data.

RESULTS:

To circumvent limitations of previous systems, this paper proposes a configurable and automated schema extraction and publishing approach, which enables ad-hoc SPARQL query formulation against RDF triple stores without requiring direct access to the private data. The approach is compatible with existing Semantic Web-based technologies and allows for the subsequent execution of such queries in a safe setting under the data provider's control. Evaluation with four distinct datasets shows that the approach derives a configurable amount of concise, task-relevant schema information that closely describes the structure of the underlying data and enables schema-introspection-assisted authoring of SPARQL queries.
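The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the triple representation, the `Literal` wrapper, the `ex:` vocabulary, and the functions `extract_schema` and `draft_query` are all hypothetical names introduced here for illustration. The sketch summarizes instance data into class-level patterns (subject class, predicate, object class) that could be published without exposing individual records, and then drafts a SPARQL query from that summary alone.

```python
from collections import defaultdict
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical wrapper marking an object as a literal value, not an IRI."""
    value: object


RDF_TYPE = "rdf:type"


def extract_schema(triples):
    """Summarize instance triples as (subject class, predicate, object class)."""
    # First pass: record the declared classes of every typed resource.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    # Second pass: lift each non-type triple to the class level,
    # discarding the instance identifiers and literal values.
    schema = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        if isinstance(o, Literal):
            object_classes = {"Literal"}
        else:
            object_classes = types.get(o, {"Untyped"})
        for s_cls in types.get(s, {"Untyped"}):
            for o_cls in object_classes:
                schema.add((s_cls, p, o_cls))
    return sorted(schema)


def draft_query(schema, subject_class):
    """Draft a SPARQL query for one class using only the published schema."""
    lines = [
        "PREFIX ex: <http://example.org/>",
        "SELECT * WHERE {",
        f"  ?s a {subject_class} .",
    ]
    patterns = [t for t in schema if t[0] == subject_class]
    for i, (_cls, pred, _obj) in enumerate(patterns):
        lines.append(f"  OPTIONAL {{ ?s {pred} ?o{i} . }}")
    lines.append("}")
    return "\n".join(lines)


# Made-up private data that stays with the data provider.
private_triples = [
    ("ex:patient1", RDF_TYPE, "ex:Patient"),
    ("ex:patient1", "ex:hasDiagnosis", "ex:diag1"),
    ("ex:patient1", "ex:birthYear", Literal(1970)),
    ("ex:diag1", RDF_TYPE, "ex:Diagnosis"),
    ("ex:diag1", "ex:code", Literal("C50.9")),
    ("ex:patient2", RDF_TYPE, "ex:Patient"),
    ("ex:patient2", "ex:birthYear", Literal(1985)),
]

schema = extract_schema(private_triples)
query = draft_query(schema, "ex:Patient")
```

In this sketch only `schema` would leave the provider's environment. A query author could write `query` against it and hand the query back for execution under the provider's control, which mirrors the workflow described above in a highly simplified form.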

CONCLUSIONS:

Automatically extracting and publishing data schema can enable the introspection-assisted creation of data selection and integration queries. In conjunction with the presented system architecture, this approach can enable reuse of data from private repositories and in settings where agreeing upon a shared schema and encoding a priori is infeasible. As such, it could provide an important step towards reuse of data from previously inaccessible sources and thus towards the proliferation of data-driven methods in the biomedical domain.

Subject(s)

Information Storage and Retrieval; Privacy; Computer Security/legislation & jurisprudence; Feasibility Studies; Internet

Keywords

Data access; Distributed systems; FAIR data; Linked data; Personal health train; Privacy; Query design; RDF; SPARQL; Schema extraction; Semantic web

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Information Storage and Retrieval / Privacy | Language: En | Journal: J Biomed Semantics | Year: 2020 | Document type: Article | Country of affiliation: Germany
