ABSTRACT
Scientific data analysis today often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to tuning and optimizing the performance of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is reducing the time needed to develop, adapt, and maintain DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 "FONDA -- Foundations of Workflows for Large-Scale Scientific Data Analysis", in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA's internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the "making of" a CRC in Computer Science with strong interdisciplinary components, with the aim of fostering similar endeavors.
ABSTRACT
The cryptographic method Secure Multi-Party Computation (SMPC) could facilitate data sharing between health institutions by making it possible to perform analyses on a "virtual data pool", which provides an integrated view of data that is actually distributed, without any participant having to disclose private data. One drawback of SMPC is that a specific cryptographic protocol has to be developed for every type of analysis that is to be performed. Moreover, these protocols have to be optimized to achieve acceptable execution times. As a first step towards a library of efficient implementations of common methods in health data sciences, we present a novel protocol for efficient time-to-event analysis. Our implementation is based on garbled circuits, a common SMPC technique, and was built with a widely used SMPC programming framework. We further describe optimizations that we have developed to reduce the execution times of our protocol. We experimentally evaluated our solution by computing Kaplan-Meier estimators over a vertically distributed dataset while measuring performance. By comparing the SMPC results with a conventional analysis on pooled data, we show that our approach is practical and scalable.
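For orientation, the conventional (non-secure) side of that comparison can be sketched in a few lines. The following is a minimal plaintext Kaplan-Meier estimator over pooled data, not the SMPC protocol itself; the function name `kaplan_meier` and the input layout (one list of observation times, one list of event indicators) are assumptions made for illustration only. In a vertically distributed setting, such attributes of the same individuals would be held by different parties, which is why a plaintext computation like this one requires pooling.

```python
from collections import Counter
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Plaintext Kaplan-Meier estimator (illustrative sketch, hypothetical name).

    times[i]  -- observation time of individual i
    events[i] -- True if the event occurred at times[i], False if censored

    Returns (t, S(t)) for every distinct time at which at least one event
    occurred, in ascending order of t.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    # Number of events observed at each distinct time.
    event_counts = Counter(t for t, e in zip(times, events) if e)
    # Number of individuals leaving the risk set at each distinct time
    # (either through the event or through censoring).
    leaving_counts = Counter(times)

    at_risk = len(times)  # everyone is at risk before the first time point
    survival = 1.0
    curve = []
    for t in sorted(leaving_counts):
        d = event_counts.get(t, 0)
        if d > 0:
            # Product-limit step: S(t) = S(t-) * (1 - d_t / n_t)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving_counts[t]
    return curve


if __name__ == "__main__":
    # Toy data, invented for illustration: five individuals, two censored.
    toy_times = [1, 2, 2, 3, 4]
    toy_events = [True, True, False, True, False]
    for t, s in kaplan_meier(toy_times, toy_events):
        print(f"t={t}: S(t)={s:.3f}")
```

A secure protocol has to produce the same curve while keeping the per-party inputs hidden, so a plaintext routine of this kind serves as the reference against which SMPC results can be checked.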