Búsqueda | Portal Regional de la BVS

1.

Adding data provenance support to Apache Spark.

Interlandi, Matteo; Ekmekji, Ari; Shah, Kshitij; Gulzar, Muhammad Ali; Tetali, Sai Deep; Kim, Miryung; Millstein, Todd; Condie, Tyson.

VLDB J ; 27(5): 595-615, 2018 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-31007500

RESUMEN

Debugging data processing logic in data-intensive scalable computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result, programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance-tracking data through transformations-in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds-orders of magnitude faster than alternative solutions-while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

2.

mCerebrum: A Mobile Sensing Software Platform for Development and Validation of Digital Biomarkers and Interventions.

Hossain, Syed Monowar; Hnat, Timothy; Saleheen, Nazir; Nasrin, Nusrat Jahan; Noor, Joseph; Ho, Bo-Jhang; Condie, Tyson; Srivastava, Mani; Kumar, Santosh.

Proc Int Conf Embed Netw Sens Syst ; 20172017 Nov.

Artículo en Inglés | MEDLINE | ID: mdl-30288504

RESUMEN

The development and validation studies of new multisensory biomarkers and sensor-triggered interventions requires collecting raw sensor data with associated labels in the natural field environment. Unlike platforms for traditional mHealth apps, a software platform for such studies needs to not only support high-rate data ingestion, but also share raw high-rate sensor data with researchers, while supporting high-rate sense-analyze-act functionality in real-time. We present mCerebrum, a realization of such a platform, which supports high-rate data collections from multiple sensors with realtime assessment of data quality. A scalable storage architecture (with near optimal performance) ensures quick response despite rapidly growing data volume. Micro-batching and efficient sharing of data among multiple source and sink apps allows reuse of computations to enable real-time computation of multiple biomarkers without saturating the CPU or memory. Finally, it has a reconfigurable scheduler which manages all prompts to participants that is burden- and context-aware. With a modular design currently spanning 23+ apps, mCerebrum provides a comprehensive ecosystem of system services and utility apps. The design of mCerebrum has evolved during its concurrent use in scientific field studies at ten sites spanning 106,806 person days. Evaluations show that compared with other platforms, mCerebrum's architecture and design choices support 1.5 times higher data rates and 4.3 times higher storage throughput, while causing 8.4 times lower CPU usage. CCS CONCEPTS: â¢ Human-centered computing â Ubiquitous and mobile computing; Ubiquitous and mobile computing systems and tools; â¢ Computer systems organization â Embedded and cyber-physical systems. ACM REFERENCE FORMAT: Syed Monowar Hossain, Timothy Hnat, Nazir Saleheen, Nusrat Jahan Nasrin, Joseph Noor, Bo-Jhang Ho, Tyson Condie, Mani Srivastava, and Santosh Kumar. 2017. mCerebrum: A Mobile Sensing Software Platform for Development and Validation of Digital Biomarkers and Interventions. In Proceedings of SenSys '17, Delft, Netherlands, November 6-8, 2017, 14 pages.

3.

Automated Debugging in Data-Intensive Scalable Computing.

Gulzar, Muhammad Ali; Interlandi, Matteo; Han, Xueyuan; Li, Mingda; Condie, Tyson; Kim, Miryung.

Proc ACM Symp Cloud Comput ; 2017: 520-534, 2017 Sep.

Artículo en Inglés | MEDLINE | ID: mdl-31008457

RESUMEN

Developing Big Data Analytics workloads often involves trial and error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crash, outlier results, etc.) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BigSift is a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BigSift redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BigSift improves the accuracy of fault localizability by several orders-of-magnitude (~ 103 to 107×) compared to Titian data provenance, and improves performance by up to 66× compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BigSift is able to localize fault-inducing data within 62% of the original job running time.

4.

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark.

Gulzar, Muhammad Ali; Interlandi, Matteo; Yoo, Seunghyun; Tetali, Sai Deep; Condie, Tyson; Millstein, Todd; Kim, Miryung.

Proc Int Conf Softw Eng ; 2016: 784-795, 2016 May.

Artículo en Inglés | MEDLINE | ID: mdl-27390389

RESUMEN

Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today's data-centers is time consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform. This requires re-thinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time consuming for an end user. First, BIGDEBUG's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BIGDEBUG scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BIGDEBUG supports debugging at interactive speeds with minimal performance impact.

5.

Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale.

Huang, Muhuan; Wu, Di; Yu, Cody Hao; Fang, Zhenman; Interlandi, Matteo; Condie, Tyson; Cong, Jason.

Proc ACM Symp Cloud Comput ; 2016: 456-469, 2016 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-28317049

RESUMEN

With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's 16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustain future datacenter growth. However, it is quite challenging for existing big data computing systems-like Apache Spark and Hadoop-to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for enabling easy and efficient deployments of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming efforts to access FPGA accelerators in systems like Apache Spark and YARN, and improves the system throughput by 1.7 × to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.

6.

Big Data Analytics with Datalog Queries on Spark.

Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo.

Proc ACM SIGMOD Int Conf Manag Data ; 2016: 1135-1149, 2016.

Artículo en Inglés | MEDLINE | ID: mdl-28626296

RESUMEN

There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.

7.

Optimizing Interactive Development of Data-Intensive Applications.

Interlandi, Matteo; Tetali, Sai Deep; Gulzar, Muhammad Ali; Noor, Joseph; Condie, Tyson; Kim, Miryung; Millstein, Todd.

Proc ACM Symp Cloud Comput ; 2016: 510-522, 2016 Oct.

Artículo en Inglés | MEDLINE | ID: mdl-28405637

RESUMEN

Modern Data-Intensive Scalable Computing (DISC) systems are designed to process data through batch jobs that execute programs (e.g., queries) compiled from a high-level language. These programs are often developed interactively by posing ad-hoc queries over the base data until a desired result is generated. We observe that there can be significant overlap in the structure of these queries used to derive the final program. Yet, each successive execution of a slightly modified query is performed anew, which can significantly increase the development cycle. Vega is an Apache Spark framework that we have implemented for optimizing a series of similar Spark programs, likely originating from a development or exploratory data analysis session. Spark developers (e.g., data scientists) can leverage Vega to significantly reduce the amount of time it takes to re-execute a modified Spark program, reducing the overall time to market for their Big Data applications.

8.

Center of excellence for mobile sensor data-to-knowledge (MD2K).

Kumar, Santosh; Abowd, Gregory D; Abraham, William T; al'Absi, Mustafa; Beck, J Gayle; Chau, Duen Horng; Condie, Tyson; Conroy, David E; Ertin, Emre; Estrin, Deborah; Ganesan, Deepak; Lam, Cho; Marlin, Benjamin; Marsh, Clay B; Murphy, Susan A; Nahum-Shani, Inbal; Patrick, Kevin; Rehg, James M; Sharmin, Moushumi; Shetty, Vivek; Sim, Ida; Spring, Bonnie; Srivastava, Mani; Wetter, David W.

J Am Med Inform Assoc ; 22(6): 1137-42, 2015 Nov.

Artículo en Inglés | MEDLINE | ID: mdl-26555017

RESUMEN

Mobile sensor data-to-knowledge (MD2K) was chosen as one of 11 Big Data Centers of Excellence by the National Institutes of Health, as part of its Big Data-to-Knowledge initiative. MD2K is developing innovative tools to streamline the collection, integration, management, visualization, analysis, and interpretation of health data generated by mobile and wearable sensors. The goal of the big data solutions being developed by MD2K is to reliably quantify physical, biological, behavioral, social, and environmental factors that contribute to health and disease risk. The research conducted by MD2K is targeted at improving health through early detection of adverse health events and by facilitating prevention. MD2K will make its tools, software, and training materials widely available and will also organize workshops and seminars to encourage their use by researchers and clinicians.

Asunto(s)

Investigación Biomédica/instrumentación , Conjuntos de Datos como Asunto , Telemedicina/instrumentación , Telemetría , Sistemas de Información Geográfica/instrumentación , Humanos , National Institutes of Health (U.S.) , Estados Unidos

9.

REEF: Retainable Evaluator Execution Framework.

Weimer, Markus; Chen, Yingda; Chun, Byung-Gon; Condie, Tyson; Curino, Carlo; Douglas, Chris; Lee, Yunseong; Majestro, Tony; Malkhi, Dahlia; Matusevych, Sergiy; Myers, Brandon; Narayanamurthy, Shravan; Ramakrishnan, Raghu; Rao, Sriram; Sears, Russell; Sezgin, Beysim; Wang, Julia.

Proc ACM SIGMOD Int Conf Manag Data ; 2015: 1343-1355, 2015.

Artículo en Inglés | MEDLINE | ID: mdl-26819493

RESUMEN

Resource Managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault-tolerance, task scheduling and coordination) and re-implement common mechanisms (e.g., caching, bulk-data transfers). This paper presents REEF, a development framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource re-use for data caching, and state management abstractions that greatly ease the development of elastic data processing work-flows on cloud platforms that support a Resource Manager service. REEF is being used to develop several commercial offerings such as the Azure Stream Analytics service. Furthermore, we demonstrate REEF development of a distributed shell application, a machine learning algorithm, and a port of the CORFU [4] system. REEF is also currently an Apache Incubator project that has attracted contributors from several instititutions.

10.

Titian: Data Provenance Support in Spark.

Interlandi, Matteo; Shah, Kshitij; Tetali, Sai Deep; Gulzar, Muhammad Ali; Yoo, Seunghyun; Kim, Miryung; Millstein, Todd; Condie, Tyson.

Proceedings VLDB Endowment ; 9(3): 216-227, 2015 Nov.

Artículo en Inglés | MEDLINE | ID: mdl-26726305

RESUMEN

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data provenance-tracking data through transformations-in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds-orders-of-magnitude faster than alternative solutions-while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA