Fast Data Platforms

Resource Allocation & Rebalancing for Distributed Stream Processing

Anshu Shukla

Motivation

Stream processing frameworks such as Apache Storm and Apache Spark Streaming are widely used to process real-time data. These distributed stream processors tackle the Velocity dimension of Big Data, which refers to the need to rapidly process data that arrives continuously as streams of messages or events. Distributed Stream Processing Systems (DSPS) are distributed programming and runtime platforms that allow users to define a composition of dataflow logic that is executed on distributed resources over streams of incoming messages.
A DSPS uses commodity clusters and Cloud Virtual Machines (VMs) for its execution. To meet the performance required by these applications, the DSPS needs to schedule these dataflows efficiently over the resources. Despite their growing use, resource scheduling for DSPS tends to be done in an ad hoc manner, favoring empirical and reactive approaches over model-driven and analytical ones. Such empirical strategies may arrive at an approximate schedule for the dataflow that needs further tuning to meet the quality of service.

Contributions

We propose a model-based scheduling approach that uses performance profiles and benchmarks developed for tasks in the dataflow to plan both the resource allocation and the resource mapping, which together form the schedule planning process. We propose the Model Based Allocation (MBA) and the Slot Aware Mapping (SAM) approaches, which use knowledge of the performance model of logic tasks to provide efficient and predictable scheduling behavior. We implement and validate these algorithms using the popular open source Apache Storm DSPS for several micro and application dataflows. We also see that our strategies behave predictably: the expected and actual rates supported, and the resources used, match closely. This can enable deterministic schedule planning even under dynamic conditions.
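
As a rough illustration of how per-task performance profiles can drive allocation, the sketch below sizes a linear pipeline. It is not the MBA algorithm itself: it assumes each task's profile reduces to a single peak rate that one thread can sustain, plus a selectivity (output messages per input message). The function name, task names and numbers are all hypothetical.

```python
import math


def allocate_threads(tasks, input_rate):
    """Estimate the thread count per task for a linear pipeline.

    tasks: list of (name, peak_rate_per_thread, selectivity) in pipeline order.
    input_rate: messages/second arriving at the first task.
    Assumption: throughput scales linearly with the number of threads.
    """
    plan = {}
    rate = input_rate
    for name, peak_rate, selectivity in tasks:
        # Enough threads so that their combined peak rate covers the input rate.
        plan[name] = math.ceil(rate / peak_rate)
        # The downstream task sees this task's output rate.
        rate *= selectivity
    return plan


# Hypothetical profiles: (task, peak msgs/sec per thread, selectivity).
pipeline = [("parse", 500.0, 1.0), ("filter", 1000.0, 0.5), ("aggregate", 100.0, 1.0)]
plan = allocate_threads(pipeline, input_rate=1200.0)
print(plan)
```

A real profile is unlikely to scale linearly with thread count, which is one reason a measured performance model is preferable to a single peak-rate figure.
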
Besides this static scheduling, we also examine the ability to dynamically consolidate tasks onto fewer VMs when the load on the dataflow decreases or the VMs get fragmented. We propose reliable task migration models for Apache Storm dataflows that are able to rapidly move the task assignment in the cluster, and resume the dataflow execution without any message loss.
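
To make the consolidation idea concrete, the following sketch repacks tasks onto as few VMs as a simple heuristic allows. It uses generic first-fit decreasing bin packing, which is an assumption chosen for illustration and not the migration approach described above. The task names, CPU demands (in percent of one VM) and VM capacity are hypothetical.

```python
def consolidate(task_cpu, vm_capacity):
    """Pack tasks onto VMs of equal capacity using first-fit decreasing.

    task_cpu: {task_name: cpu_demand}, in the same units as vm_capacity.
    Returns a list of VMs, each a list of the task names placed on it.
    """
    vms = []  # each entry: {"free": remaining capacity, "tasks": [names]}
    # Place the largest demands first; break ties by name for determinism.
    for name, demand in sorted(task_cpu.items(), key=lambda kv: (-kv[1], kv[0])):
        if demand > vm_capacity:
            raise ValueError(f"task {name!r} does not fit on a single VM")
        for vm in vms:
            if vm["free"] >= demand:
                vm["free"] -= demand
                vm["tasks"].append(name)
                break
        else:
            # No existing VM has room, so open a new one.
            vms.append({"free": vm_capacity - demand, "tasks": [name]})
    return [vm["tasks"] for vm in vms]


# Hypothetical demands after the load has dropped.
demands = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 20}
placement = consolidate(demands, vm_capacity=100)
print(placement)
```

The packing only decides where tasks should end up. Moving them without losing in-flight messages is the separate problem that the migration models address.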

Software

Publications

  • Anshu Shukla, Tarun Sharma and Yogesh Simmhan, “Characterizing Distributed Stream Processing Systems for IoT Applications (Extended Abstract),” in Workshop on Architectural Support and Middleware for InfoSymbiotics/Dynamic Data Driven Applications Systems (DDDAS), co-located with High Performance Computing Conference (HiPC), IEEE, 2015. doi:10.1109/HiPCW.2015.22
  • Anshu Shukla and Yogesh Simmhan, “Model-driven Scheduling for Distributed Stream Processing Systems,” Technical report, arXiv, 2017. CoRR abs/1702.01785

Dataflow Reuse for Distributed Stream Processing

Shilpa Chaturvedi, Sahil Tyagi

Motivation

IoT deployments comprising sensors and actuators that collect observational data provide continuous streams of data, often called Streaming or Fast Data. Smart Cities use such IoT technologies to provide effective citizen services and to improve the efficiency of the utility infrastructure. A key aspect of such Smart City applications is the availability of streaming data from possibly millions of sensors and, more importantly, the need to analyze and process them in near real-time to make decisions or provide services. Distributed Stream Processing Systems (DSPS) are data platforms tailored to compose dataflows that execute constantly over one or more data streams. Such platforms are commonly used to compose IoT and Smart City applications that are hosted on the Cloud and access sensor streams that have been pulled from the edge into the data center.

Such cloud-hosted DSPS provide a scalable analytics engine for composing and executing these continuous dataflows, collocated with the data streams. At the same time, the numerous dataflows that operate on these shared streams can duplicate tasks, since each may perform similar data pre-processing (parsing, reformatting, unit conversion), quality checks (cleaning, outlier detection, interpolation), and even analytics (ARIMA time-series predictions, moving window averages). This offers the opportunity to reuse parts of the logic among different dataflows to avoid recomputation, thereby reducing both the cost of Cloud resources for app developers and end-users and the deployment time.

Contributions

We formally define the problem of streaming dataflow reuse, including the equivalence between tasks present in dataflows. We explore reusing the stream generated by a sub-DAG in other dataflows that have an identical sub-DAG as a prefix. We propose algorithms for merging a submitted streaming dataflow with deployed dataflows at specific points of equivalence and, similarly, for unmerging a merged dataflow when it is removed, while guaranteeing the consistency of their output streams. We implement our reuse algorithms in Apache Storm, and validate them for real and synthetic Smart Utility applications and public OPMW workflows.
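
One simplified way to find points of equivalence is sketched below. It is not the published merge algorithm. It assumes that two tasks are equivalent when they run the same logic and their upstream tasks are recursively equivalent, so a match implies an identical prefix sub-DAG. The dictionary-based dataflow representation and all task names are hypothetical.

```python
def signatures(dataflow):
    """Compute a structural signature for every task in a dataflow.

    dataflow: {task_id: (logic, [upstream_task_ids])}, assumed acyclic.
    A signature pairs the task's logic with the sorted signatures of its
    upstream tasks, so equal signatures mean equal upstream sub-DAGs.
    """
    memo = {}

    def sig(task_id):
        if task_id not in memo:
            logic, parents = dataflow[task_id]
            memo[task_id] = (logic, tuple(sorted(sig(p) for p in parents)))
        return memo[task_id]

    for task_id in dataflow:
        sig(task_id)
    return memo


def reusable_tasks(deployed, submitted):
    """Map each submitted task to an equivalent deployed task, if one exists."""
    by_signature = {s: t for t, s in signatures(deployed).items()}
    return {
        t: by_signature[s]
        for t, s in signatures(submitted).items()
        if s in by_signature
    }


deployed = {
    "s": ("source:meters", []),
    "p": ("parse", ["s"]),
    "avg": ("moving_avg", ["p"]),
}
submitted = {
    "src": ("source:meters", []),
    "parse": ("parse", ["src"]),
    "pred": ("arima", ["parse"]),
}
matches = reusable_tasks(deployed, submitted)
print(matches)
```

In this example the submitted dataflow could subscribe to the output of the deployed `p` task and deploy only `pred`. Comparing logic by name is a stand-in for a proper task equivalence test, which would also have to consider configuration and state.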

Currently, we are also exploring the reuse of a task as a service among the different dataflows deployed in the system. This will allow reuse at the granularity of individual tasks. The key challenge here is that, since tasks can have state, we need to make sure that the state of each dataflow is not affected by the state of other dataflows reusing the same task. We also want to explore how provenance can play a role in tracing the consumption of resources among dataflows. This will help in fairly billing the resources that are reused by dataflows from different users. It can also help in validating whether the output generated by dataflows with reuse is exactly the same as the output generated when they run individually, without reuse.

Another potential benefit of reuse is in resilient processing of the dataflows, in terms of both reliability and security against attacks.

Publications

  • Shilpa Chaturvedi, Sahil Tyagi and Yogesh Simmhan, “Collaborative Reuse of Streaming Dataflows in IoT Applications,” in IEEE International Conference on eScience, 2017 (To Appear)

Benchmarking Fast Data Platforms

Anshu Shukla, Shilpa Chaturvedi

Motivation

Many commercial and open source distributed stream processing platforms are available, each with its own design constraints, capabilities and limitations. This creates a need to benchmark such Big Data platforms on performance and scalability metrics. A benchmark provides a baseline that lets Big Data researchers and developers uniformly compare DSPS platforms across different domains and workloads.

Contributions

We classify different characteristics of streaming applications, their composition semantics, and their data sources. We propose categories of tasks that are essential for IoT applications and the key features of input data streams they operate upon. We identify performance metrics of DSPS that are necessary to meet the latency and scalability needs of streaming IoT applications. We propose the RIoTBench real-time IoT benchmark for DSPS based on representative micro-benchmark tasks. This work extends our earlier work, published as ‘Benchmarking Distributed Stream Processing Platforms for IoT Applications’.

Earlier, we proposed a stream processing workload, based on the Aadhaar enrollment and authentication applications, as a Big Data benchmark for distributed stream processing systems. We describe the application composition, and characterize their task latencies and selectivity, and data rate and size distributions, based on real observations. We also validate this benchmark on Apache Storm using synthetic streams and simulated application logic.
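
As a small illustration of what characterizing a task can involve, the sketch below derives a mean latency and a selectivity from per-message observations. It assumes selectivity means the average number of output messages emitted per input message. The record layout and the sample values are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def characterize(records):
    """Summarize one task from per-message observations.

    records: list of (latency_ms, outputs_emitted), one entry per input message.
    Returns the mean latency and the selectivity (outputs per input).
    """
    latencies = [latency for latency, _ in records]
    emitted = sum(outputs for _, outputs in records)
    return {
        "mean_latency_ms": mean(latencies),
        "selectivity": emitted / len(records),
    }


# Hypothetical observations for a filter-like task that drops half its input.
observations = [(2.0, 1), (4.0, 0), (3.0, 1), (3.0, 0)]
profile = characterize(observations)
print(profile)
```

Summaries like these are what a synthetic workload needs in order to imitate a task without running its real application logic.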

Software

Publications

  • Yogesh Simmhan, Anshu Shukla and Arun Verma, “Benchmarking Fast Data Platforms for the Aadhaar Biometric Database,” in Workshop on Big Data Benchmarking (WBDB), 2015. doi:10.1007/978-3-319-49748-8_2
  • Anshu Shukla and Yogesh Simmhan, “Benchmarking Distributed Stream Processing Platforms for IoT Applications,” in TPC Technology Conference on Performance Evaluation and Benchmarking (TPCTC), co-located with International Conference on Very Large Data Bases (VLDB), 2016. doi:10.1007/978-3-319-54334-5_7
  • Anshu Shukla, Shilpa Chaturvedi and Yogesh Simmhan, “RIoTBench: A Real-time IoT Benchmark for Distributed Stream Processing Platforms,” in Concurrency and Computation: Practice and Experience, 2017. CoRR abs/1701.08530