Resource Allocation & Rebalancing for Distributed Stream Processing
Stream processing frameworks such as Apache Storm and Apache Spark Streaming are widely used to process real-time data. These distributed stream processors tackle the Velocity dimension of Big Data, which refers to the need to rapidly process data that arrives continuously as streams of messages or events. Distributed Stream Processing Systems (DSPS) are distributed programming and runtime platforms that let users define compositions of dataflow logic that execute on distributed resources over streams of incoming messages.
A DSPS uses commodity clusters and Cloud Virtual Machines (VMs) for its execution. To meet the required performance for these applications, the DSPS needs to schedule the dataflows efficiently over the resources. Despite their growing use, resource scheduling for DSPSs tends to be done in an ad hoc manner, favoring empirical and reactive approaches over a model-driven and analytical one. Such empirical strategies may arrive at an approximate schedule for the dataflow that needs further tuning to meet the quality of service.
We propose a model-based scheduling approach that uses performance profiles and benchmarks developed for tasks in the dataflow to plan both the resource allocation and the resource mapping that together form the schedule planning process. We propose the Model Based Allocation (MBA) and the Slot Aware Mapping (SAM) approaches, which effectively use knowledge of the performance model of the logic tasks to provide efficient and predictable scheduling behavior. We implement and validate these algorithms on the popular open source Apache Storm DSPS for several micro and application dataflows. We also see that our strategies offer predictable behavior, ensuring that the expected and actual rates supported and resources used match closely. This can enable deterministic schedule planning even under dynamic conditions.
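As a simplified illustration of model-based allocation, the sketch below (with hypothetical task names and profiled rates, not the actual MBA algorithm) computes how many instances of each task are needed to sustain a target input rate, given each task's profiled peak rate and selectivity from offline benchmarks. It assumes a linear dataflow for simplicity:

```python
import math

def model_based_allocation(tasks, input_rate):
    """tasks: ordered dict of task name -> (peak msgs/sec per instance,
    selectivity). Messages flow through tasks in order; selectivity
    scales the rate entering the next task. Returns instances per task."""
    allocation = {}
    rate = input_rate
    for name, (peak_rate, selectivity) in tasks.items():
        # Instances needed so the aggregate peak rate covers the input rate
        allocation[name] = math.ceil(rate / peak_rate)
        rate *= selectivity  # output rate feeding the downstream task
    return allocation

# Example: a 3-task linear dataflow profiled offline (hypothetical values)
tasks = {
    "parse":   (500.0, 1.0),   # 1:1 selectivity
    "filter":  (800.0, 0.4),   # drops 60% of messages
    "predict": (100.0, 1.0),
}
print(model_based_allocation(tasks, 1000.0))
# {'parse': 2, 'filter': 2, 'predict': 4}
```

The key property this gives is predictability: the planned instance counts follow directly from the profiled model rather than from trial-and-error tuning.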
Besides this static scheduling, we also examine the ability to dynamically consolidate tasks onto fewer VMs when the load on the dataflow decreases or the VMs get fragmented. We propose reliable task migration models for Apache Storm dataflows that can rapidly move the task assignments in the cluster and resume the dataflow execution without any message loss.
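Consolidation onto fewer VMs can be viewed as a bin-packing problem. Below is a minimal first-fit-decreasing sketch under assumed per-task fractional CPU loads and uniform VM capacity; it is illustrative only and is not our migration model:

```python
def consolidate(task_loads, vm_capacity):
    """First-fit-decreasing packing of task CPU loads onto VMs,
    to consolidate a fragmented cluster onto fewer machines."""
    vms = []   # vms[i] is the list of tasks placed on VM i
    free = []  # free[i] is the remaining capacity of VM i
    for task, load in sorted(task_loads.items(), key=lambda kv: -kv[1]):
        for i, f in enumerate(free):
            if load <= f:        # first VM with enough free capacity
                vms[i].append(task)
                free[i] -= load
                break
        else:                    # no fit: open a new VM
            vms.append([task])
            free.append(vm_capacity - load)
    return vms

# Four tasks with hypothetical loads fit onto two unit-capacity VMs
print(consolidate({"a": 0.6, "b": 0.5, "c": 0.4, "d": 0.3}, 1.0))
# [['a', 'c'], ['b', 'd']]
```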
- Custom Scheduler for Storm: Storm-CustomScheduler
- Anshu Shukla, Tarun Sharma and Yogesh Simmhan, “Characterizing Distributed Stream Processing Systems for IoT Applications (Extended Abstract),” in Workshop on Architectural Support and Middleware for InfoSymbiotics/Dynamic Data Driven Applications Systems (DDDAS), co-located with High Performance Computing Conference (HiPC), IEEE, 2015. doi:10.1109/HiPCW.2015.22
- Toward Reliable and Rapid Elasticity for Streaming Dataflows on Clouds, IEEE International Conference on Distributed Computing Systems (ICDCS), 2018
- Model-driven Scheduling for Distributed Stream Processing Systems, Journal of Parallel and Distributed Computing (JPDC), 2018
Dataflow Reuse for Distributed Stream Processing
Shilpa Chaturvedi, Sahil Tyagi
IoT deployments comprising sensors and actuators that collect observational data provide continuous streams of data, often called Streaming or Fast Data. Smart Cities use such IoT technologies to provide effective citizen services and improve the efficiency of the utility infrastructure. A key aspect of such Smart City applications is the availability of streaming data from possibly millions of sensors and, more importantly, the need to analyze and process them in near real-time to make decisions or provide services. Distributed Stream Processing Systems (DSPS) are data platforms tailored to compose dataflows that execute constantly over one or more data streams. Such platforms are commonly used to compose IoT and Smart City applications hosted on the Cloud, and to access sensor streams that have been pulled from the edge into the data-center.
Such cloud-hosted DSPS provide a scalable analytics engine for composing and executing these continuous dataflows, collocated with the data streams. At the same time, there can be duplication of tasks across the numerous dataflows that operate on these shared streams, where each may perform similar data pre-processing (parsing, reformatting, unit conversion), quality checks (cleaning, outlier detection, interpolation), and even analytics (ARIMA time-series predictions, moving window averages). This offers the opportunity to reuse parts of the logic among different dataflows to avoid recomputation, reducing both the cost of Cloud resources for app developers and end-users, and the deployment time.
We formally define the problem of streaming dataflow reuse, including the equivalence between tasks present in dataflows. We explore the reuse of a stream generated from a sub-DAG in other dataflows that have an identical sub-DAG as a prefix. We propose algorithms for merging a submitted streaming dataflow with deployed dataflows at specific points of equivalence and, similarly, unmerging a merged dataflow when it is removed, while guaranteeing their output stream consistency. We implement our reuse algorithms in Apache Storm, and validate them for real and synthetic Smart Utility applications and public OPMW workflows.
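As a simplified illustration of prefix reuse, the sketch below reduces dataflows to linear task chains (the actual algorithms operate on DAGs) and finds the prefix after which a submitted dataflow can subscribe to a deployed dataflow's intermediate stream instead of recomputing it. The task identifiers and configurations shown are hypothetical:

```python
def reuse_prefix(deployed, submitted):
    """Length of the longest common prefix of two linear dataflows,
    where tasks are equivalent if their (logic id, config) pairs match.
    The submitted dataflow can tap the deployed stream after this prefix."""
    n = 0
    for a, b in zip(deployed, submitted):
        if a != b:
            break
        n += 1
    return n

deployed  = [("parse", "csv"), ("clean", "default"), ("arima", "w=10")]
submitted = [("parse", "csv"), ("clean", "default"), ("avg", "w=5")]
print(reuse_prefix(deployed, submitted))
# 2: the submitted dataflow reuses the output stream of parse -> clean
```

In the full problem, unmerging on removal must also preserve output stream consistency for the dataflows that remain deployed.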
Currently, we are also exploring the reuse of tasks as a service among different dataflows deployed in the system. This allows reuse at the granularity of individual tasks. The key challenge here is that since tasks can have state, we need to ensure that the state of each dataflow is not affected by the state of other dataflows reusing the same task. We also want to explore how provenance can play a role in tracing the consumption of resources among dataflows. This will help in fair billing of resources to the dataflows from different users that are being reused. It can also help in validating whether the output generated from dataflows with reuse is exactly the same as the one generated when they run individually, without reuse.
Another potential benefit of reuse is in resilient processing of the dataflows, both from a reliability perspective as well as security against attacks.
- Shilpa Chaturvedi, Sahil Tyagi and Yogesh Simmhan, “Collaborative Reuse of Streaming Dataflows in IoT Applications” in IEEE International Conference on eScience, 2017
- Shilpa Chaturvedi and Yogesh Simmhan, “Toward Resilient Stream Processing on Clouds using Moving Target Defense”, in IEEE International Symposium on Real-Time Computing (ISORC), 2019 (Invited Paper)
Benchmarking Fast Data Platforms
Anshu Shukla, Shilpa Chaturvedi
There are various commercial and open source distributed stream processing platforms available, each with its own design constraints, capabilities, and limitations. In such a scenario, there is often a need to benchmark such Big Data platforms for performance and scalability metrics. This provides a baseline for Big Data researchers and developers to uniformly compare DSPS platforms across different domains and workloads.
We classify the different characteristics of streaming applications, their composition semantics, and their data sources. We propose categories of tasks that are essential for IoT applications and the key features of the input data streams they operate upon. We identify performance metrics of DSPS that are necessary to meet the latency and scalability needs of streaming IoT applications. We propose RIoTBench, a real-time IoT benchmark for DSPS based on representative micro-benchmark tasks. This work extends our earlier work published as ‘Benchmarking Distributed Stream Processing Platforms for IoT Applications’.
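As a simple illustration of two such performance metrics, the sketch below computes end-to-end latency statistics and throughput from per-message emit and sink timestamps. This is a simplification of what a full benchmark harness would measure, and the percentile index used is a basic approximation:

```python
import statistics

def dsps_metrics(records, window_sec):
    """Latency and throughput from (emit_ts, sink_ts) pairs, in seconds.
    window_sec is the measurement window over which records were collected."""
    latencies = sorted(sink - emit for emit, sink in records)
    return {
        "median_latency": statistics.median(latencies),
        "p99_latency": latencies[int(0.99 * (len(latencies) - 1))],
        "throughput": len(records) / window_sec,  # messages per second
    }

# Three messages observed over a 1-second window (hypothetical timestamps)
m = dsps_metrics([(0.0, 0.1), (0.5, 0.7), (1.0, 1.3)], window_sec=1.0)
print(m)  # median latency ~0.2 s, throughput 3 msgs/sec
```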
Earlier, we proposed a stream processing workload, based on the Aadhaar enrollment and authentication applications, as a Big Data benchmark for distributed stream processing systems. We describe the application composition, and characterize the task latencies, selectivities, and data rate and size distributions based on real observations. We also validate this benchmark on Apache Storm using synthetic streams and simulated application logic.
- RIoTBench: A Real-time IoT Benchmark for DSPS: https://github.com/dream-lab/riot-bench
- Bm-IoT: An IoT Benchmark for DSPS: https://github.com/dream-lab/bm-iot
- Yogesh Simmhan, Anshu Shukla and Arun Verma, “Benchmarking Fast Data Platforms for the Aadhaar Biometric Database,” in Workshop on Big Data Benchmarking (WBDB), 2015. doi:10.1007/978-3-319-49748-8_2
- Anshu Shukla and Yogesh Simmhan, “Benchmarking Distributed Stream Processing Platforms for IoT Applications,” in TPC Technology Conference on Performance Evaluation and Benchmarking (TPCTC), co-located with International Conference on Very Large Data Bases (VLDB), 2016. doi:10.1007/978-3-319-54334-5_7
- Anshu Shukla, Shilpa Chaturvedi and Yogesh Simmhan, “RIoTBench: A Real-time IoT Benchmark for Distributed Stream Processing Platforms,” in Concurrency and Computation: Practice and Experience, 2017. CoRR abs/1701.08530
CEP and Streams
Continuous dataflows are generally directed acyclic graphs, composed of tasks that process messages with some user logic and send output messages to other tasks. Many mission-critical applications, such as smart power grids and smart transportation, ingest and process messages in real-time. Updates to such long-running applications may be needed, for example for bug fixes or feature enhancements. Currently, the application suffers a significant downtime while performing such dataflow updates, which is unacceptable for mission-critical applications. We are investigating new update strategies as well as enhancing existing continuous dataflow update strategies.
Sensors that generate immense volumes of data are growing in both physical and virtual spaces. The continuous streaming data they generate contributes to the Velocity aspect of Big Data. Complex Event Processing (CEP) detects event patterns in real-time to extract meaningful information from data streams: CEP engines accept continuous queries and detect patterns over event streams. Analyzing this streaming data is needed to make meaningful decisions, so processing and visualizing real-time streaming data is highly important. Visualizing streaming Big Data helps communicate information clearly and efficiently in the form of graphs and charts, and this pictorial representation helps end users understand the situation better. The visual representation of the insights gained from analysis should update as the data changes and display the newer results. We are investigating visual analytics of streaming Big Data in real-time.
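As a toy example of CEP-style pattern detection, the sketch below flags values that spike above a moving-window average over a numeric stream. This is illustrative only; real CEP engines express such patterns through far richer continuous query languages:

```python
from collections import deque

def detect_spikes(stream, window=5, factor=2.0):
    """Flag indices of events more than `factor` times the moving-window
    average of the preceding `window` values."""
    win = deque(maxlen=window)  # sliding window of recent values
    alerts = []
    for i, v in enumerate(stream):
        if len(win) == window and v > factor * (sum(win) / window):
            alerts.append(i)
        win.append(v)
    return alerts

# A steady stream with one spike at index 5 (hypothetical readings)
print(detect_spikes([10, 11, 9, 10, 10, 50, 10]))
# [5]
```

A visual analytics layer would consume such alerts and refresh charts as the window slides forward.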
- T. Sharma and Y. Simmhan, “Online-Update Strategies for Distributed Stream Processing in Apache Storm,” at 7th Student Research Symposium, in IEEE Conference on High Performance Computing (HiPC), 2014
- N. Govindarajan, Y. Simmhan, N. Jamadagni and P. Misra, “Event Processing across Edge and Cloud for Internet of Things Applications,” at 20th International Conference on Management of Data (COMAD), 2014