Big data management and analytics has gained significant traction recently and its value is being recognized in the scientific and web communities. Of the commonly references dimensions of Volume, Velocity and Variety for Big Data, the former – Volume – is well examined while distributed computing platforms and frameworks to manage, collect, store and analyze data with high Velocity and Variety are less explored. In this space, we are currently investigating scalable Big Data platforms on Clouds & emerging distributed systems for analytics over complex inter-connected Graph datasets, and realtime analytics over ‘Fast Data’ that arrive continuously for IoT applications.
Scalable Graph Analytics
Vertex-centric distributed graph programing models, such as Google’s Pregel/Apache Giraph, are similar to MapReduce but for large graph datasets. Here, the computation logic is developed from the perspective of a single vertex and executed in a Bulk Synchronous Parallel (BSP) pattern. We have proposed a sub-graph centric distributed processing model for graphs that offers the additional flexibility of using shared memory algorithms to deliver significant performance benefits, as demonstrated in our publications. We operate in a commodity cluster/Cloud environment where the graph is k-way partitioned over its vertices across k machines, and independent tasks operate on subgraphs — weakly connected components — within each partition. The GoFFish software platform is a prototype implementation of this abstraction.
As ongoing research, we are modeling the complexity of such subgraph-centric graph algorithms to help predict runtime behavior and optimize their scheduling and partitioning in distributed environments. We are also extending existing shared-memory graph algorithms to subgraph-centric distributed algorithms to significantly improve their performance.
There is a novel and emerging class of dynamic graphs, such as in social networks where new members join (new vertices), new friends are made (new edges), and messages are posted (vertex and edge values change). We explore the temporal dimensions of changing graphs, that we term time-series graphs, with proposed programming abstractions to develop new time-series graph algorithms, perform graph querying, and scale their storage and analysis on distributed systems. This is an extension to GoFFish.
These research outcomes are motivated by and can be applied to domains such as next-generation sequencing in genomics, and analysis of social networks, Internet of Things (IoT).
- Distributed Programming over Time-series Graphs, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015
- Analysis of Subgraph-centric Distributed Shortest Path Algorithm, International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics (ParLearning), 2015
- Subgraph Rank: PageRank for SubgraphCentric Distributed Graph Processing, International Conference on Management of Data (COMAD), 2014
- GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics, International European Conference on Parallel Processing (EuroPar), 2014
- M. Redekopp, Y. Simmhan, and V. K. Prasanna, Optimizations and Analysis of BSP Graph Processing Models on Public Clouds, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013
Distributed Stream & Event Processing
Defining applications over streaming data, i.e. data arriving continuously from instruments and sensors, is of growing importance as the Internet of Things becomes a reality. Such applications span exploratory science and mission-critical operational needs. Continuous Workflows and dataflows are generalized programming abstractions for eScience that offer flexibility in composing data transformation and analytics pipelines. However, there is limited support for continuous and scalable processing of data streams. We propose a continuous dataflow composition model based on a directed task graph whose nodes are the application logic and the directed edges represent channels that stream data between them. Besides primitive dataflow patterns, we support single, streamed and windowed execution, and streaming Map-Reduce and Bulk Synchronous Parallel models. The Floe Continuous Dataflow Engine is a prototype implementation of this model for Cloud environments.
In this context, we are examining research opportunities on reliably scheduling such dataflows on elastic and heterogeneous resources to meet mission critical QoS needs, performing online updates on the dataflow to balance its consistency, reliability and performance, and developing models for defining realtime analytics over such continuous data streams.
These research problems are motivated by pervasive sensor deployments found in cyber-physical systems such as smart power grids,smart transportation and urban monitoring.
- Reactive Resource Provisioning Heuristics for Dynamic Dataflows on Cloud Infrastructure, IEEE Transactions on Cloud Computing (TCC), 2015 [(To Appear)]
- Fault-Tolerant and Elastic Streaming MapReduce with Decentralized Coordination, IEEE International Conference on Distributed Computing Systems (ICDCS) , 2015 [To Appear]
- Event Processing across Edge and the Cloud for Internet of Things Applications, International Conference on Management of Data (COMAD) , 2014
- C. Wickramaarachchi and Y. Simmhan, Continuous Dataflow Update Strategies for Mission-Critical Applications, IEEE International Conference on eScience (eScience), 2013.
- A. Kumbhare, Y. Simmhan, and V. Prasanna. Plasticc: Predictive look-ahead scheduling for continuous dataflows on clouds, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2014.
- A. Kumbhare, Y. Simmhan, and V. Prasanna, Exploiting Cloud Elasticity to Enhance the Value of Dynamic, Continuous Dataflows, IEEE/ACM International Conference for High Performance Computing Networking, Storage, and Analysis (SC), 2013.
- Towards Hybrid Online On-Demand Querying of Realtime Data with Stateful Complex Event Processing, IEEE International Conference on Big Data (BigData), 2013
Scheduling on the Cloud
The research is about developing different algorithms for Job and Dataflow scheduling on public and private Cloud. We are investigating different type of jobs like I/O intensive, compute intensive, massively parallel jobs etc. The goal is to optimization utilization percentage, cost, job completion time, job resilience etc by maintaining resilience. Also of interest are costing models on Clouds, such as spot pricing, and VM placement strategies for reducing the energy footprint.
- V. Kushwaha, and Y. Simmhan, Cloudy with a Spot of Opportunity: Analysis of Spot-Priced VMs for Practical Job Scheduling, in Cloud Computing for Emerging Markets (CCEM), 2014.
- Cost-efficient and Resilient Job Life-cycle Management on Hybrid Clouds, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014
Software Platforms for Internet of Things
Internet of Things (IoT) offers a challenging and emerging application area that can transform our lifestyle and the way physical systems operate. IoT when deployed at scale is information-centric, and generates millions of streams of events from networks of physical and virtual sensors that need to be processed, analyzed and acted upon. Smart Cities are one example of IoT deployments, but these can also be at smaller campus-wide scales. These help manage resources such as power and water efficiently, manage transportation routes to reduce congestion, and even offer smart homes where with smart appliances that can brew your coffee even as you walk towards it! The platforms and analytics required to enable such intelligence are just emerging, and IoT is inherently a distributed system with significant potential for research in this socially relevant space.
Visit the project page at SmartX.cds.iisc.ac.in .
- Event Processing across Edge and the Cloud for Internet of Things Applications, International Conference on Management of Data (COMAD), 2014
- Holistic Measures for Evaluating Prediction Models in Smart Grids, IEEE Transactions on Knowledge and Data Engineering (TKDE) , 27(2), 2015, pp. 475-488
- Cloud-based Software Platform for Data-Driven Smart Grid Management, IEEE/AIP Computing in Science and Engineering , July/August, 2013, pp. 1-11, IEEE and AIP.
- Energy Management Systems: State of the Art and Emerging Trends, IEEE Communications Magazine , 51(1), 2013, pp. 114 – 119, IEEE.
- Scalable Prediction of Energy Consumption using Incremental Time Series Clustering, Workshop on Big Data and Smarter Cities , 2013
M.Tech. & M.E. Research Project Topics (2014-15)
The following research topics are available as final year projects for M.Tech. and M.E. students from CDS department. These projects will place an emphasis on innovative research ideas as well as practical grounding through software prototyping and benchmarking on Cloud and distributed systems. Students will be expected to publish a research paper as a project outcome. As part of the project, students will review research literature, identify novel ideas, explore algorithms and optimizations, and develop software design and prototypes that will be empirically evaluated on real public and private Clouds. Knowledge of Java (or proven expertise in another high level language) is expected.
- Fog Computing for Complex Event Analytics
High velocity event streams data from Big Data applications like Internet of Things need to be processed in a distributed manner across edge devices like smart phones and Raspberry Pis which often the data stream source, and on the Cloud. This form of distributed execution across edge and Cloud is called “Fog Computing”. This project will examine distributed runtimes and scheduling of complex event processing using Fog Computing, and apply it to a Smart Campus project being conducted at IISc.
- Distributed Graph Processing
This project will investigate Big Data platforms and distributed algorithms for large graphs such as social, knowledge and transportation networks. This includes developing new spatio-temporal analytics algorithms for traversals and mining, analyzing and tuning them for distributed execution, and performing comparative benchmarks on real graphs against Apache Giraph using public and private Clouds.
- Big Data Platforms for Internet of Things
Internet of Things (IoT) is becoming a major source of Big Data as millions of sensors are deployed globally. This project will investigate the unique needs of Big Data platforms such as stream processing, NoSQL databases, visual analytics and runtime on smart phones to support IoT applications. The platforms and software architecture will support the Smart Campus project being conducted at IISc.
Students who enroll and out-perform in the ‘Introduction to Cloud Computing’ Course, or the ‘Scalable Systems for Data Science’ Course in Jan semester at CDS will be given preference. If you are interested, contact Yogesh Simmhan.
Research and teaching activities of the lab is supported by grants from the Robert Bosch Centre for Cyber Physical Systems (RBCCPS), Department of Electronics and Information Technology (DeitY), Microsoft Windows Azure in Education, NetApp Inc. through a Faculty Fellowship and Tech Mahindra. Past supporters of our research have included Amazon AWS in Education and Google Cloud Platform.