There is a heightened focus on Big Data management and analytics as an essential part of Data Science. Complex applications that operate on Big Data are composed using programming abstraction specialized for specific types of data or analytics to ease the development of composable applications, and map them to an execution runtime on specific hardware. While MapReduce is one of most popular such programming model for tuple-based data, scalable abstractions to support the Velocity and Variety dimensions are still gaining traction. In this context, we examine abstractions for analytics over continuous data streams and interconnected graph data sets, as well as distributed algorithms that can effectively use these programming models.puzzle

Distributed Graph Processing

Graph datasets are becoming pervasive, be they from social or transportation networks, with dynamic and time-series graphs being common. Billion node/edge graphs are not uncommon. Analytics over large, complex, interconnected network data require special abstractions that go beyond MapReduce, such as vertex-centric programming models like Google’s Pregel and Apache Giraph. However, Vertex-centric graph computation have performance limitations while also failing to capture the temporal aspects required for dynamic graphs. Our work on subgraph-centric programming models that operate over time-series graphs, as part of the GoFFish Project, address several novel research contributions for distributed graph processing, including developing temporal graph algorithms, runtime scheduling and partition, and querying over such graphs.

Distributed Stream & Event Processing

There is wide-spread data arriving constantly from sensors as the Internet of Things becomes a reality. Smart Utilities, with smart power and water meters can sample energy and eater consumption for millions of users in realtime, and form cyber-physical systems (CPS) that have an OODA (Observe Orient Decide Act) cycle requiring online analysis. Continuous Dataflows or Distributed Stream Processing systems offer programming abstractions for flexible composing of streaming applications using building-block tasks. There is limited support for continuous execution over data streams, with support for powerful constructs as well as allowing for transparent execution at scale [iv],[v],[vi],[vii],[viii]. Open problems in continuous dataflows [ix] include support for including alternate logic for steering computation in dynamic applications [x], offering users control over QoS and cost trade-offs, performing online updates while ensuring consistency and reliability [xi], tracking provenance for verifiability , spanning low level streams and higher order events.