Distributed Graph Storage, Processing and Querying

GoFFish: Subgraph-centric Distributed Graph Processing

Motivation

Vertex-centric distributed graph programing models, such as Google’s Pregel/Apache Giraph, are similar to MapReduce but for large graph datasets that number in the billions of vertices and edges. Here, the computation logic is developed from the perspective of a single vertex and executed in a Bulk Synchronous Parallel (BSP) pattern. Apache Giraph is a popular open-source implementation of this abstraction, while Apache Hama and Spark’s GraphX also support it. While the vertex-level granularity of Pregel exposes significant data parallelism, it is also a limitation — the number of supersteps taken for, say, traversal algorithms approaches the diameter of the graph, and the messaging overhead between vertices can be high. Further, we also fail to leverage the large corpus of existing shared-memory graph algorithms.

Contributions

We have proposed a subgraph-centric distributed processing model in GoFFish that deliver significant performance benefits compared to Giraph. Here, a weakly-connected component in the graph (which we call subgraph) is the unit of execution of the user logic. This allows shared memory graph algorithms to be easily leveraged, unlike a partition-centric approach that Giraph++ takes.  It also avoids message-based communication between vertices in the same subgraph. The subgraphs themselves follow the Pregel model for iterative execution across supersteps, but converge much more quickly, typically the diameter of the metagraph that is formed by this coarser structure. GoFFish v3 is available as an open source platform, and has been fully re-implemented to separate the subgraph-centric API from the runtime, with the algorithms developed just using the APIs. The runtime engine itself is now mapped to Apache Giraph and also Apache Hama to leverage their tooling. We have also developed several novel distributed subgraph-centric graph algorithms that is available as part of the codebase, and described in our publications.

GoFFish v1 was developed at USC by Profs.Yogesh Simmhan and Viktor Prasanna’s group as part of a DARPA XDATA grant. It continues to be developed and maintained by Prof. Simmhan and the DREAM:Lab at IISc.

Software

Publications

  • Y. Simmhan, N. Choudhury, C. Wickramaarachchi, A. Kumbhare, M. Frincu, C. Raghavendra, and V. Prasanna, “Distributed Programming over Time-series Graphs,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015. doi:10.1109/IPDPS.2015.66
  • R. Dindokar, N. Choudhury, and Y. Simmhan, “Analysis of Subgraph-centric Distributed Shortest Path Algorithm,” in International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics (ParLearning), 2015. doi:10.1109/IPDPSW.2015.87
  • N. C. Badam and Y. Simmhan, “Subgraph Rank: PageRank for SubgraphCentric Distributed Graph Processing,” in International Conference on Management of Data (COMAD), 2014.
    [Download PDF]
  • Y. Simmhan, A. Kumbhare, C. Wickramaarachchi, S. Nagarkar, S. Ravi, C. Raghavendra, and V. Prasanna, “GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics,” in International European Conference on Parallel Processing (EuroPar), 2014. doi:10.1007/978-3-319-09873-9_38
  • M. Redekopp, Y. Simmhan, and V. K. Prasanna, Optimizations and Analysis of BSP Graph Processing Models on Public Clouds, IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013

GoDB: Distributed Querying & Storage of Dynamic Graphs

Abhilash Sharma, Jayanth Kalyanasundaram, Sayandip Sarkar

Motivation

Property Graphs with rich attributes over vertices and edges are becoming common. Querying and mining such linked Big Data is important for knowledge discovery and mining. Distributed graph platforms like Pregel focus on batch execution on commodity clusters. But exploratory analytics requires platforms that are both responsive and scalable. As such, there is inadequate work on distributed graph databases that offer the scalability of batch systems, and the query capabilities and short response time O(secs) required for interactive processing. This ability is necessary for interactive knowledge mining to discover new concepts and relationships, to perform queries over the graph to drive web and social network searches, and for declarative specification of graph analytics. We address this gap by extending our subgraph-centric distributed graph processing platform, GoFFish, with support for declarative graph querying that is both responsive and scalable.

Another factor to take into consideration when modelling social networks, road networks etc. using property graphs is their temporally changing nature. Thus, work is being done on a novel temporal property graph storage framework HIPPO(Hierarchy, fIlter, Project, Partition, Order) that makes use of the said functions to define how graph entities will be stored in the filesystem. 

Contributions

We propose Graph-oriented Database (GoDB), a distributed graph database that supports declarative queries over large property graphs. GoDB builds upon our GoFFish subgraph-centric batch processing platform, leveraging its scalability while using execution heuristics to offer responsiveness. The GoDB declarative query model supports vertex, edge, path and reachability queries, and this is translated to a distributed execution plan on GoFFish. We also propose a novel cost model to choose a query plan that minimizes the execution latency. We evaluate GoDB deployed on the Azure IaaS Cloud, over real-world property graphs and for a diverse workload of 500 queries. These show that the cost model selects the optimal execution plan at least 80% of the time, and helps GoDB weakly scale with the graph size. A comparative study with Titan, a leading open-source graph database, shows that we complete all queries, each in < 1.6 secs, while Titan cannot complete up to 42% of some query workloads.

With HIPPO we aim to contribute the following:

  • Separate logical view of temporal property graph from their physical storage.
  •  Efficient storage on external block store:e.g. HDFS, etc.
  •  Leverage existing block storage patterns :Parquet, Columnar, RLE, Indexes
  •  Compose higher order storage designs:Copy+Log, Delta Update, Immortal Graph, etc.

This work is supported by a faculty grant from NetApp.

Publications

  • N. Jamadagni and Y. Simmhan, “GoDB: From Batch Processing to Distributed Querying over Property Graphs,” in IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), 2016. doi:10.1109/CCGrid.2016.105

 

Modeling & Elastic Scheduling of Graph Analytics

Ravikant Dindokar & Neel Choudhury

Motivation

Executing distributed algorithms on large graphs is time consuming. As a result, having estimates on the computational, memory, I/O and coordination complexity of different types of graphs for diverse algorithms is essential to understand their behavior before executing them, and to also plan their execution on distributed and elastic computing resources. While the complexity of graph algorithms have been well studied, there is limited work on analyzing the complexity for algorithms designed using popular component centric graph programming models like Pregel and GoFFish.

Contributions

We have proposed a meta-graph model as a coarse-grained sketch for representing large partitioned graphs, where meta-vertices are connected components within a partition and meta-edges represent links that exist between vertices in the meta-vertices. We have analyzed their various time and space complexities for distributed vertex and subgraph-centric algorithms algorithms such as Breadth First Search and PageRank, for spatial and power-law graphs. We have also used the outcomes of these models to plan the execution of non-stationary graph algorithms by changing the partition mappings to an elastic number of Virtual Machines in the Cloud to reduce the cost for execution while minimizing the increase in the execution time.

Publications

  • R. Dindokar and Y. Simmhan, “Elastic Partition Placement for Non-stationary Graph Algorithms,” in IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid), 2016. doi:10.1109/CCGrid.2016.97
  • R. Dindokar, N. Choudhury, and Y. Simmhan, “A Meta-graph Approach to Analyze Subgraph-centric Distributed Programming Models,” in IEEE International Conference on Big Data (Big Data), 2016. doi:10.1109/BigData.2016.7840587
  • R. Dindokar, N. Choudhury, and Y. Simmhan, “Analysis of Subgraph-centric Distributed Shortest Path Algorithm,” in International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics (ParLearning), 2015. doi:10.1109/IPDPSW.2015.87