May 30, 2014

Here is a post I just did on the gremlin-users Google Group. Thought it might be of interest to a wider audience as well.


On Thursday, May 29, 2014 12:30:20 PM UTC-6, Jack wrote:

Would you and Marko care to explain the big differentiators between Titan and BigData?



I have never done a comprehensive side-by-side with Titan/Cassandra but I can tell you a little about what Bigdata offers. The genesis for Bigdata was the BigTable paper Google published back in 2006. At that time we were very interested in using graphs, and particularly the semantic web, to facilitate dynamic federation and semantic alignment of heterogeneous data within a schema-flexible framework, all at scale. The landscape for graph databases was quite different back then, and within the semantic web community the real focus was semantics, not scale. Using the principles Google outlined for BigTable, we designed Bigdata from the ground up to be a massively scalable distributed database specifically for graphs.

Bigdata at its core is really a KV store – all data is ultimately persisted inside unsigned byte[] key -> unsigned byte[] value BTree indices. These indices can be key-range sharded, dynamically distributed, and load balanced across a cluster. On top of the KV layer, Bigdata has a graph database layer that supports the RDF data model. Graph data is triple indexed (three indices over different orderings of subject, predicate, and object) to achieve perfect access paths for all eight possible graph access patterns. This concept of covering indices was developed by Andreas Harth and Stefan Decker back in 2005 as part of their YARS database. Using these indices, we can support arbitrary graph pattern joins at the core of our query engine, which supports the high-level query language SPARQL, the only open standard we have as a community for graph query. In the scale-out mode of the database, the query engine pushes intermediate solutions out across the cluster and executes joins at the data. I think this is probably the key differentiating feature over other graph databases written on top of existing KV stores like Cassandra and Accumulo – these implementations tend to use a client-based query controller (if they provide a query controller at all) that pulls data across the network and does joins at the controller, instead of pushing the computation out to the data. This client-controller strategy can result in a huge amount of wasted IO and network traffic.
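To make the covering-index idea concrete, here is a minimal sketch in Python. It is not Bigdata's code: the names `INDEX_ORDERS`, `choose_index`, and `TripleStore` are made up for illustration, sorted lists stand in for B+Tree indices, and plain string tuples stand in for unsigned byte[] keys. It only shows why three key orders (SPO, POS, OSP) are enough to answer all eight bound/unbound combinations with a single prefix scan.

```python
from bisect import bisect_left

# Hypothetical names. Each index stores the same triples under a different key order.
# The numbers are positions in an (s, p, o) triple.
INDEX_ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}

# For each of the 8 bound/unbound combinations of (s, p, o), the index whose
# key order starts with exactly the bound positions.
ACCESS_PATHS = {
    (True, True, True): "SPO",
    (True, True, False): "SPO",
    (True, False, False): "SPO",
    (False, True, True): "POS",
    (False, True, False): "POS",
    (False, False, True): "OSP",
    (True, False, True): "OSP",
    (False, False, False): "SPO",  # full scan, any index works
}


def choose_index(s=None, p=None, o=None):
    """Return the name of the index that serves this access pattern."""
    return ACCESS_PATHS[(s is not None, p is not None, o is not None)]


class TripleStore:
    def __init__(self, triples):
        # A sorted list of re-ordered tuples stands in for each B+Tree index.
        self.indices = {
            name: sorted(tuple(t[i] for i in order) for t in triples)
            for name, order in INDEX_ORDERS.items()
        }

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern with one prefix scan; None means unbound."""
        pattern = (s, p, o)
        name = choose_index(s, p, o)
        order = INDEX_ORDERS[name]
        # The bound components form a prefix of the chosen key order.
        prefix = []
        for position in order:
            if pattern[position] is None:
                break
            prefix.append(pattern[position])
        prefix = tuple(prefix)
        keys = self.indices[name]
        results = []
        for key in keys[bisect_left(keys, prefix):]:
            if key[:len(prefix)] != prefix:
                break  # left the key range for this prefix
            triple = [None, None, None]
            for key_pos, triple_pos in enumerate(order):
                triple[triple_pos] = key[key_pos]
            results.append(tuple(triple))
        return results
```

For example, `TripleStore(triples).match(s="alice", o="bob")` would be served by the OSP index, because the bound object and subject are the leading components of that key order.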

Bigdata was designed from the ground up to perform well as both a single-server and distributed database. Bigdata as a single-server database supports up to 50 billion RDF statements (each statement translates roughly to one vertex, edge, or property). Single-server mode comprises several deployment modes as well – you can stand up Bigdata as an actual server and access it via its REST API. There is a BigdataGraphRemote Blueprints implementation that wraps this mode. You can also use Bigdata embedded inside your application’s JVM. There is a BigdataGraphEmbedded Blueprints implementation that wraps this mode.

I suppose the key difference between Titan and Bigdata might be Bigdata’s query optimizer. With a query optimizer you can give the database an operator tree in the form of a declarative query and the query optimizer can do things like use cardinalities of the different operators to determine an optimal execution strategy. Without this, all a database can do is exactly what you tell it to do in exactly the order you tell it to do it, which oftentimes is not the best order at all. This leads to “hand-optimization” to get queries to run quickly (I want to combine these five different predicates to select vertices or edges but I have to re-order them myself to get the best performance). We’ve done a little bit of work exposing the query optimizer in the BigdataGraph implementation – we provide a custom GraphQuery implementation that will let you specify predicates to be executed by the query engine with the help of the query optimizer. We would like to expose other aspects of Bigdata’s query engine/optimizer through Blueprints as well.
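A toy illustration of why predicate order matters, in Python. This is only a sketch under simplifying assumptions, not how Bigdata's optimizer is implemented: `Predicate`, `run_in_order`, and `optimize` are hypothetical names, and the `estimated_matches` field stands in for the cardinality estimates a real optimizer would obtain from its indices.

```python
from typing import Callable, NamedTuple


class Predicate(NamedTuple):
    name: str
    test: Callable[[dict], bool]
    estimated_matches: int  # assumed to come from index statistics


def run_in_order(items, predicates):
    """Apply predicates in the given order.

    Returns (matching items, number of predicate evaluations performed).
    """
    evaluations = 0
    current = list(items)
    for pred in predicates:
        evaluations += len(current)
        current = [item for item in current if pred.test(item)]
    return current, evaluations


def optimize(predicates):
    """Run the most selective predicate (fewest estimated matches) first."""
    return sorted(predicates, key=lambda pred: pred.estimated_matches)


vertices = [
    {"id": i, "type": "person" if i % 2 == 0 else "place", "age": i}
    for i in range(100)
]
as_written = [
    Predicate("type = person", lambda v: v["type"] == "person", 50),
    Predicate("age = 10", lambda v: v["age"] == 10, 1),
]

same_result, cost_as_written = run_in_order(vertices, as_written)
result, cost_optimized = run_in_order(vertices, optimize(as_written))
```

Both orders return the same vertex, but the reordered plan evaluates fewer predicates because the selective `age = 10` filter shrinks the working set first. That reordering is the work you end up doing by hand when there is no optimizer.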

I am also not aware of an HA architecture for Titan, but this might be my own ignorance. Bigdata has a high-availability architecture based on a quorum model. In this mode all data is kept on all nodes so queries are answered locally and can be load-balanced to achieve perfect linear scaling for reads. Data is replicated using a low-level write replication pipeline with a 2-phase commit. The cluster requires a simple majority quorum to be available. Nodes can go offline temporarily and come back, or new nodes can join in their place. Re-joining or new nodes will catch up to the current state using a playback of the missing commits from the current quorum leader. If the quorum leader goes down, this role will fail over automatically to another node.
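The simple-majority rule and the catch-up step can be sketched in a few lines of Python. The function names and the list-based commit log are assumptions made for illustration only; they are not Bigdata's HA API, and the real replication pipeline works at a much lower level.

```python
def majority(replication_factor):
    """Smallest number of nodes that forms a simple majority."""
    return replication_factor // 2 + 1


def quorum_met(replication_factor, joined_nodes):
    """True if enough nodes are joined for the cluster to be available."""
    return joined_nodes >= majority(replication_factor)


def missing_commits(leader_log, follower_commit_count):
    """Commits a re-joining (or new) node must replay from the leader.

    leader_log is the leader's ordered list of commits (hypothetical
    representation); follower_commit_count is how many of them the
    follower has already applied.
    """
    return leader_log[follower_commit_count:]


def catch_up(leader_log, follower_log):
    """Replay the missing commits onto the follower's log and return it."""
    follower_log.extend(missing_commits(leader_log, len(follower_log)))
    return follower_log
```

With three nodes, `majority(3)` is 2, so the cluster stays available with one node offline; when that node returns, it replays whatever the leader committed in the meantime. A brand-new node is the same case with an empty log.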

One other interesting feature of Bigdata is an integrated graph analytics engine that supports the Gather-Apply-Scatter (GAS) API. GAS lets you do interesting traversal-based analysis – generally operations written on top of BFS or multiple BFS passes. The canonical examples are PageRank and Single-Source Shortest Path (SSSP). The GAS engine is integrated directly into Bigdata’s query engine, although this functionality is not yet exposed through the Blueprints API. I’ve spoken with Marko about how this might get exposed in TinkerPop3. In the meantime you can always work with your graph data inside Bigdata through any of the other APIs as well (you are not limited to only Blueprints or only Sesame or only the Bigdata Workbench for a particular database instance).
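For readers who have not seen the Gather-Apply-Scatter pattern, here is a small single-machine sketch of SSSP written in that style, in Python. It does not use Bigdata's GAS API; `gas_sssp` and its adjacency-dict input are hypothetical, and it assumes non-negative edge weights.

```python
import math


def gas_sssp(edges, source):
    """Single-source shortest paths in Gather-Apply-Scatter style.

    edges maps a vertex to a list of (neighbor, weight) out-edges.
    Assumes non-negative weights. Returns a dict of vertex -> distance.
    """
    vertices = set(edges)
    for out_edges in edges.values():
        vertices.update(v for v, _ in out_edges)

    # Gather reads over in-edges, so build the reverse adjacency once.
    in_edges = {v: [] for v in vertices}
    for u, out_edges in edges.items():
        for v, weight in out_edges:
            in_edges[v].append((u, weight))

    dist = {v: math.inf for v in vertices}
    dist[source] = 0
    frontier = {v for v, _ in edges.get(source, [])}

    while frontier:
        updates = {}
        for v in frontier:
            # Gather: best distance offered by any in-neighbor.
            candidate = min(
                (dist[u] + weight for u, weight in in_edges[v]),
                default=math.inf,
            )
            # Apply: keep the candidate only if it improves the vertex state.
            if candidate < dist[v]:
                updates[v] = candidate

        next_frontier = set()
        for v, new_dist in updates.items():
            dist[v] = new_dist
            # Scatter: schedule out-neighbors of every vertex that changed.
            next_frontier.update(nbr for nbr, _ in edges.get(v, []))
        frontier = next_frontier

    return dist
```

Each pass of the loop is one round over the active frontier, which is why this kind of analysis is usually described as BFS or repeated BFS passes. The loop ends when a round changes no vertex.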

We also did a literature review of the state of the art in graph database technology about a year ago that might further address your question. There is an entire section on KV stores and Map/Reduce systems.

The Blueprints implementation over Bigdata is very new and we are very excited to get it out into the community and get feedback on it. We welcome comments and suggestions and please do let us know if you find any problems. The best thing to do with an issue is to post it to our trac system, preferably with a small self-contained test case that demonstrates the issue.


Mike Personick
Core Bigdata Development Team