Feb 24, 2009
 

Bryan has been frantically benchmarking and identifying bottlenecks with the scale-out RDF store on a cluster to which we have access for a limited time. Here is a snippet of what he had to say about it, taken from an email this morning:

We have been running some experimental trials with bigdata on a cluster of 14 machines collecting metrics on RDF data load, RDF closure rates, and RDF query using the LUBM synthetic data set. This is really a shakedown period and I do not believe that these numbers represent the potential of the system. Rather, they should be taken as evidence of the minimum throughput which we might expect to see on a heavily loaded cluster. Our goal is to perform runs on the order of 20B triples. Given the inherent complexity of distributed systems, we are taking this incrementally and doing 1B triple runs right now.

The cluster is a set of relatively hefty machines. Each has 32G of RAM, 8 cores, and around 100G of local disk available to the application. The application runs 13 “data services”, one per blade. The 14th blade runs a transaction service, zookeeper (a distributed lock management service), a “metadata service” which is used to locate index partitions on the data services, and a load balancer service. We are not using a hypervisor or any similar virtualization technology. Each JVM runs on its local machine, and the JVMs communicate over RMI using the APIs established for the various services. All data (other than log files) is stored on the local disks of the machines.

Bigdata is designed as a key-range partitioned database architecture. This means that it dynamically breaks each of the application’s indices into key ranges and (re-)distributes those index partitions across the cluster in order to balance the load on the data services. The RDF database uses five indices: three statement indices and two indices for mapping URIs and literals onto internal identifiers and back into RDF resources.
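
To make the index layout concrete, here is a minimal illustrative sketch in Java. It is not bigdata’s actual code: the class and method names are hypothetical, and the SPO/POS/OSP orderings are assumed as the three statement indices (only POS is named explicitly elsewhere in this post). Once the two dictionary indices have mapped URIs and literals to long identifiers, a triple becomes a fixed-length key under each ordering, and each key is routed to whichever index partition spans its key range.

    import java.nio.ByteBuffer;

    // Illustrative only: encode a triple as a fixed-length key under three statement orderings.
    public class StatementKeySketch {

        // Pack three 64-bit term identifiers into a 24-byte index key.
        static byte[] key(long a, long b, long c) {
            return ByteBuffer.allocate(24).putLong(a).putLong(b).putLong(c).array();
        }

        public static void main(String[] args) {
            long s = 1L, p = 2L, o = 3L; // hypothetical term ids assigned by the dictionary indices

            byte[] spo = key(s, p, o); // subject-major ordering
            byte[] pos = key(p, o, s); // predicate-major ordering (the POS index)
            byte[] osp = key(o, s, p); // object-major ordering

            // The metadata service locates the index partition whose key range contains each key.
            System.out.println(spo.length + " " + pos.length + " " + osp.length);
        }
    }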

A run begins by pre-partitioning the data based on a sample data set (I have been using LUBM U10, with 1.2M triples). This gives us some good initial index partition boundaries, which help us distribute the workload across the cluster from the start. We then start a master on one machine, and it starts data load clients on the various data services. Given N data load clients, each client is responsible for incrementally generating and loading 1/Nth of the data into the distributed database. We took this approach because we lacked sufficient shared disk to store the pre-generated data set; instead, we generate the data dynamically on a “just in time” basis.
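
As a rough sketch of that “just in time” generation scheme (hypothetical names, not bigdata’s code), each of the N clients can claim every Nth LUBM university, so the full data set is generated exactly once across the cluster without any shared disk:

    // Illustrative only: client i of N generates and loads every Nth LUBM university.
    public class DataLoadClientSketch {

        interface TripleSink { void load(int universityId); } // stands in for the generator + loader

        static void run(int clientIndex, int nClients, int totalUniversities, TripleSink sink) {
            // Disjoint 1/Nth slices which together cover every university exactly once.
            for (int u = clientIndex; u < totalUniversities; u += nClients) {
                sink.load(u); // generate university u in memory and write its triples to the federation
            }
        }

        public static void main(String[] args) {
            // e.g. 13 clients (one per data service) loading LUBM U8000
            run(0, 13, 8000, u -> System.out.println("client 0 loads university " + u));
        }
    }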

Data load is linear out to at least 1B triples. Since the system uses B+Trees, we expect log-linear performance as IOWAIT begins to dominate. We have currently identified a bottleneck in the “lock manager” and are working to remove it right now; that should dramatically increase the concurrency and throughput of the system, at which point we will try a 10B triple run. Right now, we are observing between 50k and 60k triples per second during data load out to 1B triples (depending on the size of the thread pool used by the data services).

A sample run summary is inline below. Loading 1B triples takes approximately 6 hours, and we hope to improve on that shortly. You can also see the time required to compute the eager closure of the RDFS entailments, and the net time and triples per second to load and compute the closure of the LUBM U8000 data set. I have not been tracking bytes per triple explicitly, but at one point in the run, with 821M triples loaded, the on-disk requirement was 89G across the cluster, i.e. roughly 89e9 bytes / 821e6 triples ≈ 108 bytes per triple. That might be a reasonable value during a period of heavy write activity, when a lot of data is buffered on the journals and the index partition views are not compact. However, I expect this to go down to about 50 bytes per triple when the system is at rest (after performing compacting merges on the index partitions and releasing unnecessary history).

Thanks,

-bryan

Fri Feb 20 17:12:39 GMT-05:00 2009
Load: tps=51038, ntriples=1154639857, nnew=1154639421, elapsed=22622977ms
namespace U8000b
class com.bigdata.rdf.store.ScaleOutTripleStore
indexManager com.bigdata.service.jini.JiniFederation
statementCount 1154639857
termCount 263124825
uriCount 173797593
literalCount 89327232
bnodeCount 0

Computing closure: now=Fri Feb 20 23:29:46 GMT-05:00 2009
closure: ClosureStats{mutationCount=272469320, elapsed=13235241ms}
Closure: tps=19940, ntriples=1418566071, nnew=263926214, elapsed=13235355ms
namespace U8000b
class com.bigdata.rdf.store.ScaleOutTripleStore
indexManager com.bigdata.service.jini.JiniFederation
statementCount 1418566071
termCount 263124825
uriCount 173797593
literalCount 89327232
bnodeCount 0

Net: tps=39556, ntriples=1418566071, nnew=1418565635, elapsed=35861664ms
Forcing overflow: now=Sat Feb 21 03:10:23 GMT-05:00 2009
Forced overflow: now=Sat Feb 21 03:15:10 GMT-05:00 2009
namespace U8000b
class com.bigdata.rdf.store.ScaleOutTripleStore
indexManager com.bigdata.service.jini.JiniFederation
statementCount 1340505170
termCount 263124825
uriCount 173797593
literalCount 89327232
bnodeCount 0

Very exciting results! Keep up the good work, Bryan!

  7 Responses to “a quick update from cluster land”

  1. Hi,

    sounds like I can’t stand waiting anymore :)

    Do you plan to release soon? I would like to play with it :)

    Alexandre.

  2. The results look really good! How does Bigdata perform with write operations in a cluster? Does it support concurrent transactions?

    I have been working with Sesame to add transaction isolation levels to the API, and I would be interested in getting Bigdata’s perspective on them.

  3. bigdata has always been envisioned as a distributed platform. Right now, we are working through some throughput issues on a modest-sized cluster. We plan to do a release once we have that problem nailed. You are welcome to check out the HEAD from CVS. See http://bigdata.wiki.sourceforge.net/GettingStarted for the most common issues people have when they first check out bigdata.

    Writes are transparently key-range partitioned and distributed across the cluster. The current bottleneck can be traced back to not yet having good index partitions, so one machine has to do more work than the others. The POS index is the main problem right now, so we are looking at ways to make sure that it is well partitioned early in the run.

    I’ll say a few words here about transactions. bigdata uses an MVCC model, so we retain historical commit points as long as there are open transactions which can read from those commit points. The transaction service tracks the last release time (the earliest commit time whose data are no longer required), and historical data are periodically released by the data services. There is a configuration option for the transaction service which lets you specify how many milliseconds of history you want to retain: you can set this to Long.MAX_VALUE for an immortal database, or leave it at 0 to retain only the most recent commit point.
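
    A minimal sketch of that release-time rule (hypothetical names, not bigdata’s API): the transaction service can only advance the release time up to the start of the earliest open read, less the configured milliseconds of history to retain.

        import java.util.Collection;

        // Illustrative only: how a transaction service might compute the release time.
        public class ReleaseTimeSketch {

            // minReleaseAge = 0 keeps only the most recent commit point;
            // minReleaseAge = Long.MAX_VALUE gives an immortal database (nothing is ever released).
            static long releaseTime(long now, long minReleaseAge, Collection<Long> openReadTimes) {
                long ageLimit = (minReleaseAge >= now) ? 0L : now - minReleaseAge;
                long earliestOpenRead = openReadTimes.stream().min(Long::compare).orElse(Long.MAX_VALUE);
                // Never release history which an open transaction can still read.
                return Math.min(ageLimit, earliestOpenRead) - 1;
            }
        }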

    When we do closure without isolation, we establish read-only transactions on the commit point corresponding to the start of each round of the closure operation. That prevents the data required for the closure from being released as they are superseded by newer commit points, whether on the RDF graph or on other indices in the federation.

    One model for combining bulk data load with closure without transactions is therefore to update the commit point from which queries are answered each time a new closure of the graph is obtained. We are also looking into an integration with IRIS so we can offer mixed models using query-time inference rather than eager or incrementally computed closure.
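
    A toy sketch of that model (hypothetical names): queries read against a published commit time, which is advanced only after a new closure of the graph has been committed.

        import java.util.concurrent.atomic.AtomicLong;

        // Illustrative only: publish a consistent read point for queries after each closure.
        public class QueryReadPointSketch {

            private final AtomicLong queryCommitTime = new AtomicLong();

            // Called by the load/closure workflow once a round of closure has committed.
            void publish(long closureCommitTime) {
                queryCommitTime.set(closureCommitTime);
            }

            // Queries open a read-only view as of the last published commit time.
            long readPointForQuery() {
                return queryCommitTime.get();
            }
        }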

  4. I am wondering what sort of workload bigdata is designed for. I am planning to build a website that needs a form of distributed data storage for user profiles, blogs, news, forum posts, and that sort of entry.

    Would bigdata be a good candidate?

    Ries

  5. Ries,

    In principle, yes.

    If you are going to scale out for your project, you will need a fixed addressing scheme whereby you can look up the “content” pieces from the URLs. You could use either a sparse row store or RDF for this. You would store data under the user identity (for the user) and the URL (for their blog, websites, etc). For the sparse row store, the associated data would be column values; for RDF, they would be property values. It is pretty much the same thing for this sort of application. RDF becomes the “right choice” when modeling graph data (linked nodes with property values on the nodes) and when you need high-level query over the data (using SPARQL).
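
    A small sketch contrasting the two representations for a user profile (all identifiers here are hypothetical examples):

        import java.util.Map;

        // Illustrative only: the same profile as sparse-row column values and as RDF property values.
        public class ProfileModelSketch {

            public static void main(String[] args) {
                // Sparse row store style: primary key = user identity, columns = attribute values.
                Map<String, Map<String, String>> rows = Map.of(
                        "user:ries", Map.of(
                                "name", "Ries",
                                "blog", "http://example.org/ries/blog"));

                // RDF style: the same facts as (subject, predicate, object) triples.
                String[][] triples = {
                        { "http://example.org/user/ries", "http://xmlns.com/foaf/0.1/name", "\"Ries\"" },
                        { "http://example.org/user/ries", "http://xmlns.com/foaf/0.1/weblog",
                                "http://example.org/ries/blog" },
                };

                System.out.println(rows);
                for (String[] t : triples) {
                    System.out.println(t[0] + " " + t[1] + " " + t[2]);
                }
            }
        }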

    The content pieces would have URIs (of course), and you would store those URIs as attributes in the sparse row store or the RDF DB. The content itself needs to be resolved from those URIs. BFS (the bigdata file system) is the scale-out component designed to handle this issue; however, it is not ready for deployment at this time. You can’t just stick the data in local file systems, since you would not be able to resolve it from any machine in the cluster. You could use HDFS (the Hadoop file system) for this, of course.

    We are focused right now on RDF performance. We hope to finish up the RDF scale-out performance issues over the next month or so and focus on BFS performance (the file system) a little later this year. At that point, I would say that bigdata would be an option for your system.

  6. Bryan,

    I’m curious to know what your plans are with BFS and, more importantly, what will BFS offer over Hadoop’s HDFS?

    On a related note, I notice that you have a 0.8 release on SourceForge, but it’s not clear how one would fire up a BigData node to get going. Are there any docs anywhere that detail how to get started with it? Thanks!

    Ryan-

  7. Ryan,

    Most often, databases are layered over the file system; BFS is a layer over the database architecture. As such, it has access to the same semantics for updates and failover. For example, atomic operations are defined for adding blocks to the head or tail of a block-structured file, which may be used to create queues for map/reduce programs. That said, BFS is not ready for deployment. If you want a distributed file system today, I would recommend sticking with HDFS unless there are features which you cannot get there.
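
    As a toy illustration of why those atomic head/tail block operations give you a queue (BFS itself is not modeled here; this only shows the access pattern):

        import java.util.ArrayDeque;
        import java.util.Deque;

        // Illustrative only: atomic tail-append and head-remove on a block-structured file act as a queue.
        public class BlockQueueSketch {

            private final Deque<byte[]> blocks = new ArrayDeque<>(); // stand-in for the file's block list

            // Producer: atomically append a block at the tail of the file.
            synchronized void appendBlock(byte[] block) {
                blocks.addLast(block);
            }

            // Consumer: atomically remove and return the head block, or null if the file is empty.
            synchronized byte[] takeHeadBlock() {
                return blocks.pollFirst();
            }
        }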

    Our 1.0 goal is a feature-complete, high-performance, scale-out RDF database layer with SPARQL end points. This is the one feature not found in other scale-out data architectures, and it will go a long way toward filling a gap in the capabilities of open source distributed data architectures. We will turn to service failover and a deployable BFS once that is in hand.

    Currently, we are integrating a new API for asynchronous writes on the distributed indices for better scattered write performance. We are testing on a cluster with 16 nodes, starting with ~40 partitions for each scale-out index. With high scatter, each index partition write tends to be small, which decreases throughput. To address this, we have introduced an async write API: it aggregates writes on the client onto per-index-partition queues so that the writes on the index partitions have a decent chunk size.
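
    A minimal sketch of that client-side aggregation (hypothetical names, not the actual API): writes are routed to a queue per index partition, and each queue is flushed as one reasonably sized chunk rather than as many tiny writes.

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Illustrative only: buffer scattered client writes per index partition and flush in chunks.
        public class AsyncWriteSketch {

            interface PartitionWriter { void writeChunk(int partitionId, List<byte[]> tuples); }

            private final Map<Integer, List<byte[]>> queues = new HashMap<>();
            private final int chunkSize;
            private final PartitionWriter writer;

            AsyncWriteSketch(int chunkSize, PartitionWriter writer) {
                this.chunkSize = chunkSize;
                this.writer = writer;
            }

            // Route a tuple to its partition's queue; flush once the queue holds a decent chunk.
            void write(int partitionId, byte[] tuple) {
                List<byte[]> queue = queues.computeIfAbsent(partitionId, id -> new ArrayList<>());
                queue.add(tuple);
                if (queue.size() >= chunkSize) {
                    writer.writeChunk(partitionId, new ArrayList<>(queue)); // one call per chunk, not per tuple
                    queue.clear();
                }
            }
        }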

    See the following link on the wiki for getting started with bigdata. It does not cover scale-out node deployment, but we plan to update it soon.

    http://bigdata.wiki.sourceforge.net/GettingStarted

    -bryan