Oct 282009
 

There were two excellent presentations yesterday at ISWC 2009 on using parallel techniques to materialize the RDFS closure at extremely high rates (for example, the RDFS closure of U8000 in 15 minutes). This is something that we are going to try out as soon as possible. Unlike either of the systems described in these papers, bigdata using automatic dynamic sharding based on key-ranges of the data. These techniques can be adapted by mapping the computation onto the POS index shards, bringing them to fixed point, and then reusing our high-throughput data loader to quickly relocate the entailments onto the distributed indices. There is clearly a surge in parallel and distributed algorithms for the semantic web, which is extremely exciting.

[1] Jesse Weaver, James A. Hendler. Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples, In Proceedings of the 8th International Semantic Web Conference, pp. 682–697, 2009.

[2] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen. Department of Computer Science, Vrije Universiteit Amsterdam, the Netherlands, Scalable Distributed Reasoning using MapReduce, In Proceedings of the 8th International Semantic Web Conference, 2009.

Oct 272009
 

We’ve just released a new version of bigdata. This release is capable of loading 1B triples in under one hour on a 15 node cluster and has been used to load up to 13B triples on the same cluster. JDK 1.6 is required. See [1] for instructions on installing bigdata(R), [2] for the javadoc and [3] and [4] for news, questions, and the latest developments.

Please note that we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script. You can checkout this release from the following URL:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/RELEASE_0.81b

New features:

  • Support for quads.
  • The B+Tree data record is now the same on the disk and in memory, eliminating de-serialization costs for immutable B+Tree records.
  • Shared LRU for all B+Tree instances in the same JVM. This provides competition across those B+Tree instances for RAM. The LRU is configured by default to use 10% of the JVM heap, but that value may be increased to 20% even on very heavy workloads with large heaps.
  • Parallel iterator for distributed access paths.
  • 3x improvement in distributed query performance. We have only just begun to optimize distributed query. This performance improvement is mainly due to the parallel access path iterator, the shared LRU, and the selection of a better chunk size for distributed query. We expect substantial improvements in query performance over the next several months.
  • Some query hotspot elimination.

The roadmap for the next release includes:

  • Full transactions for the SAIL.
  • Record level compression.
  • Query optimizations.

For more information, please see the following links:

[1] http://bigdata.wiki.sourceforge.net/GettingStarted
[2] http://www.bigdata.com/bigdata/docs/api/
[3] http://sourceforge.net/projects/bigdata/
[4] http://www.bigdata.com/blog

About bigdata:

Bigdata

Oct 012009
 
We’ve recently added quad support to bigdata and integrated that support with Sesame 2.x. There are now three major database modes for RDF data: plain triples, plain triples with provenance (statement identifiers), and quads. Inference is not supported yet for quads. We plan to introduce query time inference support instead of eager closure.

The support for quads is in the development branch [1], which you can check out from SVN. This branch is fairly solid and we plan to create a release from it soon [2].

[2] https://sourceforge.net/apps/trac/bigdata/roadmap