Mar 16, 2010
 

We presented at a Semantic Data Management workshop last week in Bulgaria [1,2]. This was a great event which brought together semantic web researchers and industry practitioners concerned with very large scale semantic web applications, drawn from both the traditional and semantic web database communities. There were a number of great presentations and excellent discussions throughout.

There were several interesting outcomes for us from a series of sessions on database architecture during the workshop. In summary, we believe that we can realize:

- Significant improvements in write rate and online query answering (order preserving coding of data type literals directly into the statement indices, rather than indirection through forward and reverse dictionaries; see the sketch after this list);
- 2-3 orders of magnitude improvement on relatively unselective aggregation style queries (GPU based shard-wise joins at the maximum disk transfer rate, which is inspired by the cache conscious processing of MonetDB); and
- 3 orders of magnitude improvement in the potential scale of the database architecture (exabyte scale through the decomposition of the shard locator service across the data services).
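
To make the first of these concrete, below is a minimal sketch of order preserving coding of a datatype literal (here an xsd:int) directly into an index key, so that an unsigned byte-wise comparison of keys agrees with the numeric order of the values and no dictionary round trip is needed. This is not bigdata's actual key layout; the flag byte and class name are hypothetical.

```java
import java.nio.ByteBuffer;

public class InlineIntLiteral {

    // Hypothetical flag byte marking "inline xsd:int" within the key.
    private static final byte INLINE_XSD_INT = 0x01;

    // Encode the value with its sign bit flipped so that negative values
    // sort before positive ones when keys are compared as unsigned bytes.
    public static byte[] encode(int value) {
        return ByteBuffer.allocate(5)
                .put(INLINE_XSD_INT)
                .putInt(value ^ 0x80000000) // flip the sign bit
                .array();
    }

    // Recover the value directly from the key: no reverse dictionary lookup.
    public static int decode(byte[] key) {
        return ByteBuffer.wrap(key, 1, 4).getInt() ^ 0x80000000;
    }
}
```

The sign-bit flip is the whole trick: without it, two's-complement negatives would sort after positives under unsigned comparison.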

Exciting stuff!

[1] http://www.semdata.org/
[2] http://www.semdata.org/events/2010/sofia/

Mar 10, 2010
 

Over the years I have come to realise that few ideas succeed for the reasons for which they were originally developed.

Sometimes you just have to ask the question “How did we get here?”

What makes XML a success? Is it down to a unique ability to represent structured data? I think not. It is more likely because it is a lot like HTML and a wide range of tools has grown up to parse and process the format. Why is HTML a success? Partly because of its de facto use in markup and the magical result of browsers rendering the documents, but also because it is really easy to do something really simple and then not much harder to do something a little more complicated.

So what about RDF? RDF takes a ride on XML acceptance and introduces a triple-based data model. Where XML provides hierarchical data modelling, RDF provides a simple list of triples (at least theoretically). With this simplicity comes the problem… interpretation. XML naturally encapsulates data whilst RDF enables expansion with assertion of disconnected triples. We must follow a set of rules to interpret the triples to form coherent data structures, and these rules must be understood whenever data is created or accessed.

With BigData we want to be able to store really large amounts of information with a flexible representation. A triple store provides the flexibility, but we must be careful not to create complex problems of interpretation.

My strong feeling is that too much debate has been focused on RDF syntax and its interpretation. Instead we should understand that a triple store can flexibly represent data structures, and then discover how we are able to represent the data we want to use.

So, I am arguing for a general re-appraisal of the approach to building RDF-based applications: one that essentially defocuses from RDF but is able to leverage the underlying representation when appropriate.

Is it really the case that we want to make SPARQL queries against RDF data? Of course, if we do have an underlying RDF representation then such queries could be made, but is this really the goal? When a developer is tasked with displaying a list of products purchased by a customer, do they want to make a SPARQL query, or do they just want the list of products?
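
To make that contrast concrete, here is an illustrative fragment; the types, property URIs, and query below are all hypothetical and are not part of bigdata or any published schema.

```java
import java.util.List;

public class ProductListing {

    interface Product {}

    interface Customer {
        // What the developer actually wants: the list of products.
        List<Product> getPurchasedProducts();
    }

    // What the developer is asked to write today: a hand-rolled SPARQL query
    // over the raw triples (prefix and property names are made up).
    static final String PRODUCTS_QUERY =
        "PREFIX ex: <http://example.com/schema#>\n" +
        "SELECT ?product WHERE {\n" +
        "  ?order ex:placedBy <http://example.com/customer/42> .\n" +
        "  ?order ex:hasProduct ?product .\n" +
        "}";
}
```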

Many years ago someone coined the term “impedance mismatch” to describe the problem of transforming data between the format used by a programming language and the format used to represent it externally, for example in a database. A figure I have heard repeated is that 90% of computer code is involved with data transformation. I suspect this figure is now much higher, since we no longer have to worry only about data storage transformations but also about other representations. In a recent discussion of this with respect to RDF, the phrase “mother of all impedance mismatches” was coined. We’re going to call this MAIM!

So what does this mean for BigData? Well, we are aiming to solve MAIM by providing a toolset (interfaces and metadata) that enables the easy creation of domain specific models. We will use an underlying triple representation augmented with indices to support efficient access to domain data. The flexibility of the “schema-free” triple-based representation will enable data sharing between different models/ontologies, while the domain specific metadata will resolve issues of interpretation in defined contexts.
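
As a rough sketch of what such a toolset might look like, assuming a hypothetical annotation-plus-factory style API (this is not an actual bigdata interface; every name below is made up), a domain specific model could be declared as a plain interface with predicate metadata and bound to the underlying triples by a factory:

```java
import java.net.URI;
import java.util.List;

public class DomainModelSketch {

    // Hypothetical metadata tying a domain accessor to an RDF predicate.
    @interface Predicate { String value(); }

    // The domain model the application programs against; no SPARQL in sight.
    interface Customer {
        @Predicate("http://example.com/schema#purchased")
        List<URI> getPurchasedProducts();
    }

    // Hypothetical toolset entry point: binds the interface to a subject in
    // the triple store, backed by a custom (subject, predicate) index so the
    // accessor becomes a single key-range scan rather than an ad hoc query.
    interface ModelFactory {
        <T> T bind(Class<T> domainInterface, URI subject);
    }
}
```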

A key advantage of this approach is that the triple representation and custom indices are driven by the requirement to support access patterns for the domain model. So, along with solving MAIM, we also hope to save on time spent discussing which RDF representation should be used.

Mar 5, 2010
 

Want to get paid to work on bigdata? Know someone who does? We are hiring! Here is our job listing:

Senior software engineer for parallel semantic web database query optimization.

SYSTAP, LLC is seeking a senior software engineer with experience in the Semantic Web, parallel databases, and distributed systems to help develop bigdata