Jun 07 2010

Online query workloads (see my previous post) can be quickly answered using an index. Such queries touch relatively little data on the disk. Using an index, it is possible to rapidly read only the data on the disk necessary to answer the query. In contrast, analytic queries touch large volumes of data on the disk. If you attempt to answer an analytic query using a B+Tree, the disk utilization will spike to nearly 100%. However, this is not good news. The disk is being saturated with seeks, which are relatively slow operations, and the data transfer rate will not approach the capabilities of the device. SSD offers one path around this bottleneck by reducing the latency of a disk seek operation, but SSD is much more expensive than traditional disk — especially at the scale of a data warehouse. Instead, analytic queries are answered by sustained IO operations which read more data than you really need, but transfer that data at or near the bandwidth capabilities of the device. Rather than using an index to select just the relevant B+Tree leaves, irrelevant data are filtered out once they reach main memory.
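
To put rough numbers on the seek-versus-scan tradeoff, here is a back-of-the-envelope sketch in Java. The seek time, transfer rate, and leaf size are assumptions chosen for illustration, not measurements of any particular device.

    // Back-of-the-envelope comparison of random B+Tree leaf reads vs a
    // sustained sequential scan. The disk numbers are assumptions, not
    // measurements: ~10ms per seek and ~100 MB/s sustained transfer.
    public class SeekVsScan {
        public static void main(String[] args) {
            final double seekMillis = 10.0;   // assumed average seek + rotational delay
            final double transferMBs = 100.0; // assumed sustained transfer rate
            final double leafKB = 8.0;        // assumed B+Tree leaf size

            // Random access: one seek per leaf read.
            final double leavesPerSec = 1000.0 / seekMillis;
            final double randomMBs = leavesPerSec * leafKB / 1024.0;

            System.out.printf("random leaf reads : ~%.1f MB/s%n", randomMBs);   // ~0.8 MB/s
            System.out.printf("sequential scan   : ~%.1f MB/s%n", transferMBs); // ~100 MB/s
        }
    }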

Column stores (MonetDB, C-Store) offer a different approach to analytic queries – one designed to maximize the disk transfer rate. Rather than using a B+Tree index and suffering seek saturation, the physical organization of the data on the disk is a projection of a single column from the original table. By design, these projections are narrow (often a single integer or floating point field) and dense (the index of a tuple in the projection is its insert order into the corresponding table). With this layout on the disk, a column store can rapidly read all tuples for some column in order to select an interesting subset or aggregate some attribute. Once the data transfer rate approaches the maximum transfer rate for the disk, the new bottleneck becomes the CPU cache (for example, see Database Architecture Optimized for the New Bottleneck: Memory Access). As a result, column stores tend to use cache-aware algorithms, such as the radix sort. Cache-aware algorithms are designed to minimize the likelihood of a CPU cache miss, and thereby maximize the ability of the CPU to process the data coming off the disk. While column stores excel at analytic workloads, they are not well adapted for either online query (selective queries are much easier to answer with an index) or online updates (which pose challenges for the maintenance of the column projections). Ongoing research seeks to address these concerns.
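
Here is a minimal sketch of the dense column projection idea, using an in-memory array as a stand-in for a column file; it is not MonetDB or C-Store code, and the data is synthetic. The point is simply that an aggregate over a dense, narrow projection is a tight sequential loop.

    // A minimal sketch of why a dense column projection scans so fast: the
    // values for one attribute are contiguous, so an aggregate is a tight,
    // cache-friendly loop over an array. The data here is synthetic.
    public class ColumnScan {
        public static void main(String[] args) {
            final int n = 10_000_000;
            // Column projection: the i-th entry is the value for the i-th tuple
            // in insert order, so no tuple identifiers need to be stored.
            final int[] price = new int[n];
            for (int i = 0; i < n; i++) {
                price[i] = i % 100; // synthetic values
            }
            long sum = 0L;
            for (int i = 0; i < n; i++) { // sequential scan: no seeks, few cache misses
                sum += price[i];
            }
            System.out.println("sum(price) = " + sum);
        }
    }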

The history of database architectures evolved from the initial debates between advocates for the relational (Codd) and network (CODASYL) database models. While there were many arguments concerning the merits of these approaches, including whether they support declarative query languages and decoupling of the logical schema from the physical schema, the issue was settled more by market forces than by the merits of those arguments, and the relational model became the dominant paradigm. Since that time, object-relational and XML databases have appeared and have been incorporated by various database vendors into their products.

During the last decade, XML databases, column stores, map/reduce, distributed file systems, cloud computing patterns, distributed cache fabrics, key-value stores, commodity hardware, GPS, GPUs, SSD, the semantic web, linked data, social networking, and a slew of other hardware, software and social developments have emerged and created a game-changing landscape. The database is in the process of being deconstructed as a technology. This is a tremendous opportunity.

RDF (or graph) databases are relatively new and, like XML databases, pose new technical challenges. YARS modeled RDF using covering indices for the different access patterns. For triples, these are SPO, OSP, and POS. For quads, there are six such indices. This design works well for online query workloads. But, as I pointed out above, analytic workloads will saturate the disk with seeks when using a B+Tree index. However, there are different ways to get around this problem. YARS2 uses a combination of bulk data load, bulk sort, and ISAM files to allow fast sequential access to the data on the disk. MonetDB is examining a variety of solutions based on column projections. Map/reduce typically uses unindexed data and just brute-forces its way through, reading all the data and pushing the bits selected by the query onto the reduce nodes. I outline the approach bigdata uses below.
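
The following sketch illustrates the covering-index idea: each triple is written under three key orders so that any access pattern with a bound prefix can be answered by a key-range scan. The term identifiers and key representation are illustrative only; they are not the actual YARS or bigdata key encodings.

    import java.util.Arrays;

    // A sketch of covering indices for triples: each statement is written
    // into three B+Tree indices whose keys are permutations of (S, P, O),
    // so any bound prefix maps to a contiguous key range in one of them.
    public class CoveringIndexKeys {

        static long[] spo(long s, long p, long o) { return new long[] { s, p, o }; }
        static long[] pos(long s, long p, long o) { return new long[] { p, o, s }; }
        static long[] osp(long s, long p, long o) { return new long[] { o, s, p }; }

        public static void main(String[] args) {
            // A triple whose terms have been replaced by dictionary-assigned ids (assumed).
            final long s = 42L, p = 7L, o = 1001L;
            System.out.println("SPO key: " + Arrays.toString(spo(s, p, o)));
            System.out.println("POS key: " + Arrays.toString(pos(s, p, o)));
            System.out.println("OSP key: " + Arrays.toString(osp(s, p, o)));
        }
    }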

Before jumping into the bigdata approach to unselective queries, let me point out that while RDF “tuples” are “narrow”, the physical schema for RDF data with covering indices still interleaves all tuples for the same logical row. Consider the SPO index: all predicate-object combinations for a given subject are clustered together. This is not what a column projection does. One way to model RDF data in a column store is to use a physical schema with one column per predicate. Running down a column projection of a predicate would provide rapid access to all distinct object bindings for that predicate. However, this physical design can lead to significant problems for column store query optimizers. For example, see Column-Store Support for RDF Data Management: Not All Swans Are White. Note, however, that column store publications tend to study unselective queries, and the existing studies do not compare the strengths and weaknesses of “native” triple stores against RDBMS and column store approaches for both unselective and selective queries.
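
For contrast, here is a sketch of the “one column per predicate” physical design mentioned above, with in-memory maps standing in for on-disk column files. The identifiers and layout are hypothetical and do not correspond to any particular column store.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A sketch of the "one column per predicate" design for RDF: each
    // predicate gets its own (subject, object) projection, so scanning all
    // bindings for a predicate is one sequential pass over one list.
    public class PredicateColumns {

        // predicate id -> list of (subject, object) pairs for that predicate
        static final Map<Long, List<long[]>> columns = new HashMap<>();

        static void insert(long s, long p, long o) {
            columns.computeIfAbsent(p, k -> new ArrayList<>()).add(new long[] { s, o });
        }

        public static void main(String[] args) {
            insert(1, 100, 7);
            insert(2, 100, 8);
            insert(1, 200, 9);
            // All object bindings for predicate 100, read with one sequential pass.
            for (long[] so : columns.get(100L)) {
                System.out.println("s=" + so[0] + " o=" + so[1]);
            }
        }
    }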

Before we look at how bigdata handles unselective queries, let me provide some background on how bigdata manages its indices. When running on a single Journal, the backing persistence store for bigdata can be huge — terabytes of data can be stored in a single file. However, the scale-out architecture uses the Journal as an ACID write buffer, storing only the last 200MB of writes accepted by a given node. Once the current journal fills up, a fast synchronous overflow operation atomically closes out that journal for writes and opens a new journal. Asynchronous overflow processing then transfers the data from the old journal onto index segment files using batch B+Tree builds. Each shard consists of the recently buffered writes on the live journal and zero or more index segment files from batch builds on old journals. When the shard view gets complex, a compacting merge will replace the view with a single index segment file on the disk. When that file gets to be 200MB on the disk, the shard will be split. In the scale-out architecture, we never have files larger than 200MB.
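
The following is a highly simplified sketch of that overflow cycle: once the live journal passes the size threshold it is closed for writes, a new journal is opened, and a background task migrates the buffered writes into index segment files. The class and method names are hypothetical, not the actual bigdata API.

    // A simplified sketch of synchronous vs asynchronous overflow. The
    // Journal class and method names here are stand-ins, not bigdata's API.
    public class OverflowSketch {

        static final long THRESHOLD = 200L * 1024 * 1024; // ~200MB write buffer

        /** Stand-in for the live journal: just tracks how many bytes it has buffered. */
        static class Journal {
            long bytesWritten;
            boolean openForWrites = true;
        }

        /** Stand-in for the batch B+Tree build onto immutable index segment files. */
        static void buildIndexSegments(Journal oldJournal) {
            System.out.println("building index segments from " + oldJournal.bytesWritten + " bytes");
        }

        /** Returns the journal that should receive subsequent writes. */
        static Journal maybeOverflow(Journal live) {
            if (live.bytesWritten < THRESHOLD) {
                return live;                    // keep buffering writes on the current journal
            }
            live.openForWrites = false;         // fast synchronous overflow: atomic close for writes
            Journal fresh = new Journal();      // new journal absorbs subsequent writes
            new Thread(() -> buildIndexSegments(live)).start(); // slow asynchronous overflow
            return fresh;
        }

        public static void main(String[] args) {
            Journal live = new Journal();
            live.bytesWritten = THRESHOLD + 1;  // pretend the journal just filled up
            Journal current = maybeOverflow(live);
            System.out.println("switched journals: " + (current != live));
        }
    }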

Index segment files are laid out optimally for either local or remote IO. The B+Tree nodes are in one region of the file in total key order and are typically read into memory with a single IO when the index segment file is opened. Likewise, the B+Tree leaves are in another region of the file in total key order and are arranged into a doubly-linked list for sequential traversal. However, sequential traversal of the leaves using the doubly-linked list is significantly slower than a sustained IO that reads the leaves region directly into memory.
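
Here is a sketch of reading the contiguous leaves region of an index segment with a single sustained IO rather than chasing the linked list leaf by leaf. The file name, offset, and extent are placeholders; in practice they would come from the index segment's metadata record.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Read the entire leaves region of an index segment file into memory
    // with one sustained, sequential transfer. Offsets are placeholders.
    public class ReadLeavesRegion {

        public static void main(String[] args) throws IOException {
            final String file = "shard0001.seg";       // hypothetical index segment file
            final long leavesOffset = 4096L;           // placeholder: start of the leaves region
            final int leavesExtent = 64 * 1024 * 1024; // placeholder: size of the leaves region

            try (RandomAccessFile raf = new RandomAccessFile(file, "r");
                 FileChannel ch = raf.getChannel()) {
                final ByteBuffer leaves = ByteBuffer.allocateDirect(leavesExtent);
                long pos = leavesOffset;
                while (leaves.hasRemaining()) {         // one sustained, sequential transfer
                    final int n = ch.read(leaves, pos); // positional read of the next chunk
                    if (n < 0) break;                   // end of file
                    pos += n;
                }
                leaves.flip();
                // The leaves are now in memory in total key order and can be
                // scanned without touching the disk again.
            }
        }
    }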

Bigdata handles unselective queries by reading the leaves for each shard view using a single IO per index segment in that shard view. That will range from 1 to 4 IOs per shard view, depending on how long it has been since the last compacting merge for that view, but it is never more than 200MB of data for a shard. All told, this is just a few seconds of IO. An ordered B+Tree read over the same shard is at least an order of magnitude slower, and random B+Tree reads are at least another order of magnitude slower still.

If the query is selective, then we navigate down the B+Tree and read just the leaves that we need. If the query is unselective, then we maximize the disk transfer rate. By using sustained IOs, this approach avoids the disk access time bottleneck and lets us step through the data at the disk transfer rate. However, since the data access pattern is now radically different, we need to use different join operators as well. For example, many column stores use operator-at-a-time evaluation rather than vectored, pipelined joins. A few, such as VectorWise, use vectored joins. Look for some new join operators in bigdata over the next few months as we move to support analytic query workloads for RDF along with support for SPARQL 1.1, which includes aggregation operators.
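
As a taste of what a vectored join looks like, here is a minimal chunk-at-a-time merge join: the operator takes a chunk of bindings, sorts it once, and merges it against ordered data scanned from an index. This is an illustration of the general technique, not the bigdata join operator.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // A minimal vectored (chunk-at-a-time) join: instead of probing the
    // access path once per binding, a whole chunk of bindings is sorted and
    // merged against ordered data read from the index. The arrays stand in
    // for data read off the disk.
    public class VectoredJoinSketch {

        /** Merge-join a sorted chunk of ids against a sorted run of ids from an index. */
        static List<Long> join(long[] chunk, long[] indexRun) {
            final List<Long> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < chunk.length && j < indexRun.length) {
                if (chunk[i] < indexRun[j]) i++;
                else if (chunk[i] > indexRun[j]) j++;
                else { out.add(chunk[i]); i++; j++; }
            }
            return out;
        }

        public static void main(String[] args) {
            final long[] chunk = { 3, 7, 11, 19 };          // bindings from the upstream operator
            final long[] indexRun = { 2, 3, 5, 7, 13, 19 }; // ordered ids scanned from an index
            Arrays.sort(chunk);                             // sort once per chunk, not per probe
            System.out.println(join(chunk, indexRun));      // [3, 7, 19]
        }
    }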

Jun 07 2010

Online query workloads touch a relatively small region of the data on the disk. Such queries are said to be selective. Selective queries can be answered very quickly if an appropriate index is available. The B+Tree's performance is logarithmic in the size of the data: as the B+Tree grows deeper, each new layer adds another disk seek, and disk access latency bounds the performance of the index. A cache avoids some portion of those disk hits, whether for recently used records (LRU) or frequently used records (LIRS). Bigdata has a non-blocking LRU cache implementation now, and a non-blocking LIRS cache in the works. Non-blocking caches may be used to buffer the B+Tree nodes and leaves read from the disk, to cache frequently used RDF Values, to cache query results, etc.
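
For readers unfamiliar with these cache policies, here is a toy LRU cache for B+Tree records built on LinkedHashMap's access-order mode. The production caches mentioned above are non-blocking and considerably more sophisticated; this only illustrates the eviction behavior, and the record type is a placeholder.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A toy LRU cache for B+Tree node/leaf records. LinkedHashMap in
    // access-order mode keeps the least recently used entry eldest, so
    // eviction is a one-line override.
    public class LruRecordCache<K, V> extends LinkedHashMap<K, V> {

        private final int capacity;

        public LruRecordCache(int capacity) {
            super(16, 0.75f, true /* access order, i.e. LRU */);
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the least recently used record
        }

        public static void main(String[] args) {
            final LruRecordCache<Long, byte[]> cache = new LruRecordCache<>(2);
            cache.put(1L, new byte[8192]); // leaf address -> leaf record
            cache.put(2L, new byte[8192]);
            cache.get(1L);                 // touch 1 so that 2 becomes eldest
            cache.put(3L, new byte[8192]); // evicts 2
            System.out.println(cache.keySet()); // [1, 3]
        }
    }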

Once an online query workload becomes sufficiently varied that the read set of the database exceeds the cache, index reads begin to read through to the disk. At that point, CPU utilization falls off dramatically as cores are basically waiting around for records to be read off of the disk. Once this occurs, additional CPU activities, such as block compression, block encryption, etc., become basically “free” — their cost disappears into the disk latency of a concurrent online query mixture.

NIO made it possible to write Java applications that can handle 10,000 concurrent network connections using asynchronous network IO patterns. NIO2 will do for the disk what NIO did for network IO. Using asynchronous IO patterns (based on either Futures or callbacks), a single thread will be able to service the disk, which will dramatically reduce the resource demand for heavy online query mixes. NIO2 is coming with Java 7 and should offer tremendous benefits for the RWStore, which uses scatter/gather IO, and for the vectored pipeline join algorithm.
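
Here is a small example of the asynchronous file IO pattern that NIO2 introduces (Java 7+): the read is issued without blocking the calling thread and the result is collected through a Future. The file name and offset are placeholders.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;

    // The NIO2 asynchronous file IO pattern: issue a read, keep working,
    // and collect the result later via a Future.
    public class AsyncReadExample {

        public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
            try (AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get("store.jnl"), StandardOpenOption.READ)) {

                final ByteBuffer buf = ByteBuffer.allocate(8192);
                final Future<Integer> pending = ch.read(buf, 0L); // returns immediately

                // ... the calling thread is free to issue more reads or do other work ...

                final int bytesRead = pending.get(); // wait for this particular read to complete
                System.out.println("read " + bytesRead + " bytes");
            }
        }
    }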

When managing a large heap (4G+), the choice of the garbage collection policy becomes critical. Both the parallel old generation and the experimental G1 garbage collectors work well for bigdata with large heaps (G1 has better throughput, but is still experimental and crashes periodically, so it cannot be reliably deployed yet). However, if you choose the wrong garbage collector, garbage collection can wind up consuming 80% or more of your CPU time! Right now, the safest choice is the parallel old generation garbage collector, which is enabled using -XX:+UseParallelOldGC on the Java command line.

As main memory heaps expand, this interaction of garbage collectors and application memory access profiles has a lot of implications for the ability of a Java application to utilize very large heaps without running into significant GC pauses. While the G1 collector promises to address this issue for many applications, another alternative is to explicitly manage the cache on the native heap using the same record management algorithms we use in the RWStore to manage allocation slots on the disk. However, accessing data on the native heap in a direct ByteBuffer is slower than accessing the data in a Java byte[], so the only reason to pursue this approach for the B+Tree node and leaf cache is to deploy a smarter cache, such as LIRS, which is insensitive to index scans and other access patterns that cause problems for an LRU algorithm. To derive the most benefit from the LIRS cache, we then have to turn off the file system cache for that disk or file. Doing this is, of course, tremendously platform dependent.
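
Here is a sketch of the native-heap caching idea: one direct ByteBuffer carved into fixed-size slots, so the cached records are invisible to the garbage collector. A real allocator (such as the RWStore's slot management) handles variable record sizes, recycling, and concurrency; the slot size and capacity here are arbitrary.

    import java.nio.ByteBuffer;

    // Cache records on the native heap: a single direct ByteBuffer carved
    // into fixed-size slots, so the cached bytes are not scanned by GC.
    public class NativeHeapCacheSketch {

        static final int SLOT_SIZE = 8 * 1024; // assumed record slot size
        static final int NSLOTS = 1024;        // assumed number of slots (8MB total)

        final ByteBuffer region = ByteBuffer.allocateDirect(SLOT_SIZE * NSLOTS);
        int nextSlot = 0;

        /** Copy a record into the next free slot and return the slot index. */
        int put(byte[] record) {
            final int slot = nextSlot++;
            final ByteBuffer dup = region.duplicate();
            dup.position(slot * SLOT_SIZE);
            dup.put(record, 0, Math.min(record.length, SLOT_SIZE));
            return slot;
        }

        /** Copy a record back onto the Java heap (the slower step noted above). */
        byte[] get(int slot, int length) {
            final byte[] out = new byte[length];
            final ByteBuffer dup = region.duplicate();
            dup.position(slot * SLOT_SIZE);
            dup.get(out);
            return out;
        }

        public static void main(String[] args) {
            final NativeHeapCacheSketch cache = new NativeHeapCacheSketch();
            final byte[] record = "a serialized B+Tree leaf".getBytes();
            final int slot = cache.put(record);
            System.out.println(new String(cache.get(slot, record.length)));
        }
    }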

SSD reduces the disk access time by as much as an order of magnitude when compared to traditional disk. When running on SSD with an online query workload, performance drops off much more slowly as the B+Tree begins to read through the cache to the disk. This translates directly into increased database performance and higher query throughput. Enterprise grade SSD is still relatively expensive, but it is cheaper than RAM and available in capacities of 1TB or more. (These arguments apply to the bigdata Journal — SSD has a different role in the bigdata federation, which I will cover in another article.)

In contrast to online query workloads, analytic query workloads are relatively unselective — that is, they have very large working sets on the disk. For such unselective queries, we are often better off reading all the data off the disk using sustained disk transfers that maximize the disk transfer rate rather than saturating the disk with random access reads. For my next subject, I will cover analytic query workloads and show how the bigdata federation handles both online query and analytic workloads in a single architecture.