Apr 15, 2009
 

I have finished an implementation of a stream-based approach for writing on the scale-out indices. It does the right things: it defers the RMI (the index partition write) until it has a good-sized chunk of data, and it allows only a single outstanding RMI per client per index partition. I still need to test this and integrate it into the RDF bulk data loader in a few places. After that I can start collecting performance data on it.
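
To make the buffering idea concrete, here is a minimal sketch of a per-partition sink. It is not the actual implementation: the class name `PartitionSink`, the `chunkSize` parameter, and the `remoteWrite` callback (a stand-in for the RMI) are all hypothetical. It only illustrates the two properties described above, namely that nothing is sent until a chunk fills up, and that each remote call is chained behind the previous one so at most one is in flight for the partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical per-partition sink: buffers writes, at most one remote call in flight. */
final class PartitionSink<T> {

    private final int chunkSize;

    /** Stand-in for the RMI on the index partition (an assumption for this sketch). */
    private final Function<List<T>, CompletableFuture<Void>> remoteWrite;

    private List<T> buffer = new ArrayList<>();

    /** The most recently scheduled remote call; starts out already complete. */
    private CompletableFuture<Void> outstanding = CompletableFuture.completedFuture(null);

    PartitionSink(int chunkSize, Function<List<T>, CompletableFuture<Void>> remoteWrite) {
        this.chunkSize = chunkSize;
        this.remoteWrite = remoteWrite;
    }

    /** Buffer one element; schedule a remote write only once a full chunk is available. */
    synchronized void add(T item) {
        buffer.add(item);
        if (buffer.size() >= chunkSize) {
            flush();
        }
    }

    /** Hand the current buffer to the remote side, queued behind the previous call. */
    synchronized CompletableFuture<Void> flush() {
        if (buffer.isEmpty()) {
            return outstanding;
        }
        final List<T> chunk = buffer;
        buffer = new ArrayList<>();
        // thenCompose starts the next call only after the previous one has completed,
        // so there is never more than one outstanding call for this partition.
        outstanding = outstanding.thenCompose(ignored -> remoteWrite.apply(chunk));
        return outstanding;
    }

    /** Flush whatever is left; the returned future completes when all writes are done. */
    CompletableFuture<Void> close() {
        return flush();
    }
}
```

A client would keep one such sink per index partition, call `add` for each tuple, and wait on the future returned by `close()` at the end. Note that this sketch lets pending chunks queue up without bound if the remote side is slow; a real sink would presumably block the producer or merge chunks instead.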

I think that the TERM2ID index will still need to use synchronous RPC index writes, since we need the assigned term identifiers before we can write on the rest of the indices. Likewise, if statement identifiers are enabled, then there will be another synchronous RPC (also on the TERM2ID index) to obtain the statement identifiers. Once the synchronous RPC on the TERM2ID index is done, the terms and statements can just be written onto asynchronous sinks for the ID2TERM, TEXT (full text lookup), SPO, POS, and OSP indices. When the bulk load finishes, it will just await the Futures for those asynchronous sinks.
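
The ordering described here can be sketched as follows. This is only an illustration under stated assumptions: the class `BulkLoadSketch` and its in-memory maps and sets are hypothetical stand-ins for the real indices, `CompletableFuture.runAsync` stands in for the asynchronous sinks, and the TEXT index and statement identifiers are left out. The point is the shape of the flow: block on TERM2ID to get identifiers, hand everything else to asynchronous writes, and wait on the futures at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory model of the load flow; not the real index API. */
final class BulkLoadSketch {

    final Map<String, Long> term2id = new ConcurrentHashMap<>();
    final Map<Long, String> id2term = new ConcurrentHashMap<>();
    final Set<String> spo = ConcurrentHashMap.newKeySet();
    final Set<String> pos = ConcurrentHashMap.newKeySet();
    final Set<String> osp = ConcurrentHashMap.newKeySet();

    private long nextId = 1;

    /** Synchronous step: the caller blocks until the term identifier is known. */
    synchronized long resolve(String term) {
        Long id = term2id.get(term);
        if (id == null) {
            id = nextId++;
            term2id.put(term, id);
        }
        return id;
    }

    /** Load triples given as {subject, predicate, object} string arrays. */
    void load(List<String[]> triples) {
        final List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String[] t : triples) {
            // 1. Synchronous: identifiers must exist before any other index is written.
            final long s = resolve(t[0]);
            final long p = resolve(t[1]);
            final long o = resolve(t[2]);
            // 2. Asynchronous: the remaining indices do not need to be awaited here.
            pending.add(CompletableFuture.runAsync(() -> {
                id2term.put(s, t[0]);
                id2term.put(p, t[1]);
                id2term.put(o, t[2]);
            }));
            pending.add(CompletableFuture.runAsync(() -> spo.add(s + ":" + p + ":" + o)));
            pending.add(CompletableFuture.runAsync(() -> pos.add(p + ":" + o + ":" + s)));
            pending.add(CompletableFuture.runAsync(() -> osp.add(o + ":" + s + ":" + p)));
        }
        // 3. End of the bulk load: await the futures for all asynchronous writes.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }
}
```

In a real loader the synchronous step would be batched over many terms per RPC rather than issued one term at a time, and each asynchronous write would go through a buffering sink rather than its own task.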

More when I know more.