With the release of Apache Accumulo 1.6.0, an extremely versatile new feature was added: Conditional Mutations. This new style of mutation enables atomically altering a row based on the current state of the row. One or more conditions may be set on the existence of columns or equality of value for a given column. This functionality may be extended by specifying iterators to be applied prior to evaluating the conditions.
A common use case that could take advantage of Conditional Mutations is indexing a new domain object for an application into Accumulo. Imagine an online indexing scenario, where a ‘document’ with a unique id, miscellaneous data fields, and visibilities should be added to a table. In our API, we state that if the user tries to index a document with a UID that already exists, an exception will be thrown.
In Previous Versions
In previous versions of Accumulo, the indexing code would first have to create a scanner over the document UID’s range and attempt to iterate. If any key/value pairs are returned, the document already exists, and an exception should be thrown. If not, indexing using a traditional BatchWriter may proceed.
There are two issues with this approach:
- Scanning is slow. When doing high-volume indexing, it is not efficient to be constantly launching Scanners to check only for the existence of a key/value pair. The overhead of creating the scanner and network traffic dramatically reduces ingest throughput.
- Scanning then writing is not atomic. With multiple threads, or completely separate indexer tasks running in different JVMs, another process may create the same document between the scan and write steps. This can be worked around by implementing a distributed locking mechanism (using, for example, ZooKeeper).
Although both of the above issues are surmountable, it is highly desirable to have server-executed atomic operations at the row level, conditioned on the current state of the row.
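The second issue can be illustrated with a small in-memory model. This is only a sketch: the class ScanThenWriteModel and its methods are hypothetical stand-ins built on a plain Java map, not Accumulo API calls. It replays one unlucky interleaving of two indexers deterministically, then shows what a single atomic check-and-write would do instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a table keyed by document UID.
// This is NOT the Accumulo API; it only models the check-then-write race.
class ScanThenWriteModel {
    final Map<String, String> table = new ConcurrentHashMap<>();

    // Step 1 of the old approach: "scan" for the UID.
    boolean exists(String uid) {
        return table.containsKey(uid);
    }

    // Step 2 of the old approach: an unconditional write.
    void write(String uid, String doc) {
        table.put(uid, doc);
    }

    // Atomic alternative: the check and the write are one operation.
    // Returns true if the document was created, false if the UID existed.
    boolean writeIfAbsent(String uid, String doc) {
        return table.putIfAbsent(uid, doc) == null;
    }

    public static void main(String[] args) {
        ScanThenWriteModel racy = new ScanThenWriteModel();
        // Unlucky interleaving: both indexers scan before either writes.
        boolean aSawExisting = racy.exists("12345");
        boolean bSawExisting = racy.exists("12345");
        if (!aSawExisting) racy.write("12345", "from A");
        if (!bSawExisting) racy.write("12345", "from B"); // overwrites A, no error
        System.out.println(racy.table.get("12345"));      // from B

        ScanThenWriteModel atomic = new ScanThenWriteModel();
        System.out.println(atomic.writeIfAbsent("12345", "from A")); // true
        System.out.println(atomic.writeIfAbsent("12345", "from B")); // false
    }
}
```

In the racy run neither indexer sees an exception, yet one document is silently replaced. In the atomic run the second indexer is told the UID already exists.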
In Accumulo 1.6.0
The ConditionalWriter and its associated ConditionalMutation and Condition objects allow a ConditionalMutation to be applied only when one or more Conditions are met. A Condition checks either for the absence of a column or for the equality of its value.
For the use case described above, the new feature makes the desired behavior trivial to implement. Assume that a document is indexed with its unique id as the row key. There are two column families: ‘meta’, which may hold any number of metadata key/value pairs, and ‘docId’, which simply repeats the document id in the column qualifier, as follows:
[ row [ cf : cq : cv ]] value
[ 12345 [ docId : 12345 : A&B&C ]] ''
[ 12345 [ meta : meta1 : A&B&C ]] 'v1'
[ 12345 [ meta : meta2 : A&B&C ]] 'v2'
[ 12345 [ meta : meta3 : A&B&C ]] 'v3'
In this example, we can condition the creation of a row (and therefore the document) on the non-existence of the ‘docId’ key (i.e. [ 12345 [ docId : 12345 : A&B&C ]] ''). When a Condition’s setValue() method has been called, the condition passes if and only if that column exists and its value is equal to the value set by setValue(). If setValue() is not called, the Condition passes if and only if the specified column does not exist.
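The two pass/fail rules can be written out as a small in-memory model. This is a sketch only: ConditionModel, passes and mutateIf are hypothetical names, an empty Optional stands in for "setValue() was never called", and the model is single-threaded, whereas the real check and write are performed atomically on the server.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of a single row; NOT the Accumulo Condition class.
class ConditionModel {
    // "family:qualifier" -> value for one row. Values are never null here.
    final Map<String, String> row = new HashMap<>();

    // expected is empty   => passes only if the column is absent.
    // expected is present => passes only if the column exists with that exact value.
    boolean passes(String column, Optional<String> expected) {
        String current = row.get(column);
        if (!expected.isPresent()) {
            return current == null;
        }
        return expected.get().equals(current);
    }

    // Apply the updates only when the condition passes; returns whether they were applied.
    boolean mutateIf(String column, Optional<String> expected, Map<String, String> updates) {
        if (!passes(column, expected)) {
            return false;
        }
        row.putAll(updates);
        return true;
    }

    public static void main(String[] args) {
        ConditionModel doc12345 = new ConditionModel();
        Map<String, String> updates = new HashMap<>();
        updates.put("docId:12345", "");
        updates.put("meta:meta1", "v1");

        // First attempt: the docId column is absent, so the document is created.
        System.out.println(doc12345.mutateIf("docId:12345", Optional.empty(), updates)); // true
        // Second attempt: the docId column now exists, so the mutation is rejected.
        System.out.println(doc12345.mutateIf("docId:12345", Optional.empty(), updates)); // false
    }
}
```

A rejected mutation is the point at which the indexing API from the example would throw its "document already exists" exception.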
The complexity and utility of the equality and existence checks may be increased by configuring scan-time iterators used solely for the condition evaluation. When checking for equality, for example, an iterator may be defined such that values less than some number map to 0, and others map to 1. This iterator could now be used to apply a mutation if and only if the value of a given column in the row is less than or greater than some arbitrary value.
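The threshold idea can be modeled without any Accumulo classes. In the sketch below, a plain function plays the role of the condition-time iterator: it rewrites the stored value to "0" or "1" before the equality check runs. The class ThresholdConditionModel, its method names, and the assumption that values are decimal integers stored as strings are all illustrative choices, not part of the Accumulo API.

```java
import java.util.function.UnaryOperator;

// Hypothetical model of "apply an iterator, then test equality".
// NOT an Accumulo iterator; values are assumed to be decimal integers as strings.
class ThresholdConditionModel {
    // Stand-in for the iterator: values below the threshold map to "0", all others to "1".
    static UnaryOperator<String> belowThreshold(long threshold) {
        return v -> Long.parseLong(v) < threshold ? "0" : "1";
    }

    // Equality check performed on the transformed value rather than the stored one.
    static boolean passes(String storedValue, UnaryOperator<String> iterator, String expected) {
        if (storedValue == null) {
            return false; // column absent: an equality condition cannot pass
        }
        return expected.equals(iterator.apply(storedValue));
    }

    public static void main(String[] args) {
        UnaryOperator<String> lessThan100 = belowThreshold(100);
        // Expecting "0" means "apply the mutation only if the value is below 100".
        System.out.println(passes("42", lessThan100, "0"));  // true
        System.out.println(passes("250", lessThan100, "0")); // false
        // Expecting "1" means "apply only if the value is 100 or more".
        System.out.println(passes("250", lessThan100, "1")); // true
    }
}
```

The condition itself is still a plain equality test; only the value it sees has changed, which is why a single mechanism can express range checks.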
Continuing our indexing example, we may not know a priori what the visibility of any (potential) pre-existing document may be. To account for this, a custom iterator is supplied which transforms all keys’ column visibility to the default visibility by extending TransformingIterator. See this gist for an example implementation of the conditional writer enforcing unique documents.
The new Conditional Mutation feature introduced in Accumulo 1.6.0 resolves two major indexing issues in practical, high-volume Accumulo applications. With it, mutations can be checked for uniqueness both atomically and quickly, which was difficult or impossible in previous versions of Accumulo. Further, the custom iterator stack for evaluating conditions allows complex conditions to be checked before a mutation is applied to a row. This feature is central to the current development of Accismus, which is based on Percolator, a novel system for large-scale incremental processing. Many other opportunities exist for leveraging this feature.
Send me your ideas for Conditional Mutation uses on Twitter for a follow-up tutorial.