Flask & Mongo for Wedding RSVPs

07 Feb 2015 »

One of the few things I’ve been able to help my fiancée with during wedding planning has been the wedding website.

Being a somewhat technological couple, we decided to forgo paper RSVP response cards and instead ask that our guests respond online. Some services, like RSVPify, exist for just this purpose, but what’s the fun in that? Rolling my own allowed for very tight coupling with the user experience of our wedding website.

The GitHub repo has the full Flask application. Currently, summary statistics of the current RSVP status are also supported. The static content of the website is all hosted with GitHub Pages, and the RSVP functionality interacts with a REST interface defined in Flask, hosted on DigitalOcean.

I found a nifty shell script by Andrea Fabrizi called dropbox-uploader.sh which, as the name implies, uses the Dropbox API to upload files given as arguments. Using it, I was able to automate MongoDB backups of the latest RSVP data with the following script:


#!/bin/bash
# Dump the RSVP database, compress the dump, and upload the archive to Dropbox.
DATE=$(date +%Y-%m-%d-%H-%M)

pushd /tmp
mongodump -d jr_web -o jr-mongo-$DATE
tar pczf jr-mongo-$DATE.tar.gz jr-mongo-$DATE
rm -rf jr-mongo-$DATE
/home/db/dropbox_uploader.sh upload jr-mongo-$DATE.tar.gz jr-mongo-$DATE.tar.gz
rm -f jr-mongo-$DATE.tar.gz
popd

With this in place, I simply use cron to upload a backup every few hours. This lets me test in production without worrying too much about making any fatal mistakes.
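For reference, a crontab entry along these lines runs the backup every four hours (the script path here is illustrative; substitute wherever you saved the commands above):

```shell
# m  h    dom mon dow  command
0  */4  *   *   *    /home/db/backup_mongo.sh
```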

Random Projections for Large-Scale Speaker Search

20 Oct 2014 »

Please check out my latest paper, published in the Proceedings of Interspeech 2014. The full paper is linked below.

This paper describes a system for indexing acoustic feature vectors for large-scale speaker search using random projections. Given one or more target feature vectors, large-scale speaker search returns similar vectors (in a nearest-neighbors fashion) in sublinear time. The speaker feature space consists of i-vectors, derived from Gaussian Mixture Model supervectors. The index and search algorithm are derived from locality-sensitive hashing, with novel approaches to neighboring-bin approximation for improving the miss rate at specified false-alarm thresholds. The distance metric for determining the similarity between vectors is the cosine distance. This approach reduced the search space by 70% with minimal increase in miss rate; when combined with further dimensionality reduction, a reduction of the search space by over 90% is also possible. All experiments are based on the NIST SRE 2010 evaluation.
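The core random-projection idea can be sketched in a few lines of Python (this is an illustrative toy, not the paper’s implementation): each random hyperplane contributes one bit of the hash, set by the sign of the dot product, so vectors with small cosine distance tend to land in the same bin.

```python
import random

def lsh_hash(vec, hyperplanes):
    """Hash a vector to an integer: one bit per random hyperplane,
    set when the vector lies on the positive side (sign of the dot product)."""
    bits = 0
    for h in hyperplanes:
        dot = sum(v * w for v, w in zip(vec, h))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

# 16 random Gaussian hyperplanes in a 64-dimensional feature space.
random.seed(42)
planes = [[random.gauss(0, 1) for _ in range(64)] for _ in range(16)]

v = [random.gauss(0, 1) for _ in range(64)]
# Cosine distance ignores magnitude, so a scaled copy hashes identically.
assert lsh_hash(v, planes) == lsh_hash([2.0 * x for x in v], planes)
```

Because each bit depends only on which side of a hyperplane the vector falls, the bin index is invariant to positive scaling, exactly the property needed for a cosine-distance metric.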

See the full paper and reference.

Using this work? Send me your ideas for use cases and improvements.

Conditional Mutations in Accumulo 1.6.0

28 May 2014 »

With the release of Apache Accumulo 1.6.0, an extremely versatile new feature was added: Conditional Mutations. This new style of mutation enables atomically altering a row based on the current state of the row. One or more conditions may be set on the existence of columns or equality of value for a given column. This functionality may be extended by specifying iterators to be applied prior to evaluating the conditions.

Use Case

A common use case that could take advantage of Conditional Mutations is indexing a new domain object for an application into Accumulo. Imagine an online indexing scenario, where a ‘document’ with a unique id, miscellaneous data fields, and visibilities should be added to a table. In our API, we state that if the user tries to index a document with a UID that already exists, an exception will be thrown.

In Previous Versions

In previous versions of Accumulo, the indexing code would first have to create a scanner over the document UID’s range and attempt to iterate. If any key/value pairs are returned, the document already exists, and an exception should be thrown. If not, indexing using a traditional BatchWriter may proceed.

There are two issues with this approach:

  1. Scanning is slow. When doing high-volume indexing, it is not efficient to be constantly launching Scanners to check only for the existence of a key/value pair. The overhead of creating the scanner and network traffic dramatically reduces ingest throughput.
  2. Scanning then Writing is not atomic. Assuming multiple threads or completely separate indexer tasks running in different JVMs, it’s possible that another process may create the same document in between the scan and write steps. This, however, may be worked around by implementing a distributed locking mechanism (using, for example, Zookeeper).

Despite both of the above issues being surmountable, it is highly desirable to have server-executed atomic operations at the row-level conditioned on the current state of the row.

In Accumulo 1.6.0

The introduction of the ConditionalWriter and its associated ConditionalMutation and Condition objects enables a mutation to be applied only if one or more Conditions are met. Conditions check for the absence of a column or the equality of its value.

In the previously described use case, it is trivial to implement the desired behavior with this new functionality. Assume that a document is indexed with its unique id as the row key. There are two column families: 'meta', which may contain any number of metadata key/value pairs, and 'docId', which simply repeats the document id in the column qualifier, as follows:

[ row         [ cf    : cq    : cv    ]]   value
[ 12345       [ docId : 12345 : A&B&C ]]   ''
[ 12345       [ meta  : meta1 : A&B&C ]]   'v1'
[ 12345       [ meta  : meta2 : A&B&C ]]   'v2'
[ 12345       [ meta  : meta3 : A&B&C ]]   'v3'

In this example, we can condition the creation of a row (and therefore the document) on the non-existence of the 'docId' key (i.e. [ 12345 [ docId : 12345 : A&B&C ]] ''). When a Condition’s setValue() method has been called, the condition passes if and only if that column exists and its value is equal to the value set by setValue(). If setValue() is not called, the Condition passes only if the specified column does not exist.
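To make those passing rules concrete, here is a minimal Python model of a Condition’s semantics (this is not the Accumulo API; the function name and the dict-based row are illustrative only):

```python
def condition_passes(row, family, qualifier, value=None):
    """Model of an Accumulo Condition against one row.
    With a value: pass iff the column exists and its value matches
    (setValue() was called). Without: pass iff the column is absent.
    The row is modeled as a dict mapping (family, qualifier) -> value."""
    current = row.get((family, qualifier))
    if value is not None:
        return current == value   # equality check
    return current is None        # absence check

row = {}
# Absence check: creating document 12345 passes on an empty row.
assert condition_passes(row, 'docId', '12345')
# The mutation is applied only because the condition passed.
row[('docId', '12345')] = ''
row[('meta', 'meta1')] = 'v1'
# A second attempt to create the same document now fails the condition.
assert not condition_passes(row, 'docId', '12345')
# Equality check: passes iff the stored value matches setValue()'s argument.
assert condition_passes(row, 'meta', 'meta1', value='v1')
assert not condition_passes(row, 'meta', 'meta1', value='v2')
```

The important difference from this toy is that Accumulo evaluates the check and applies the mutation atomically on the tablet server, so no other writer can slip in between.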

The complexity and utility of the equality and existence checks may be increased by configuring scan-time iterators that are applied solely for condition evaluation. When checking for equality, for example, an iterator may be defined such that values less than some number map to 0 and all others map to 1. Such an iterator could then be used to apply a mutation if and only if the value of a given column in the row falls below or above some arbitrary threshold.

Continuing our indexing example, we may not know a priori what the visibility of any (potential) pre-existing document may be. To account for this, a custom iterator, built by extending TransformingIterator, is supplied to transform all keys’ column visibilities to the default visibility. See this gist for an example implementation of the conditional writer enforcing unique documents.


The new Conditional Mutation feature introduced in Accumulo 1.6.0 resolves two major indexing issues in practical, high-volume Accumulo applications. By leveraging the feature, mutations can be checked for uniqueness both atomically and quickly, which was difficult or impossible in previous versions of Accumulo. Further, the custom iterator stack for evaluating conditions enables complex conditions to be satisfied prior to executing a mutation on a row. This feature is paramount to the current development of Accismus, a novel system for large-scale incremental processing based on Percolator. Many other opportunities exist for leveraging this feature.

Send me your ideas for Conditional Mutation uses on twitter for a follow-up tutorial.

Branch Prediction and Why Clever Isn't Always Better

05 Jan 2014 »

Recently I was implementing a research prototype for locality-sensitive hashing using Python with NumPy. As part of this work, I had a need for a fairly simple function to create a bitmask. The one catch, however, is that the function would need to be called billions of times in a loop. It quickly became apparent that a Python implementation would be far too slow (and why should it be fast, Python isn’t designed for bit twiddling). Fortunately, it’s pretty easy to extend Python (and NumPy!) by writing a native library (more on that in another post). So I set off to write this function in C.


The function, get_mask, would take as arguments a bitfield represented as an n-bit integer, index, and an n-length array of integers, order. The order array represents a permutation of the bitfield given by index. The output of the function is index’s bits shuffled into the positions specified by order. For example:

In this toy example, imagine our integers are only four bits. We call get_mask(5, [1,3,2,0]) and get back 6. To explain, we look at the binary representation of index, in this case 0b0101. A new integer, out, is initialized to zero to represent our output.

Since the zeroth and second bits of index are 1 (where bit 0 represents the LSB), we look at the zeroth and second elements of the array order. The values in order, then, are which bits in the output to turn on. We can then say that out = (1 << order[0]) | (1 << order[2]) = (1 << 1) | (1 << 2) = 6.

The algorithm, in C, follows:

unsigned long long get_mask(unsigned long long index, int* order) {
  unsigned long long out = 0;
  int j = 0;
  while (index > 0) {
    if (index & 0x1) {
      out |= 1ULL << order[j];  /* set the output bit given by the permutation */
    }
    j += 1;
    index >>= 1;
  }
  return out;
}


Running the above C implementation of get_mask produced marked improvements over the Python implementation. However, profiling revealed a significant amount of time was still being spent in the execution of get_mask. It seemed likely we could do better.

When I wrote get_mask, I thought if (index & 0x1) would be clever. Any time index & 1 evaluated false, order + j would not have to be computed or dereferenced (1 instruction), no bit shift would occur (1 instruction), and no logical OR would occur (1 instruction): a savings on the order of 3 clocks every time index & 1 evaluated false. If, on average, half of the bits in a 64-bit integer are 0, this conditional should save about 96 clock cycles per call.

Unfortunately, I failed to take into account the branch predictor and instruction pipelining. To keep their pipelines full, CPUs fetch instructions from the near future of the computation, which requires guessing which way conditionals are going to branch. If the guess is incorrect, the CPU must dump out its pipeline and start over at the place where it guessed wrong. On recent Intel chips, the penalty is about 15 cycles.

In the while loop above, it’s essentially impossible to reliably predict which way the conditional will evaluate. The end result is that the conditional is mispredicted (in the average case) about 50% of the time. When dealing with 64-bit integers, the loop carries out 64 iterations, causing the pipeline to be flushed 32 times at 15 clock cycles each, or about 480 cycles on average.

Remember, we calculated earlier that not executing out |= 1 << order[j] when not needed would save 96 cycles. It turns out that the branch mispredictions end up costing nearly 400 more cycles than the conditional saves!

Fortunately, it is easy to conceive of an equivalent function that has no unpredictable conditional inside its loop: if (index & 0x1) out |= 1 << order[j]; is replaced with out |= (index & 1) << order[j]. When index & 1 is 0, a zero is shifted and OR’d, which has no effect on the output. Exactly as we wanted!


unsigned long long get_mask(unsigned long long index, int* order) {
  unsigned long long out = 0;
  int j = 0;
  while (index > 0) {
    /* branchless: when the low bit is 0, this ORs in a harmless zero */
    out |= (index & 1) << order[j++];
    index >>= 1;
  }
  return out;
}

Ultimately, there are more optimizations that can be done. To avoid the function call overhead, we can tell GCC to integrate the function’s code into its caller’s by declaring it static inline __attribute__((always_inline)). In my case, the millions of get_mask calls typically occur for the same order and incrementing index values, allowing me to use a lookup table (initially generated by the above function) and some other caching strategies as well. But that’s not what this article is about.

The takeaway here should be to always consider any branching in your code carefully: particularly when the branching occurs within a tight loop. If the evaluation of the condition is likely to change during the course of the iteration, there will more than likely be expensive branch mispredictions causing the CPU to rebuild its pipeline. Try to avoid this whenever possible, even if the code required to do so is “slightly less efficient.”

New Blog

03 Jan 2014 »

When I search the Internet for technical information – for a project, school, or work – more often than not I come across a blog post related to what I was trying to find. Very often, these are informative posts by people who have had the same problem and then blogged about their solution.

Previously, blogging seemed like too much of a hassle: setting up and maintaining a WordPress blog looked like a lot of work for little reward. GitHub Pages, however, lets me generate a static site while writing my content in Markdown. Even better, there are no databases to deal with, and everything is automatically backed up and version controlled.

So, with that said, when I work through interesting problems or projects, I’m going to write about them. And just maybe someone will find that useful.