One of the key elements of 23 Video is our analytics feature, which allows our customers to keep track of how their video sites and videos are performing. This isn’t just the “off-the-shelf” kind of analytics that you can get from great products like Google Analytics or Gauges (if you’re a fan of not having your eyes bleed) though, as videos have quite a few more dimensions to them than plain page views. We therefore have to track, store, process and present all of this ourselves. In theory this is actually very simple: you collect a lot of data points from any number of sources, chew on them every so often, and spit out a working set of data that you can throw at the user.

Growing pains

[Chart: total storage size of analytics data over time]

As is always the case with theories, though, it’s never quite as easy in real life (go prove string theory). The biggest problem is the sheer amount of data we accumulate, which grows by some 500,000 data points per day. A week ago, the storage requirement hit 88 GB of raw data, and if you look at the chart above, you can probably tell where this is headed. So, how do you store this kind of data? Initially, we jumped on the bandwagon of other startups that had used NoSQL databases for this kind of job and just poured the data into one big MongoDB instance on a server with quite a sizable amount of RAM. This actually had a few upsides. Instead of having to group each collected piece of data while processing, we could apply a bit of logic up front and build far more representative data points, which would not only be faster to process but also fit the logical structure of our analytics systems very well.

This worked well for a period of time, but then we hit the wall. In order to do anything meaningful with data in any database, you of course have to have indexes, which in MongoDB tend to take up a violent amount of space. So, the next step was to start separating data. Data is logically grouped by time, so after a day or so, a specific data point is never mutated again. The logical approach was to separate the data into a “hot” and a “cold” data set, which also allowed us to save quite a lot of RAM on indexes, as the cold data set didn’t need all the same indexes.
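If you’re curious, the split itself is trivial; a minimal sketch with pymongo could look like the following, where the collection names and the one-day cutoff are placeholders rather than our actual setup:

# A nightly hot/cold split: move day-old data points into a collection
# that only carries the few indexes we actually query the cold set by.
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()
hot = client.analytics.datapoints_hot      # heavily indexed, recent data
cold = client.analytics.datapoints_cold    # sparsely indexed, historical data

cutoff = datetime.utcnow() - timedelta(days=1)

for doc in hot.find({"date": {"$lt": cutoff}}):
    # Upsert by _id so the job can safely be re-run, then drop the hot copy.
    cold.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    hot.delete_one({"_id": doc["_id"]})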

So far so good; we were back at full speed for a while. That is, until a week ago, when I wanted to actually do something useful with the “cold” data set to enrich some of the denormalized data with a few more parameters. This turned out to be virtually impossible. The indexes had grown to an absurd size, and even with a sizable amount of RAM, MongoDB was spending all its time faulting document pages in and out of memory due to their sheer storage size, so I was getting literally nowhere. Add to that, the disk and backup size requirements were going mental at the aforementioned accelerated rate, so something had to be done.

Luckily, a few things played in our favor this time. It had become apparent quite a long time ago that using map/reduce for creating the denormalized output simply wasn’t viable: reading the data points we needed for a specific denormalized value from the database and working on them in a simple Python application was faster, and not by a slight margin. This meant that our entire processing architecture was already built around working on the raw data points directly, so if we grouped the data the same way we would normally query it from either the hot or cold MongoDB servers, we’d be able to chew it no matter where it actually came from. So, I wound up dumping all the cold data to simple, flat files containing JSON data grouped by the same parameters we use for querying. A few hours of writing a piece of code that would stream that data from disk to the processing application, and all was well and jolly.
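The streaming piece really doesn’t have to be anything fancy. A minimal sketch, assuming one JSON document per line and one file per query group (the directory layout and naming here are made up for illustration):

# Stream raw data points from flat, one-JSON-document-per-line files so the
# processing code doesn't care whether a group came from MongoDB or from disk.
import json
import os

def stream_group(base_dir, group_key):
    path = os.path.join(base_dir, "%s.json" % group_key)
    with open(path) as source:
        for line in source:
            line = line.strip()
            if line:
                yield json.loads(line)

# The processing application consumes this exactly like a database cursor:
# for point in stream_group("/data/cold", "2011-05-22"):
#     process(point)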

Well, almost. We could now chew data again at full pelt, but that didn’t solve one of the biggest problems: the size of the data set. I’ve previously discussed a problem which is inherent to document databases in this regard: keys. For every single document, you wind up storing the keys for each value. Hacky approaches, like using completely meaningless one-character keys, have been some of the remedies I’ve had to employ to work around this, but it’s not beautiful at all, and it really ruins the point of document databases. Steffen, though, proposed a disarmingly simple solution to this problem. As we’re storing everything in flat files, we can basically do with them what the hell we want. So, why not compress them? Half a day and countless CPU cycles spent on bzip2 -9 later, the result showed: we’d compressed the data set from 88 GB to a mere 3.3 GB of on-disk storage, or in other words, by a ratio of roughly 26.7:1 on average. The compression was of course greater the more similar the values in each file were, but even files with a lot of differing values showed great results, which can to quite an extent be attributed to one simple thing: compression of keys.
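The compression step itself is about as mundane as it gets; something along these lines, using Python’s bz2 module (the path is, of course, just an example), and bz2.BZ2File later lets the streaming reader consume the compressed files directly:

# Compress a dumped flat file with bzip2 level 9 and report the ratio.
import bz2

def compress_file(path):
    with open(path, "rb") as source:
        raw = source.read()
    packed = bz2.compress(raw, 9)
    with open(path + ".bz2", "wb") as target:
        target.write(packed)
    return len(raw), len(packed)

raw_size, packed_size = compress_file("/data/cold/2011-05-22.json")
print("%.1f:1" % (raw_size / float(packed_size)))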

Unstructured data is kinda structured

While the approach mentioned above works, we can all agree that it’s not exactly beautiful. Consider the case where using map/reduce would have been considerably faster than chewing the data points one by one. Simply dumping out the data like that would mean that we’d probably have to either import it back into a temporary database whenever we needed to work on it, or, even worse, maintain two different code bases to do the exact same job with different approaches, and neither of those is a good solution. In this scenario, then, it would still have been best to keep the data in the database, where it sort of belongs, but we need to do something about the storage size if we don’t want to just keep throwing iron at the problem.

The compression ratio above is a testament to the fact that even though our document databases are filled with all sorts of different documents, they’re still products of a structured approach: programming. Logically, then, under most circumstances the keys in a collection of documents will not differ by a whole lot. This may not be true in all cases, but it holds for almost every case I’ve seen, which gives us a great advantage.

With this assumption in mind, we can make yet another assumption: any given document structure is likely to appear more than once, so constructing a schema from any given document should incur only a reasonable overhead compared to the storage it saves. Consider the following document, showing the results of some poor gardeners counting the day’s output of different fruits on a plantation:

{
    "date": ISODate("2011-05-22T00:00:00Z"),
    "apples": 10239,
    "oranges": 399,
    "bananas": 9955
}

Now, at some point, the plantation hires a kid who went to business school. This kid produces a series of PowerPoint charts showing that they could make more money if they also started producing kiwis. So, from that day on, the document changes:

{
    "date": ISODate("2011-09-21T00:00:00Z"),
    "apples": 11239,
    "oranges": 772,
    "bananas": 7561,
    "kiwis": 9982
}

What we wind up with over time, then, is a collection of documents following one of two schemas. Any document contains either 4 keys (date, apples, oranges, and bananas) or 5 keys (date, apples, oranges, bananas, and kiwis). In our database, then, we can keep two schemas stored internally, which could look something like the following:

[{
    "id": 1,
    "keys": ["apples", "bananas", "date", "oranges"]
}, {
    "id": 2,
    "keys": ["apples", "bananas", "date", "oranges", "kiwis"]
}]
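To make this a little more concrete, here’s a rough sketch of how such an internal registry could be built up from incoming documents. The extend-by-appending convention is an assumption on my part, chosen because it keeps already-known keys at stable positions, which pays off further down:

# A tiny, in-memory registry mapping documents to internal schemas like the
# two above. Known keys keep their position and new keys are appended.
class SchemaRegistry(object):
    def __init__(self):
        self.schemas = []                       # schema id == list index + 1

    def schema_for(self, document):
        keys = set(document)
        extendable = None
        for schema_id, schema_keys in enumerate(self.schemas, 1):
            if set(schema_keys) == keys:
                return schema_id                # exact match, reuse it
            if set(schema_keys) < keys:
                extendable = schema_keys        # known schema we can extend
        if extendable is not None:
            new_keys = extendable + sorted(keys - set(extendable))
        else:
            new_keys = sorted(keys)
        self.schemas.append(new_keys)
        return len(self.schemas)

registry = SchemaRegistry()
registry.schema_for({"date": 1, "apples": 10239, "oranges": 399,
                     "bananas": 9955})                              # -> schema 1
registry.schema_for({"date": 2, "apples": 11239, "oranges": 772,
                     "bananas": 7561, "kiwis": 9982})               # -> schema 2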

When storing a document on disk, we could then simply store a schema ID followed by the values in sequential order, depending on how strict we want to make the schemas. Most of the time, the type of a value is pretty constant as well, so you could go as far as defining a storage type for each key in your schema along with a byte offset, allowing blazingly fast access to the values themselves: any parsing of the document structure can be skipped entirely, because we already have a schema defining it.

For the first of the two schemas, MongoDB uses somewhere in the vicinity of 65 bytes excluding the unique document ID. Knowing the data, we could fit the date in, for example, a 64-bit type and each of the counts in something as small as a 16-bit type, cutting the storage size down to some 18 bytes, assuming a 32-bit type for the schema ID to be on the safe side:

[unsigned 32-bit integer]    Schema ID
[signed 16-bit integer]      "apples"
[signed 16-bit integer]      "bananas"
[unsigned 64-bit integer]    "date"
[signed 16-bit integer]      "oranges"

That’s quite a difference, especially if you’re storing millions of documents with similar schemas: roughly 47 bytes saved per document, or around 45 MB per million documents, so in the case of more complex documents you’d be able to fit quite a lot more data into that precious RAM of yours.
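As a back-of-the-envelope check, here’s what packing a document against the first schema could look like with Python’s struct module; the concrete type choices mirror the layout above, and storing the date as epoch seconds is, naturally, an assumption:

# Pack a document against schema 1: a 32-bit schema ID followed by the values
# in schema key order, with 16-bit counts and a 64-bit date as laid out above.
import calendar
import struct
from datetime import datetime

FORMAT_1 = "<IhhQh"       # schema id, apples, bananas, date, oranges -> 18 bytes

def encode(doc):
    date = calendar.timegm(doc["date"].timetuple())      # seconds since the epoch
    return struct.pack(FORMAT_1, 1, doc["apples"], doc["bananas"], date,
                       doc["oranges"])

def decode(blob):
    _, apples, bananas, date, oranges = struct.unpack(FORMAT_1, blob)
    return {"date": datetime.utcfromtimestamp(date), "apples": apples,
            "bananas": bananas, "oranges": oranges}

blob = encode({"date": datetime(2011, 5, 22), "apples": 10239,
               "oranges": 399, "bananas": 9955})
print(len(blob))                                          # 18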

Prototypes and compilation

Documents are of course not constant, so you would need schemas to describe different mutations over time. Not all mutations change the schema of a document, though. Incrementing a value could require a larger type for storage, but most of the time that’s not the case, which means such operations could be made extremely fast. Adding a new key to a document does, of course, incur a mutation of the schema. The simplest way of handling this is to check whether we already have another schema that matches the new document structure, copy over the values, and be done with it.
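Sticking with the hypothetical registry from before, that step could look roughly like this:

# When a document gains a key, look up (or create) a schema matching the new
# key set and re-emit the values in that schema's key order.
def migrate(registry, document, new_key, new_value):
    document = dict(document, **{new_key: new_value})
    schema_id = registry.schema_for(document)         # reuses or creates a schema
    keys = registry.schemas[schema_id - 1]
    values = [document[key] for key in keys]           # values in schema key order
    return schema_id, values

schema_id, values = migrate(registry,
                            {"date": 2, "apples": 11239, "oranges": 772,
                             "bananas": 7561},
                            "kiwis", 9982)
# schema_id -> 2, values ordered as apples, bananas, date, oranges, kiwis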

But since accessing values in this structure should theoretically be a whole lot faster, we can take all of this to the next level by taking a peek at one of our neighbors, who has made an art out of mutating dynamic structures: the Google V8 JavaScript engine. The guys behind V8 managed to make JavaScript execution blazingly fast by internally constructing so-called “hidden classes” from JavaScript objects, where a change to an object’s structure simply mutates it from one hidden class to another. Furthermore, V8 uses these hidden classes as safeguards when compiling JavaScript down to machine code, to execute code as fast as humanly possible. They are essentially what I would consider “object prototypes,” and one of their greatest benefits is that machine code can be shared between a lot of these prototypes in cases where only a subset of the properties is accessed.

Now, documents of course do not contain executable code, but if we add another assumption, following an extended prototype-based model starts making a lot of sense. Given that document schemas are either identical or considerably similar, we can also assume that a lot of the queries we’re going to be performing against these documents are identical. For example, let’s say we have an UnQL query counting the number of days where more than 10,000 apples were produced:

SELECT COUNT(*) FROM production WHERE production.apples > 10000

Using a simple query planner, we should be able to boil this down to some very efficient code, but the most efficient way to execute the query would of course be to compile it down to machine code. While this imposes an overhead, the assumption that queries often aren’t that different means we can accept it, for the same reasons we can accept the transformation overhead between documents and schemas. Given that we only access one key in the document, we should actually be able to reuse the same machine code for both of the schemas defined above, as they both contain the apples key. However, if we want to do direct memory access, we have to ensure that apples sits at the same offset in both schemas, which is where the V8 methodology of tracking prototypes really comes in handy. If we keep track of prototypes, we can easily lay out new prototypes in such a way that the data is accessible at the same offset across different yet similar schemas. The end result? Well, for a collection containing only documents of the two example schemas, or “prototypes,” we should only need to compile the query down to one piece of machine code, and if we decide to add another prototype which has, say, a mangos key, we’ll still be able to reuse it and thus keep our CPU cache very happy. Neat, huh?
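To make the offset argument concrete, here’s a toy Python stand-in for such a compiled predicate; no actual machine code, just a fixed byte offset that both example schemas happen to share (packed_documents is an assumed iterable of schema ID and bytes pairs):

# A "compiled" predicate: read "apples" straight from a fixed byte offset and
# compare it, reusing the exact same accessor for every schema/prototype that
# keeps "apples" at that offset.
import struct

def compile_gt(offset, fmt, threshold):
    def predicate(blob):
        (value,) = struct.unpack_from(fmt, blob, offset)
        return value > threshold
    return predicate

# "apples" sits right after the 32-bit schema ID in both example schemas.
apples_gt_10000 = compile_gt(offset=4, fmt="<h", threshold=10000)

count = sum(1 for _, blob in packed_documents if apples_gt_10000(blob))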

Well, there’s actually even more to it. Consider executing the same query for kiwis instead; it would look something like:

SELECT COUNT(*) FROM production WHERE production.kiwis > 10000

Once again, we can construct a piece of machine code that efficiently executes this query, but we will only be able to apply it to one of the two schemas, as the first schema doesn’t contain the kiwis key. So we’ll actually be able to throw away a lot of documents from the get-go and gain some performance from that, while the addition of the mangos key won’t ruin our query either.
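The pruning itself is equally mechanical; under the same assumptions as before, something along the lines of:

# Prune at the schema level first: only schemas that actually contain "kiwis"
# get scanned, so documents stored under schema 1 are skipped without their
# bytes ever being touched. Reuses the hypothetical registry and compile_gt.
def schemas_with_key(registry, key):
    return set(schema_id for schema_id, keys in enumerate(registry.schemas, 1)
               if key in keys)

candidates = schemas_with_key(registry, "kiwis")                    # -> {2}
kiwis_gt_10000 = compile_gt(offset=18, fmt="<h", threshold=10000)   # kiwis offset in schema 2

matches = sum(1 for schema_id, blob in packed_documents
              if schema_id in candidates and kiwis_gt_10000(blob))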

Silver bullets

What I’ve described here is an outline of a possible solution to some of the problems that document databases face these days, most notably storage size. I’m not saying that this is a silver bullet by any means, and it won’t substitute for greater architectural traits like having a good model for partitioning large amounts of data. It could quite possibly help make scaling these kinds of databases even better, but for that to be said with certainty, well, it would have to be implemented and tested.

I previously mentioned the idea of writing a document database, and while I’m not a fan of the whole “NoSQL is a silver bullet” hype, document databases still have their place in solving some problems that are harder to solve with traditional databases. Hadoop is pretty clear evidence of this, although, in my honest opinion, it does suffer from a few other problems, such as an ever-growing number of abstraction layers built to avoid tackling problems head-on.

I often go through phases of not sleeping much, so last night I started implementing some of the ideas outlined in this post, and maybe some day that will turn into a database. But for this to ever become a reality, I’d love some input from people who’ve actually used document databases at a larger scale, so that I don’t repeat all the mistakes that have already been made. So, if that includes you, I’d love to hear from you, especially with regard to how you’ve structured your data, what queries you perform and how, and of course what scaling problems you’ve faced and how you’ve solved them.