First off, I’m a big fan of the NoSQL movement. Not only because I think SQL is usually overkill for most applications and syntactically overcomplicated, but also because I really like the ease of scaling that a lot of the NoSQL databases, like my personal favorite, MongoDB, have introduced. To be fair, “NoSQL” is quite an ambiguous term and should in most cases be replaced with “document storage” or “document database”, as these systems are essentially storage and query engines for mutable, schemaless documents.

I’ve recently had the joy of actually using some of these new possibilities in production while developing the New Analytics engine for 23 Video. Initially, I was amazed. Using the official Python driver for MongoDB made it a charm to throw some objects at the database and pull them back out with simple, logical queries that easily beat traditional SQL databases in terms of both readability and speed. Furthermore, it didn’t seem to have the same trouble managing large amounts of data that I’ve previously run into with MySQL and PostgreSQL. So far, so good, then.
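
To give a sense of what that looks like, here is a minimal sketch with the Python driver, pymongo. The database, collection and field names are made up for illustration, and the calls are from the current pymongo API rather than whatever version was around at the time:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client.analytics

# Throw a document at the database...
db.events.insert_one({"video": 42, "action": "play", "seconds": 13})

# ...and pull matching documents back out with a simple, logical query.
for event in db.events.find({"video": 42, "seconds": {"$gt": 10}}):
    print(event)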

Trouble in Paradise

One thing to understand about document databases, though, is that you only get out what you put in. So if you want to do aggregation queries, like simple sums of values, things get messier. In fact, the only way to do this is through map/reduce, which is, in my opinion, a bit overkill. Map/reduce was designed and developed for something entirely different from simple summations of numbers. It’s a great approach for hardcore analysis of data, but for anything below that level of complexity it’s too heavy a tool. Sure, MongoDB offers some simpler functions for single-server configurations, but as MongoDB was more or less designed to be a horizontally scalable database, these quickly become useless and all you’re left with is, well, map/reduce.
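
To make that concrete, here is a rough sketch, using the Python driver, of what summing a single field ends up looking like once map/reduce is the only tool available. The collection and field names are hypothetical, and this uses the classic map_reduce helper, which newer driver and server versions have since deprecated:

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().analytics

# Emit every value under a single key...
mapper = Code("function () { emit('total', this.seconds); }")

# ...and add the emitted values together.
reducer = Code("""
function (key, values) {
    var total = 0;
    values.forEach(function (value) { total += value; });
    return total;
}
""")

# The reduced result is written to an output collection, which we then
# have to query just to get a single number back out.
result = db.events.map_reduce(mapper, reducer, "seconds_total")
print(result.find_one({"_id": "total"}))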

It gets worse, though. The performance of map/reduce in MongoDB is not what you’d hope for, to say the least, and I often found myself better off transferring all of the documents I was summing over the wire to the application and doing the summation there, with great performance gains as a result. This is to be expected given the nature of map/reduce, but since the trait is shared by most, if not all, NoSQL databases, it does mean they’re not entirely fit for the Web as it is right now. Be honest: how often do you actually find yourself scanning tens of thousands of documents for the word “cheese” in a real-time setting? If you’re a relatively sane person, I’d guess your answer is somewhere in the vicinity of “not so often.”
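
And here is the workaround I kept ending up with, which shouldn’t outperform a database-side aggregation but regularly did: pull the raw documents over the wire and do the arithmetic in the application. Same hypothetical collection and field as above:

from pymongo import MongoClient

db = MongoClient().analytics

# Fetch only the field we care about and sum it on the client.
total = sum(doc["seconds"]
            for doc in db.events.find({}, {"seconds": 1, "_id": 0}))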

There are other issues, too. By design, document databases don’t have a schema, which means the structure of a document has to be determined at run time, every single time it is accessed. The schemaless approach has its upsides, in that data is completely mutable and you’re basically free to throw whatever you want at the database, but I don’t feel it matches real-world requirements all that well. Yes, in the elaborate application where all data is organic and angels actually beat their wings every single time you do a request, sure, schemaless databases make sense. In the real world, if you’re a sane person, you always have a schema. Think about it: your application will, regardless of how dynamic the database’s schema is, have a specific way of dealing with the data it throws at and gets from the database, which means you inevitably wind up with a schema by design. Problems arise in several places as a result of this “design flaw”: first of all, you always have to be sure what data you’re dealing with; if you aren’t, and don’t assert it, you’re going to wind up with broken map/reduces, which isn’t good at all. Oh yeah, and while it’s neat that we can fluently add new properties to our documents without versioning our database, it introduces an entirely different kind of complexity, in that we constantly have to check whether we’re dealing with new or old data, like so:

// A map function that has to branch on whether a document was written
// before or after each new property was introduced
function () {
  if (this.newPropertyA !== undefined) {
    // ...
  }

  if (this.newPropertyB !== undefined) {
    // ...
  }
}

That’s ugly. And a waste of compute cycles, to be honest, given that scalability is probably the reason we reached for a NoSQL database in the first place. Wasting computing time doesn’t scale.

Although this flaw doesn’t break the usage of NoSQL in large-scale environments, another trait does, in my honest opinion: storage. Properties have to have names, whether as columns in a traditional SQL database or as document properties in a document database, but because documents are completely mutable, we have to store those names in every single document. It may not seem like a big deal at first, but remember that disk I/O round trips are extremely expensive, which means that for a database to be fast we want everything available in RAM at any given time, so that the furthest we ever have to go is to some address in RAM and not to a disk or, even worse, an NFS mount or the like. At scale, this means RAM becomes our most precious resource, and ideally we’d love for our entire database to fit within the RAM we’ve bought. So what’s the problem? Consider the following scenario: we want to store the counts of 8 different fruits contained in a basket, represented as 32-bit integer values in a document, resulting in a schema along the lines of (yes, there is a high prevalence of kiwis):

{
  apples: 10,
  oranges: 2,
  bananas: 3,
  kiwis: 599,
  avocados: 3,
  limes: 69,
  pears: 5,
  lemons: 4
}

For the sake of argument, I’ll ignore the field identifiers in the BSON specification, as they don’t make things any better or easier to understand. The object above consists of two simple things: strings representing the names of the fruits, and integers representing the counts. A 32-bit integer takes up exactly that: 32 bits, or 4 bytes. So the storage cost of the actual values is 32 bytes. Given a simple ASCII character set and encoding, the names of all the fruits add up to 49 bytes. In other words, our entire document takes up 81 bytes, with 60% of that being property names. That’s a lot of waste, if you ask me. Throw 10 million baskets into a database and you’ve got a storage size of some 772 MB, 467 MB of which is pure names. In the real world, the actual storage size of the document above is 152 bytes, which makes the case even worse. A common hack is to reduce the names to single characters, although that completely eliminates any readability. In most cases this can be handled by the driver through mapping functions, but in my experience these mapping functions never seem to cover the entire spectrum of methods, which means you’ll be losing your hair every time you want to do something even slightly complex and have to remember which single-character synonym you assigned to “apples.” For those of us with already receding hairlines, this is a massive problem.
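
As a sketch of what that hack tends to look like in practice, here is a hand-rolled mapping layer in Python. The short names are of course made up, and ideally the driver would do all of this for you everywhere, which is exactly where it falls short:

from pymongo import MongoClient

db = MongoClient().analytics

# Full names in the application, single characters on disk.
FIELD_MAP = {
    "apples": "a", "oranges": "o", "bananas": "b", "kiwis": "k",
    "avocados": "v", "limes": "i", "pears": "p", "lemons": "e",
}
REVERSE_MAP = {short: full for full, short in FIELD_MAP.items()}

def shrink(basket):
    """Rename fields to single characters before insertion."""
    return {FIELD_MAP[name]: count for name, count in basket.items()}

def expand(document):
    """Restore readable names when reading a document back."""
    return {REVERSE_MAP[name]: count
            for name, count in document.items() if name in REVERSE_MAP}

db.baskets.insert_one(shrink({"apples": 10, "kiwis": 599}))
basket = expand(db.baskets.find_one())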

A proposition

While working on Palisade, a project to be released as open source some time in the future, I gave these problems some thought, since what I was designing and developing was, seen from the outside, essentially a statically typed object storage engine. So, given that I’ve already posed the statement that we always have a schema, I think a lot of the issues with what might be called “traditional NoSQL databases” can be solved by trading dynamically typed documents for statically typed objects. While this may seem very close to the approach taken by SQL at first, let me be clear that I still want the flexibility of sub-documents (sub-objects) and arrays, just in a statically typed fashion, so that I always know what I’m dealing with at run time. This solves, first of all, the performance problem of constantly having to evaluate the structure of a document, while also limiting storage size to something very close to that of the actual values. An approach for defining an object could look like the following, written in pseudo C# for the sake of argument:

object Basket
{
    int Apples;
    int Oranges;
    int Bananas;
    int Kiwis;
    int Avocados;
    int Limes;
    int Pears;
    int Lemons;
}

Collection<Basket> Baskets;

It would be possible to pack this into exactly 32 bytes, improving storage efficiency by a factor of 4.75 over the MongoDB approach. Furthermore, it would allow us to do queries in a completely different way, queries that would be guaranteed to succeed because we know both the available properties and their types at compile time. This would of course imply precompiling the actual queries, but given that the SQL world has long used stored procedures as a way of increasing performance, that doesn’t seem like a massive issue to me. Consider querying for baskets that hold more than 10 apples, in LINQ-style code, as follows:

Basket[] HasManyApples()
{
    return Baskets.Where(basket => basket.Apples > 10);
}

Now, that would be easy. And fast. Let’s consider another scenario: we want to get the e-mail addresses of users that have a specific string flag set, and the number of users that don’t have the flag. As we’re statically typing everything, we’re going to have to define three things: an object type, a result type and a query function. It would look something along the lines of:

object User
{
    string Name;
    string Email;
    string[] Flags;
}

Collection<User> Users;

object NewsletterRecipients
{
    // The e-mail addresses of recipients
    string[] RecipientEmails;

    // The number of users that aren't recipients
    uint64 NonRecipients;
}

NewsletterRecipients GetNewsletterRecipients()
{
    return new NewsletterRecipients
    {
        RecipientEmails = Users.Where(user => user.Flags.Contains("newsletterRecipient"))
                               .Select(user => user.Email),
        NonRecipients = Users.Count(user => !user.Flags.Contains("newsletterRecipient"))
    };
}

Now, you’d also want some sort of map/reduce functionality for the genuinely hard tasks, but it should by no means be the Swiss army knife it’s made out to be in databases like MongoDB. In terms of wire protocols, the result would logically be transferred using some packaging along the lines of BSON or MsgPack, so that a driver doesn’t have to naively assume a certain object structure, but the overhead here is minimal considering that we would already have reduced the amount of data to be sent on the server side.
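
To make the wire format idea slightly more tangible, here is what packing the NewsletterRecipients result from above could look like using the msgpack package in Python; the literal values are just stand-ins for whatever the server would actually produce:

import msgpack

# A reduced result, mirroring the NewsletterRecipients object above.
result = {
    "RecipientEmails": ["alice@example.com", "bob@example.com"],
    "NonRecipients": 1402,
}

wire_bytes = msgpack.packb(result)     # compact binary payload for the wire
decoded = msgpack.unpackb(wire_bytes)  # what the driver would unpack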

Now, this proposal isn’t perfect at all. There are things to be considered, like compilation strategy, versioning and upgrading existing data sets, but it seems like a step in the right direction in terms of scalable performance. It’s probably not an endeavor I’m going to venture into any time soon, so for the time being this is just a set of thoughts on evolving NoSQL databases to the next step, in a fashion far more compatible with the requirements set out by Web applications in particular.