It’s been a while since I last talked about the whole NoSQL space. Back then I lashed out at the lack of proper querying abilities in the current state of NoSQL databases, proposing a LINQ-style approach to stored objects and methods, essentially resulting in an object database rather than a document database. Since then, my stance on NoSQL databases has changed quite a bit; or, rather, my stance on the NoSQL databases that are currently available has changed.

MongoDB has made it to the top of my shit list in recent months. Not because of its inability to do even simple aggregate queries in an efficient manner. No, it’s because the database itself is terrible. Fragmentation, random crashes and the various side effects of the ridiculous notion of a global locking mechanism are common disturbances when running MongoDB in production — for a period, version 2.0.1 instances would crash on a daily basis for no apparent reason. Nights spent trying to solve this and recover data from these events are simply not worth it, no matter how much “the future” this database is.

The other two big offerings in the market, CouchDB and Cassandra, are equally unimpressive at a glance, although, granted, I do not have enough production experience with either of them to fully compare them to MongoDB. CouchDB is, however, quite silly in my opinion. By now, I’d say it’s a well-known fact that HTTP isn’t exactly the Concorde of protocols. A text-based protocol for a database, no matter how distributed, just does not make sense, least of all a protocol as general purpose as HTTP. Cassandra seems to be somewhat better than both MongoDB and CouchDB in some regards, but at the same time it seems to have a bit of an identity crisis to work on. I’m not going to consider either Redis or membase “databases” in this context, as they’re just simple key-value stores.

So, besides frantically fighting MongoDB at 23, I’ve let the NoSQL space sit for a while and “just” used plain old MySQL and PostgreSQL databases for most tasks. Today, however, a post, “The Future of CouchDB”, by the original author of CouchDB, Damien Katz, caught my attention after sifting through the Hacker News “rate my startup” spam stream. In the post, Katz talks about his move from developing CouchDB to basically completely rewriting the memcache/membase backed Couchbase server into a version 2.0. Besides the interesting points on throwing out Erlang in favor of C and C++ for raw performance, one detail has the potential to become very important, provided “I’m dead serious about making it the easiest, fastest and most reliable NoSQL database” isn’t just an empty promise: UnQL.

The Unstructured Query Language

UnQL is essentially the NoSQL counterpart to SQL. Since SQL is often referred to as Structured Query Language, it’s only fitting that UnQL is an abbreviation of Unstructured Query Language. Much like SQL did for relational databases, UnQL is an attempt at standardizing the querying of unstructured stores such as key-value and document stores. The effort is led by Damien Katz and the creator of SQLite, Richard Hipp. UnQL and SQL share a lot more than naming and intent, though; the language itself is in many ways very similar to SQL too.

For example, selecting the field content from a collection, posts, would be done through

SELECT { content: posts.content } FROM posts;

resulting in a series of documents like

{"content":"The post content"}
{"content":"Another post"}
..
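In application-code terms, that projection amounts to roughly the following. This is only a sketch in Python over an in-memory list; the `posts` data, the `user_id` field and the `select_content` helper are made up for illustration and are not part of UnQL or any database driver.

```python
import json

# Hypothetical in-memory stand-in for the "posts" collection.
posts = [
    {"user_id": 1, "content": "The post content"},
    {"user_id": 2, "content": "Another post"},
]


def select_content(collection):
    """Mimic SELECT { content: posts.content } FROM posts.

    Builds one new document per stored document, keeping only the
    "content" field.
    """
    return [{"content": doc.get("content")} for doc in collection]


projected = select_content(posts)

# Print one compact JSON document per line.
for doc in projected:
    print(json.dumps(doc, separators=(",", ":")))
```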

While this is not revolutionary compared to what MongoDB, CouchDB and Cassandra can do, the fun stuff begins with the possibility of doing very SQL-like aggregation queries using constructs like GROUP BY, UNION, and FLATTEN. The specification is, however, far from finished, so some things are still very unclear. Theoretically, getting the IDs of users who have created posts within a certain time period, along with the number of posts each of them has created, should be possible through a query not too dissimilar to the following (the time restriction, presumably some kind of WHERE clause, is left out here)

SELECT { user_id: posts.user_id, count: COUNT(posts) } FROM posts GROUP BY posts.user_id

In databases like MongoDB, you have no choice but to map/reduce your way out of this specific problem, and as you add more query parameters, things wind up pretty ugly. JOIN-like aggregation requires multiple passes when using map/reduce, but is offered implicitly in UnQL when selecting across multiple collections.
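To give an idea of how much ceremony the map/reduce route involves for the same per-user count, here is a rough sketch in plain Python. It is not MongoDB's actual map/reduce API (which takes JavaScript functions); the `posts` data and the `map_post`, `reduce_counts` and `map_reduce` names are all invented for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical in-memory stand-in for the "posts" collection.
posts = [
    {"user_id": 1, "content": "a"},
    {"user_id": 2, "content": "b"},
    {"user_id": 1, "content": "c"},
]


def map_post(post):
    """Map step: emit one (key, value) pair per post."""
    return (post["user_id"], 1)


def reduce_counts(values):
    """Reduce step: sum all values emitted for a single key."""
    return reduce(lambda a, b: a + b, values, 0)


def map_reduce(collection, mapper, reducer):
    """Group the mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in collection:
        key, value = mapper(doc)
        groups[key].append(value)
    return [
        {"user_id": key, "count": reducer(values)}
        for key, values in sorted(groups.items())
    ]


result = map_reduce(posts, map_post, reduce_counts)
print(result)
```

Three functions and a grouping pass, for what the UnQL query states in one line. Restricting the posts to a time period would mean another condition in the map step, and a JOIN-like lookup would mean running a second pass over another collection.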

Other, more NoSQL-esque operations such as “upserts” are also far more flexible and structured in UnQL, which is a welcome addition. The example queries and usage page on the UnQL homepage offers a great overview of the intended power of the language.

Is it the future?

At this stage of development of the language, it’s pretty hard to say whether UnQL could really become the future of querying databases. The syntax summary does, however, hint at a construct that could be nothing short of game changing for NoSQL databases: transactions. At this point in time, you basically need a relational database if you want any sort of transactional safety, which, even for pretty simplistic business logic and database structures, isn’t just a nice-to-have but rather a need-to-have.
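For anyone who hasn't been bitten by the lack of it yet, this is the kind of safety I mean. The sketch below uses Python's built-in `sqlite3` module, so it shows what a relational database gives you today, not anything UnQL promises; the `accounts` table and the `transfer` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts (id, balance) VALUES (?, ?)",
    [(1, 100), (2, 0)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither."""
    try:
        # The connection as a context manager commits on success and
        # rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # made by the first UPDATE has been rolled back as well.
        return False


ok = transfer(conn, 1, 2, 30)       # fits within the balance
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(ok, failed, balances)
```

The second transfer credits account 2 before the debit fails, and the rollback undoes that credit. Without transactions, that half-finished write is yours to detect and clean up.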

The syntax of the language does resemble SQL a whole lot, which of course raises the question: will UnQL simply become a superset of SQL? More relevantly, will UnQL standardize NoSQL databases like SQL essentially did for relational databases? Whether this is a good or a bad thing depends a lot on your perspective, but with the competition in the NoSQL market, I’m pretty sure that standardization is going to be a lot harder and offer far more headaches than we’ve had for many years with the pretty minute deviations in SQL syntaxes of different relational databases.

One thing is for sure, though. NoSQL really needs something like UnQL: NoSQL databases need to start moving forward in terms of abilities if they are to become serious contenders for more than just simple data mining.