The Commodification of Databases

August 31, 2015

There's a conversation that I've had a few times in the last ~10 years that I've been a software engineer, that goes something like this. Databases are known to be a hard problem. Databases are the central scalability bottleneck of many large engineering organizations. At the same time, major tech companies like Google and Facebook have famously built in-house proprietary databases that are extremely robust, powerful, and scalable, and are the lynchpins of their business. Likewise, both open source and commercial database offerings have been rapidly evolving to meet the scalability problems that people face. So the question is: will we reach a point where one or a small number of open source or commercial databases become so good, so powerful, and highly available, and so horizontally scalable that other solutions will fall by the wayside? That is, will we converge on just a few commodified databases?

A number of smart people I've worked with seem to think that this is the case. However, I'm skeptical, or think that if it is the case it will be a very long time coming.

The first part of my reasoning here hinges on why the database world has become so fragmented today. When the first relational databases came to market in the late seventies and early eighties they actually came to market in co-evolution with the SQL standard. The first commercial SQL database on the market was from Oracle in 1979 followed very closely by SQL databases from other companies like IBM (who published their first SQL database in 1981). The reason that these companies came to market with SQL databases at nearly the same time is because the SQL standard had been a number of years in the making at that point. EF Codd published his seminal paper "A Relational Model of Data for Large Shared Data Banks" in 1970, and throughout the 1970s people at IBM and elsewhere had been designing the SQL standard.

For at least 25 years SQL reigned supreme as the de facto way to query and interact with large databases. The standard did evolve, and there were always other offerings, especially in the embedded world or in special purpose applications like graph databases or hierarchical databases. But the reality is that for several decades, when people were talking about large scale databases they were nearly always talking about large scale SQL databases.

However, if you want to build a really big highly partitioned database, SQL is not a great way to go. The relational data model makes partitioning difficult---you generally need application specific knowledge to efficiently partition a database that allows joins. Also, in many cases the SQL standard provides too much functionality. If you relax some of the constraints that SQL imposes there are massive performance and scaling wins to be had.

The thing is, while SQL certainly has a lot of defficiencies, it's at least a standard. You can more or less directly compare the feature matrix of two relational databases. While every benchmark has to be taken with a grain of salt, the fact that SQL is a standardized language means it's possible to create benchmarks for different SQL databases and compare them that way. The standardization of SQL is a big part of the reason why it's kept hold for so long.

Non-relational databases and NewSQL databases like Riak, Cassandra, HBase, CockroachDB, RocksDB, and many others all take the traditional SQL database model and modify it in different ways, in some cases drastically. This means these databases are hard to directly compare to each other because their features and functionality differ so much.

There are also orders-of-magnitude improvements to be had by relaxing certain durability guarantees and by having certain special purpose data structures. In the most extreme example, a database that consists of an append-only log can max out hard drives on write speed even though such a database would be infeasibly inefficient to query. You could still think of such a thing as a database, and this is similar to what something like syslog or Scribe or Kafka is. Likewise, a database that consists of a properly balanced hash table can provide extremely efficient reads for single keys at the cost of sacrificing the ability to do range queries and at the cost of potentially expensive rebalancing operations. For instance, in the most extreme example a read-only database like cdb can do reads with one disk seek for misses and two disk seeks for successful lookups. There are so many different tradeoffs here in terms of durability gurantees, data structures, read performance, write performance, and so on that it's impossible to prove that one database is more efficient than another.

Even in more general purpose databases that do log-structured writes and can efficiently perform range queries, the specifics of how you do writes, how you structure your keys, the order you insert keys, how pages are sized, etc. can make huge differences in efficiency.

One final thing to consider on this efficiency aspect. Missing indexes can turn operations that should be blazing fast into ridiculously efficient table scans. It's easy to completely saturate the I/O on a hard drive (even an SSD) with a few concurrent table scans. Some of the smarter databases will prevent you from doing this altogether (i.e. they will return an error instead of scanning), but even then it's easy to accidentally write something tantamount to a table scan with an innocent looking for loop. In many cases it's straightforward to look at a query and with some very basic cardinality estimates come up with a reasonable suggestion index for that query, so in theory it's possible to automatically suggest or build appropriate indexes. However, one can just as easily run the risk of having too many indexes, which can be just as deleterious to database performance. Automatically detecting the database-query-in-a-for-loop case is certainly possible but not trivial. In other words, there is no silver bullet to the indexing problem.

The point is, once you move beyond SQL all bets are off. If we as an industry were to standardize on a new database query protocol it might be possible to return to a somewhat normal world where it's possible to compare different database products in a sane way---but today it's not. Even internally at Google, which is famous for probably having the most state of the art database products, there isn't a single database used---different products there are built on different technologies like BigTable, Megastore, Spanner, F1, etc.

Maybe one day we will reach the point where servers are so cheap and so powerful that people can have massive petabyte scale databases with microsecond lookups on commodity hardware. Then maybe the hardware will truly be so fast that the specifics of the software implementation will be irrelevant (although even then, the datastructures will have different running complexity and therefore benchmarks will still show "big" wins for one technology or another). However, I think that those days are quite a ways off. In the meantime, there will be a lot of companies that need terabyte and petabyte scale databases, there will be a lot of small startups that evolve into tech giants, and there will be a need to build special, custom purpose databases.

I'm not prescient, and the computer industry is one where long term predictions are notoriously difficult. However, at this point databases have been considered a difficult problem for more than fifty years. If nothing else, purely based on the longevity of the challenge thus far I think it's reasonable to assume that it will still be considered a challenging problem in computer science at least a decade from now, if not a few decades from now.