There’s a conversation that I’ve had a few times in the last ~10 years that I’ve
been a software engineer, that goes something like this. Databases are known to
be a hard problem. Databases are the central scalability bottleneck of many
large engineering organizations. At the same time, major tech companies like
Google and Facebook have famously built in-house proprietary databases that are
extremely robust, powerful, and scalable, and are the lynchpins of their
business. Likewise, both open source and commercial database offerings have been
rapidly evolving to meet the scalability problems that people face. So the
question is: will we reach a point where one or a small number of open source or
commercial databases become so good, so powerful, so highly available, and so
horizontally scalable that other solutions will fall by the wayside? That is,
will we converge on just a few commodified databases?
A number of smart people I’ve worked with seem to think that this is the case.
However, I’m skeptical, or at least think that if it is the case it will be a
very long time coming.
The first part of my reasoning here hinges on why the database world has become
so fragmented today. When the first relational databases came to market in the
late seventies and early eighties they actually came to market in co-evolution
with the SQL standard. The first commercial SQL database on the market was from
Oracle in 1979 followed very closely by SQL databases from other companies like
IBM (who published their first SQL database in 1981). The reason that these
companies came to market with SQL databases at nearly the same time is because
the SQL standard had been a number of years in the making at that point. E. F. Codd
published his seminal paper “A Relational Model of Data for Large Shared Data
Banks” in 1970, and throughout the 1970s people at IBM and elsewhere had been
designing the SQL standard.
For at least 25 years SQL reigned supreme as the de facto way to query and
interact with large databases. The standard did evolve, and there were always
other offerings, especially in the embedded world or in special purpose
applications like graph databases or hierarchical databases. But the reality is
that for several decades, when people were talking about large scale databases
they were nearly always talking about large scale SQL databases.
However, if you want to build a really big highly partitioned database, SQL is
not a great way to go. The relational data model makes partitioning
difficult—you generally need application-specific knowledge to efficiently
partition a database that allows joins. Also, in many cases the SQL standard
provides too much functionality. If you relax some of the constraints that SQL
imposes there are massive performance and scaling wins to be had.
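To make the partitioning problem concrete, here is a minimal sketch (all names and the four-shard setup are invented for illustration) of hash partitioning by primary key. When two tables are partitioned on different keys, a join between them forces lookups on other shards, which in a real distributed system means a network round trip per row:

```python
# Sketch: hash partitioning across a hypothetical 4-shard cluster,
# illustrating why joins across partitions are expensive.

NUM_SHARDS = 4

def shard_for(key):
    # Rows are routed to a shard by hashing their partition key.
    return hash(key) % NUM_SHARDS

# "users" is partitioned by user_id, "orders" by order_id.
users  = [{} for _ in range(NUM_SHARDS)]   # shard -> {user_id: row}
orders = [[] for _ in range(NUM_SHARDS)]   # shard -> [rows]

def insert_user(user_id, name):
    users[shard_for(user_id)][user_id] = {"user_id": user_id, "name": name}

def insert_order(order_id, user_id, total):
    orders[shard_for(order_id)].append(
        {"order_id": order_id, "user_id": user_id, "total": total})

def join_orders_with_users():
    # Because orders are NOT partitioned by user_id, each order's user row
    # may live on a different shard than the order itself. Every such
    # mismatch would be a cross-shard (network) lookup in a real system.
    results, cross_shard_lookups = [], 0
    for shard, rows in enumerate(orders):
        for row in rows:
            user_shard = shard_for(row["user_id"])
            if user_shard != shard:
                cross_shard_lookups += 1
            user = users[user_shard][row["user_id"]]
            results.append((row["order_id"], user["name"], row["total"]))
    return results, cross_shard_lookups
```

The escape hatch real systems use is co-partitioning: if orders were partitioned by user_id instead, the join would become shard-local, but that choice requires exactly the application-specific knowledge mentioned above.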
The thing is, while SQL certainly has a lot of deficiencies, it’s at least a
standard. You can more or less directly compare the feature matrix of two
relational databases. While every benchmark has to be taken with a grain of
salt, the fact that SQL is a standardized language means it’s possible to create
benchmarks for different SQL databases and compare them that way. The
standardization of SQL is a big part of the reason why it’s kept its hold for so
long.
Non-relational databases and NewSQL databases, RocksDB and many others, all take
the traditional SQL database model and modify it in different ways, in some
cases drastically. This makes these databases hard to compare directly, because
their features and functionality differ so much.
There are also orders-of-magnitude improvements to be had by relaxing certain
durability guarantees and by having certain special purpose data structures. In
the most extreme example, a database that consists of an append-only log can max
out hard drives on write speed even though such a database would be infeasibly
inefficient to query. You could still think of such a thing as a database, and
this is similar in spirit to what log systems like Scribe or Kafka are.
Likewise, a database that consists of a
properly balanced hash table can provide extremely efficient reads for single
keys at the cost of sacrificing the ability to do range queries and at the cost
of potentially expensive rebalancing operations. For instance, in the most
extreme example a read-only database like cdb can do
reads with one disk seek for misses and two disk seeks for successful lookups.
There are so many different tradeoffs here in terms of durability guarantees,
data structures, read performance, write performance, and so on that it’s
impossible to declare one database categorically more efficient than another.
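The append-only-log extreme mentioned above can be sketched in a few lines. This is an illustrative in-memory toy, not any real system’s design: writes are a constant-time append (on disk, a sequential write at the tail, the fastest I/O pattern a drive offers), while reads must scan the log:

```python
# Minimal append-only log "database": O(1) writes, O(n) reads.

class AppendOnlyLog:
    def __init__(self):
        self.log = []  # sequence of (key, value) records, append-only

    def put(self, key, value):
        # O(1): on disk this would be a pure sequential append.
        self.log.append((key, value))

    def get(self, key):
        # O(n): scan backwards to find the most recent write for the key.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

db = AppendOnlyLog()
db.put("a", 1)
db.put("b", 2)
db.put("a", 3)   # a later write shadows the earlier one
```

Real log-structured systems recover read performance by layering indexes or sorted runs on top of the log, which is exactly where the tradeoffs multiply.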
Even in more general-purpose databases that do log-structured writes and can
efficiently perform range queries, the specifics of how you do writes, how you
structure your keys, the order you insert keys, how pages are sized, and so on
can make huge differences in efficiency.
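One of those specifics, key insertion order, is easy to demonstrate. The sketch below uses a plain sorted Python list as a stand-in for a sorted on-disk structure (a B-tree page or an SSTable); the cost model is hypothetical but the shape of the result mirrors real systems, where out-of-order inserts cause page splits and extra compaction work:

```python
# Illustration: in a structure that keeps keys sorted, the ORDER of
# inserts alone changes the total work done. A sorted list stands in
# for a B-tree page; "shifts" stand in for splits/compactions.

import bisect

def insert_sorted(keys, new_key):
    # bisect.insort-style insert: everything after the insertion point
    # must shift, so a mid-list insert moves len(keys) - pos elements.
    pos = bisect.bisect_left(keys, new_key)
    shifted = len(keys) - pos
    keys.insert(pos, new_key)
    return shifted

def total_shifts(key_stream):
    keys, shifts = [], 0
    for k in key_stream:
        shifts += insert_sorted(keys, k)
    return shifts

# Sequential inserts always land at the tail: zero shifting.
sequential_cost = total_shifts(range(1000))
# Reverse-order inserts always land at the head: maximal shifting.
reversed_cost = total_shifts(range(1000, 0, -1))
```

Same keys, same final structure, wildly different work, which is why benchmarks are so sensitive to the exact workload.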
One final thing to consider on this efficiency aspect. Missing indexes can turn
operations that should be blazing fast into ridiculously slow table scans.
It’s easy to completely saturate the I/O on a hard drive (even an SSD) with a
few concurrent table scans. Some of the smarter databases will prevent you from
doing this altogether (i.e. they will return an error instead of scanning), but
even then it’s easy to accidentally write something tantamount to a table scan
with an innocent-looking for loop. In many cases it’s straightforward to look
at a query and, with some very basic cardinality estimates, come up with a
reasonable index suggestion for that query, so in theory it’s possible to
automatically suggest or build appropriate indexes. However, one can just as
easily run the risk of having too many indexes, which can be just as deleterious
to database performance. Automatically detecting the
database-query-in-a-for-loop case is certainly possible but not trivial. In
other words, there is no silver bullet to the indexing problem.
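The innocent-looking for loop is worth seeing concretely. In this sketch (the table and field names are invented), each unindexed lookup scans the whole table, so a loop of m lookups over n rows does O(n·m) work; a dict stands in for a real database index and makes each lookup constant-time:

```python
# The "table scan in a for loop" trap, and what an index buys you.
# All table/field names here are invented for illustration.

users = [{"id": i, "email": f"user{i}@example.com"} for i in range(10_000)]

def find_by_email(table, email):
    # Unindexed lookup: a full table scan on every call.
    for row in table:
        if row["email"] == email:
            return row
    return None

def lookup_all_scan(emails):
    # Looks harmless, but this is one table scan per element.
    return [find_by_email(users, e) for e in emails]

# The indexed version: pay one pass up front, then O(1) lookups.
email_index = {row["email"]: row for row in users}

def lookup_all_indexed(emails):
    return [email_index.get(e) for e in emails]
```

Both functions return the same results; only the work differs, which is exactly why the pattern is so easy to write and so hard to spot without looking at query plans or I/O metrics.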
The point is, once you move beyond SQL all bets are off. If we as an industry
were to standardize on a new database query protocol it might be possible to
return to a somewhat normal world where it’s possible to compare different
database products in a sane way—but today it’s not. Even internally at
Google, which is famous for probably having the most state-of-the-art database
technology, there isn’t a single database in use—different products there are
built on different technologies like BigTable, Megastore, Spanner, F1, etc.
Maybe one day we will reach the point where servers are so cheap and so powerful
that people can have massive petabyte scale databases with microsecond lookups
on commodity hardware. Then maybe the hardware will truly be so fast that the
specifics of the software implementation will be irrelevant (although even then,
the data structures will have different asymptotic complexity and therefore
benchmarks will still show “big” wins for one technology or another). However, I
think that those days are quite a ways off. In the meantime, there will be a lot
of companies that need terabyte and petabyte scale databases, there will be a
lot of small startups that evolve into tech giants, and there will be a need to
build special-purpose, custom databases.
I’m not prescient, and the computer industry is one where long term predictions
are notoriously difficult. However, at this point databases have been considered
a difficult problem for more than fifty years. If nothing else, purely based on
the longevity of the challenge thus far I think it’s reasonable to assume that
it will still be considered a challenging problem in computer science at least a
decade from now, if not a few decades from now.