This is a followup to my last post, Making Bitcoin Fast:
Introduction. This post
is going to explain some background about what UTXOs are and what the chainstate
database is used for.
Originally I planned to include information about how UTXO data is actually
encoded in LevelDB, but to keep this post at a manageable length I’m saving that
for the next post. Experienced Bitcoiners will already be familiar with most of
the content here, so this is more targeted at people who don’t understand the
UTXO model or what the chainstate database is used for. The next post will be
more technical (and is mostly written already).
The Chainstate Database
Bitcoin is a trustless, distributed digital currency. To operate a Bitcoin node
you need to be able to do the basic accounting task of validating new
transactions you see on the network. This means you need to be how much money
everyone has, and who actually has the money. The fact that there’s 170 GB of
historical data in Bitcoin’s blockchain is totally irrelevant for accounting.
Let’s say Alice has 1 BTC and Bob has 0 BTC. Alice creates a transaction that
sends her money to Bob. After receiving the payment Bob sends it back to Alice.
They do this 100 times, and at the end Alice ends up with 1 BTC (minus whatever
was lost due to transaction fees) and Bob has 0 BTC. If you’re operating a node
right now, you don’t really care about those 100 transactions. You need to
validate and track these transactions as they occur. But at the end, all you
care about is the final state of the system.
The information about how much money everyone has right now is called the
chainstate database. I call it the “chainstate” database because that’s
literally the name of the directory that Bitcoin creates. This is also called
the UTXO set. You can think of the chainstate database as the information that’s
actually critical for doing accounting tasks. It does not contain old
historical data. It only contains information about balances that matter right
Understanding this is critical to understanding why Bitcoin has any chance of
scaling at all. Right now the Bitcoin blockchain is about 170 GB. That’s a lot
of data, but only a small fraction of that is data is unneeded when operating a
Bitcoin node. The chainstate database is about 3 GB in size. That’s a pretty
modest number all things considered. You could fit the chainstate database on a
typical flash drive or phone. This series of posts is going to be about this
database. The right way to think about this is the database that stores the
relatively small amount of data that’s actually required to do the accounting
tasks required when running a node.
If you have a full Bitcoin node running you can check the size of your copy of
this database using the following command:
# This commands lists the size of the chainstate database.
$ du -sh ~/.bitcoin/chainstate
The exact number you’ll see might be a little bigger or smaller than this due to
some implementation details about how LevelDB works, but the number you see
shouldn’t be too much larger than what I’ve shown here.
UTXOs And Bitcoin’s Accounting Model
In the last section I explained that the chainstate database stores accounting
information about active balances in the Bitcoin. I also noted that this is
called the “UTXO set”. In this section I will explain more about what UTXOs are,
and the basic accounting model used by Bitcoin.
The Accounts Model
Consider for a moment how accounting for a standard bank checking account works.
If you have a checking account you have an account and routing number that
identify your account. Your bank uses these two numbers to track the balance in
your checking account. When you cash a check that balance goes up, and when you
make a payment that balance goes down. This is called the “accounts” model, and
for many people it’s the natural way of thinking about tracking payments and
balances. It’s an accounts model because balances are associated with accounts.
This is model is used by most other monetary instruments provided by a
traditional bank. For instance, debit cards and credit cards have a 16 digit
number on them that identifies which account the card is associated with.
You could build a cryptocurrency with this kind of accounting model. The way you
would do this would be to have everyone generate a private/public key pair. The
public key would act like an account number. This is essentially how Ethereum
and most other cryptocurrencies work. However, this kind of model has a number
To start with, it’s not very private or anonymous. Anyone can anonymously create
an account by generating private/public key pair without revealing their real
world identity. But this anonymity goes out the window as soon as you actually
have to transact with someone. For instance, if you want to split the bill after
getting dinner with friends you need to get you their account numbers (or
“addresses”). Once you know this you can figure out exactly how much money your
friends have. This is kind of weird, and not information most people want to
reveal to everyone they interact with.
There’s another flaw of the accounts model though that’s not as well understood.
In a lot of ways I think this second flaw is more interesting, and perhaps a
more serious problem. The problem is related to making sure balances cannot go
below zero, which I’ll explain here using an analogy related to making an online
purchase with a traditional debit card.
Suppose I have a VISA debit card and my balance is $100. I go to two online
sites, and try to make $90 purchases on both of them at the exact same time
(e.g. I had the “Buy” button ready to go in two browser tabs). The expectation
is that one of the payments will go through and the other will be rejected. They
definitely should not both be accepted. This works because a central computer on
the VISA network picks an ordering for the transactions. It decides that
transaction A was first and transaction B was second. It sees transaction
A as spending available funds and then debits funds from the account. Then it
sees B as overdrafting, and rejects the transaction. When the transactions
happen simultaneously it doesn’t matter which transaction is selected as the
first transaction. All that matters is that some ordering is chosen and
This is central to the accounts models: all of the transactions made by an
account need to have a strict sequential ordering applied to them. This is the
only way to ensure that account balances never goes below zero. How do you
actually do that in a distributed system though? It’s easy for VISA to do
because they can route all of the transactions for an account to a central
computer that enforces an ordering. It’s not so easy in a distributed peer to
peer system that doesn’t have a central authority. There are ways to do it, but
they’re complicated and have some serious drawbacks, particularly when
considering maximum transaction throughput, particularly when considering
maximum transaction throughput.
The UTXO Accounting Model
Bitcoin does not use an accounts model. Bitcoin uses a novel “unspent
transaction output” (or UTXO) model. Compared to an accounts model, the UTXO
model provides improved privacy, and also elegantly solves the problem of making
sure balances can’t go below zero in a distributed system. The way this works is
a little strange if you haven’t encountered it before, so don’t worry if this
doesn’t make sense the first read through.
I think the easiest way to explain this is to start with how new coins are
created. The job of miners is to find a new block to append to the blockchain.
They look at pending transactions and come up with a valid block containing some
of these transactions. This done using a system called “proof of work”, which
ensure that blocks are found in a fair distributed manner, without relying on
traditional synchronization mechanisms. Each block has a list of transactions.
There’s a special transaction in each block called the
coinbase transaction. This transaction
creates a brand new coin owned by the miner. Currently the block reward is 12.5
BTC, which is the inflation amount of new coins created in each block.
Let’s say Alice is a miner who recently mined a new block. At this point she has
a coin worth 12.5 BTC. Alice wants to send Bob 4 BTC. She creates a transaction
that has two outputs (which roughly correspond to Bitcoin payment addresses).
The first output is for 4 BTC and goes to an address Bob provides her. For the
second output Alice creates a public/private key pair, and sends the public key
8.5 BTC of change. In this process we
started with one coin worth 12.5 BTC and ended up with two coins worth smaller
amounts. The original 12.5 BTC coin is considered to be spent, and the two new
coins are unspent. These two new coins are “unspent transaction outputs”, or
As we saw coins can be split as the result of a transaction, where one coins
becomes two. The opposite situation happens when making a high value payment
from a number of lower value coins. For example, let’s say Bob already had coins
worth 1 and 2 BTC respectively. At this point he has a total of 7 BTC spread
across three different coins (1 + 2 + 4). If he wants so send Carol 6.5 BTC he
would create a transaction that uses all three of his coins as inputs. The
transaction sends 5.5 BTC to an address given to him by Carol, and 0.5 BTC back
to a new change address he generates. In this example the transaction turned
three UTXOs into two UTXOs. In general a transactions have one or more inputs
and one or more outputs, and a transaction can increase or decrease the total
number of UTXOs.
The reason this simplifies the accounting model is that we don’t need to
sequence transactions in the same way as an accounts model. Coins are in a
binary state, spent or unspent. Since coins can only be spent once, we only need
to track unspent coins. This also helps during periods of network congestion
because it makes it possible for wallets to create transactions that are
independent from one another. This independence of transactions prevents a low
fee transaction from blocking a high fee transaction. This system isn’t strictly
simpler, as some other things become more complicated when using UTXOs. But from
an accounting point of view the system is simple and easy to reason about.
By the way, these UTXOs are quite literally called “coins” in the Bitcoin source
code. A coin does not have to have a unitary monetary value, which makes the
term a little confusing. For that reason I generally prefer to call them UTXOs.
UTXO Age Distribution
Every UTXO is created in some block. We can consider the age of a UTXO to be the
age of the block that first included it. I was interested in learning more about
the age distribution of UTXOs, so I wrote some code to dump my local UTXO set
and generated some graphs and statistics.
Since a UTXO can only be spent once, my expectation was that most UTXOs would
relatively young, i.e. there would be a lot more recently created UTXOs than old
UTXOs. The actual results were somewhat surprising to me. In the graph below the
blue line shows the proportion
density of UTXOs
in each bin, and the red line shows the cumulative
UTXOs age over time.
As you can see from this data, there is a bias towards young UTXOs, but the bias
is really not that extreme. About 50% of the UTXOs have an age of 100k blocks or
less. At 10 minutes per block, 100k blocks works out to about two years. This
was surprising to me, and I had expected the distribution to more strongly
biased towards recent UTXOs.
If you want to learn more about the age distribution I recommend checking out
utxo-stats.com which presents this data in another
more visually appealing way.
The next post will go more into database structure and caching details, but you
can start thinking about some of the problem this poses now. In a database with
asymmetric access patterns you can exploit things like recency to improve cache
hit rates. This is most effective when there’s a strong bias in the data, and
the bias shown here exists but is fairly weak.
A Glimpse At The Future
Today you need to download and validate the full Bitcoin blockchain all the way
from the genesis block in January 2009 to locally build the chainstate database.
This is about 170 GB of data, and it’s a fairly onerous task. This is a big
problem because it’s only going to get worse over time. Downloading 170 GB of
data to build a 3 GB database isn’t very efficient. A lot of people have
correctly criticized Bitcoin by pointing out that this doesn’t scale. A lot of
Bitcoin and blockchain skeptics have also prematurely jumped to the conclusion
that Bitcoin doesn’t have a long term future due to these data storage
There’s a really interesting area of active research about how to solve this
problem. The basic idea is that if you could come up with a trustless way to
bootstrap a copy of the chainstate database you wouldn’t need to download and
validate 170 GB of historical data. There are several different proposed
solutions, and a number of Bitcoin developers are thinking about this problem
and working on it right now. I’m not prescient and I don’t have a crystal ball,
but I think one of these is likely to make it into Bitcoin in the next two years
(which takes into account the time it takes to build consensus around a BIP and
actually deploy it).
The one that I think is most likely to make it into Bitcoin first is some kind
of scheme built around Merkle sets. Bram Cohen (of BitTorrent fame) presented
about this topic at the SF Bitcoin Developers meetup last year, and you can find
a video of that on YouTube or
This isn’t the only idea, and there are a few really fascinating ideas using
novel cryptographic primitives like cryptographic
accumulators. One of
the more well known proposals that you may have heard of is called
However the Merkle set idea is probably the simplest proposal, and the one
that’s the most compatible with how Bitcoin actually works today. There are a
lot of details that need to be ironed out, but the basic approach seems sound
and has drawn a lot of interest. If you’re interested in Bitcoin I definitely
encourage you to read more about these ideas.
Regardless of what kind of scheme is used to bootstrap nodes in the future,
Bitcoin will still need a chainstate database that operates essentially how it
does today. There’s no getting around the fact that you need to store accounting
information in a database.