In this post I am going to address some misconceptions about the difference
between cyclic redundancy checks (CRCs)
and generalized hash functions,
including cryptographic hash functions.
We’ll start with an explanation of what the two are. Both CRCs and hash
functions share the property that they take an input value and reduce it to a
(usually shorter) output value. Typically inputs can be arbitrarily many bits in
length, and the output is a small value, frequently in the range of 32 to 512
bits. By convention the output value for a CRC is called a “checksum”, and the
output value for a hash function is called a “digest”.
CRCs are a type of error-detecting code used to implement checksums. CRCs are
specifically designed to satisfy the property that they can detect transmission
errors in data. The idea is that you’re transmitting a signal over a potentially
lossy medium (e.g. Ethernet or radio waves) and you want to detect any data loss
or data corruption. The CRC is computed over the data payload and sent as part
of the data transmission. The receiver can recompute the CRC and compare it to
the received value to detect errors. No property of the CRC matters other than
whether it differs when data corruption occurs. It isn’t important if the CRC is
biased in any way as long as transmission errors result in differing CRCs.
In general a CRC can be n bits in length, but most commonly a 32-bit CRC is
used. An n-bit CRC has the property that any single error burst of up to n bits
is mathematically guaranteed to produce a distinct CRC checksum. This guarantee
is possible because CRCs are computed by dividing the message by a carefully
chosen generator polynomial. As a special case, this typically means that an
error of a single bit (or a single byte) is guaranteed to change the CRC
checksum, regardless of any other property of the transmitted data, including
its length. This is a powerful property, since it guards against the most
common types of transmission errors.
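To make the flow described above concrete, here's a minimal sketch using
Python's `zlib.crc32` (one common CRC-32 variant): the sender computes the
checksum over the payload and transmits both, and a single flipped bit in the
payload causes the receiver's comparison to fail.

```python
import zlib

payload = b"some data to transmit"
checksum = zlib.crc32(payload)  # sender computes this and sends it along

# Simulate a single-bit transmission error in the payload.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

# The receiver recomputes the CRC over whatever it received.
assert zlib.crc32(payload) == checksum    # intact data verifies
assert zlib.crc32(corrupted) != checksum  # the single-bit error is detected
```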
General purpose hashes, including cryptographic hashes, are a little different.
Again, they take an input value and reduce it to a (typically smaller) digest
value. Hash functions optimize for a different property, however: they try to
minimize bias in the output digest value. That is, a hash function should
produce unbiased output values even when its input is biased. Hash functions
also try to minimize collisions between differing input values, but there are
usually no guarantees made on what conditions can lead to a hash collision
other than probabilistic ones.
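As a small illustration of unbiased output using Python's `hashlib`: two
inputs that differ in only a single bit produce digests with no visible
relationship to each other.

```python
import hashlib

# Two one-byte inputs differing in a single bit.
a = hashlib.sha256(b"\x00").hexdigest()
b = hashlib.sha256(b"\x01").hexdigest()

print(a)
print(b)
# The two digests share no obvious structure: a good hash function's
# output looks uniformly random, even for nearly identical inputs.
```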
It’s inappropriate to use a CRC in place of a general purpose hash function
because CRCs usually have biased output. It’s equally inappropriate to use a
general purpose hash function in place of a CRC because general purpose hash
functions usually do not make any guarantees on the conditions under which hash
collisions can occur.
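One way to see why a CRC makes a poor general purpose hash is its linearity.
Here's a small sketch with Python's `zlib.crc32`: for equal-length inputs, the
CRC of an XOR combination is predictable from the individual CRCs, which is
exactly the kind of output structure a good hash function must not have.

```python
import zlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(a, b))

x, y, z = b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"

# CRC-32 is affine over GF(2), so for equal-length messages:
#   crc(x ^ y ^ z) == crc(x) ^ crc(y) ^ crc(z)
lhs = zlib.crc32(xor_bytes(xor_bytes(x, y), z))
rhs = zlib.crc32(x) ^ zlib.crc32(y) ^ zlib.crc32(z)
assert lhs == rhs  # no comparable identity holds for, say, SHA-256
```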
Most modern cryptographic hash functions have very large digest values compared
to what is typically used in a CRC. For instance, the most widely used CRC I've
seen is CRC-32, which has multiple variations, all of which produce a 32-bit
checksum value. By contrast MD5
has a 128-bit digest value and SHA-1 has
a 160-bit digest value. More modern cryptographic hash functions frequently have
even larger digest sizes. Accordingly CRC values have somewhat fallen by the
wayside in many modern applications. However, it’s still important to understand
how the two are different. In particular, it’s probably a bad idea to use a hash
function for a CRC if you are going to reduce the size of the digest to the same
size as a CRC checksum anyway. It’s also worth noting that the CRC algorithm can
be parameterized by larger values of n, so it's also possible to produce
large CRC values such as CRC-128 (or higher).
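For reference, the digest sizes mentioned above can be read straight out of
Python's `hashlib` (`digest_size` is reported in bytes, so multiply by 8 for
bits):

```python
import hashlib

for name in ("md5", "sha1", "sha256", "sha512"):
    h = hashlib.new(name)
    print(name, h.digest_size * 8)  # prints 128, 160, 256, and 512 bits
```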