In this post I am going to address some misconceptions about the differences between, and the uses of, cyclic redundancy checks (CRCs) and general-purpose hash functions, including cryptographic hash functions.

We'll start with an explanation of what the two are. Both CRCs and hash functions share the property that they take an input value and reduce it to a (usually shorter) output value. Typically inputs can be arbitrarily many bits in length, and the output is a small value, frequently in the range of 32 to 512 bits. By convention the output value for a CRC is called a "checksum", and the output value for a hash function is called a "digest".

CRCs are a type of error-detecting code used to implement checksums. CRCs are specifically designed to detect transmission errors in data. The idea is that you're transmitting data over a potentially unreliable medium (e.g. an Ethernet link or radio waves) and you want to detect any loss or corruption. The sender computes the CRC over the data payload and sends it as part of the transmission. The receiver recomputes the CRC and compares it to the received value to detect errors. No property of the CRC matters other than whether it changes when the data is corrupted; it isn't important if the CRC output is biased in any way, as long as transmission errors result in differing checksums.
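As a minimal sketch of that sender/receiver flow, here's what it might look like in Python using the standard library's `zlib.crc32` (the framing format here, payload followed by a 4-byte big-endian checksum, is just an illustration, not any particular protocol):

```python
import zlib

payload = b"some data worth protecting"

# Sender: compute the CRC and append it to the transmission.
checksum = zlib.crc32(payload)
frame = payload + checksum.to_bytes(4, "big")

# Receiver: split the frame, recompute the CRC, and compare.
received_payload, received_checksum = frame[:-4], frame[-4:]
ok = zlib.crc32(received_payload) == int.from_bytes(received_checksum, "big")
print(ok)  # True for an intact frame

# A single corrupted bit makes the comparison fail.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
ok = zlib.crc32(corrupted[:-4]) == int.from_bytes(corrupted[-4:], "big")
print(ok)  # False
```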

In general a CRC can be *n* bits in length, but a 32-bit CRC is the most common.
An *n*-bit CRC has the property that any single burst error spanning up to *n*
bits is *mathematically guaranteed* to produce a different CRC checksum. This
guarantee follows from how CRCs are computed: the checksum is the remainder of
dividing the message, treated as a polynomial over GF(2), by a carefully chosen
generator polynomial. As a special case, this means that an error in a single
bit (or a single byte) is guaranteed to change the checksum, regardless of any
other property of the transmitted data, including its length. This is a powerful
property, since it guards against the most common types of transmission errors.
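The single-bit case is easy to check exhaustively for a small message. This sketch flips each bit of a message in turn and confirms that CRC-32 always changes (a one-bit error is a burst of length 1, well within the 32-bit guarantee):

```python
import zlib

message = b"hello, world"
original = zlib.crc32(message)

# Flip every bit of the message, one at a time; each single-bit
# error is guaranteed to produce a different CRC-32 checksum.
for i in range(len(message) * 8):
    corrupted = bytearray(message)
    corrupted[i // 8] ^= 1 << (i % 8)
    assert zlib.crc32(bytes(corrupted)) != original
```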

General-purpose hash functions, including cryptographic hashes, are a little different. Again, they take an input value and reduce it to a (typically smaller) digest value. Hash functions optimize for a different property, however: they try to minimize bias in the output digest. That is, a hash function should produce uniformly distributed output values even when its inputs are biased. Hash functions also try to minimize hash collisions between differing input values, but they usually make only probabilistic guarantees about the conditions that can lead to a collision.
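This lack of bias is sometimes called the avalanche property: nearly identical inputs should produce digests that look completely unrelated. A quick sketch with Python's `hashlib` (the input strings here are arbitrary):

```python
import hashlib

# Two highly similar, biased inputs differing in one character...
a = hashlib.sha256(b"message-0001").digest()
b = hashlib.sha256(b"message-0002").digest()

# ...should yield digests that look unrelated. Count how many of the
# 256 output bits differ; for a well-behaved hash this is close to half.
diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
print(diff, "of 256 bits differ")
```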

It's inappropriate to use a CRC in place of a general-purpose hash function because CRCs usually have biased output. It's equally inappropriate to use a general-purpose hash function in place of a CRC because general-purpose hash functions usually make no guarantees about the conditions under which hash collisions can occur.
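One concrete way to see why a CRC makes a poor general-purpose hash is its linear structure: over GF(2), CRC-32 is affine, so for any three equal-length inputs the checksum of their XOR equals the XOR of their checksums. That kind of algebraic relationship makes collisions trivial to construct deliberately, which no good hash function should allow. A small demonstration (the input strings are arbitrary; any equal-length bytes work):

```python
import zlib

x, y, z = b"hello", b"world", b"check"  # arbitrary equal-length inputs
xyz = bytes(a ^ b ^ c for a, b, c in zip(x, y, z))

# CRC-32 is affine over GF(2), so the checksum of the XOR of three
# equal-length messages equals the XOR of their three checksums.
assert zlib.crc32(xyz) == zlib.crc32(x) ^ zlib.crc32(y) ^ zlib.crc32(z)
```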

Most modern cryptographic hash functions have very large digest values compared
to what is typically used in a CRC. For instance, the most widely used CRC I've
encountered is CRC-32, which has multiple variations, all of which produce a
32-bit checksum value. By contrast MD5 has a 128-bit digest value and SHA-1 has
a 160-bit digest value. More modern cryptographic hash functions frequently have
even larger digest sizes. Accordingly, CRC values have somewhat fallen by the
wayside in many modern applications. However, it's still important to understand
how the two are different. In particular, it's probably a bad idea to use a hash
function in place of a CRC if you are going to truncate the digest to the size
of a CRC checksum anyway. It's also worth noting that the CRC algorithm can be
parameterized by larger values of *n*, so it's also possible to produce large
CRC values such as CRC-128 (or higher).
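To illustrate how *n* is just a parameter, here's a minimal bit-at-a-time sketch of a reflected CRC in Python, parameterized by width and generator polynomial. The function name and structure are my own for illustration, but with the standard CRC-32 parameters it reproduces `zlib.crc32`; a wider CRC just means a wider register and polynomial.

```python
import zlib

def crc_reflected(data: bytes, width: int, poly: int, init: int, xor_out: int) -> int:
    """Bit-at-a-time, reflected (LSB-first) CRC; `width` parameterizes n."""
    mask = (1 << width) - 1
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return (crc ^ xor_out) & mask

# The standard CRC-32 parameters (reflected polynomial 0xEDB88320,
# all-ones init and final XOR) reproduce Python's zlib.crc32.
assert crc_reflected(b"123456789", 32, 0xEDB88320,
                     0xFFFFFFFF, 0xFFFFFFFF) == zlib.crc32(b"123456789")
```

Swapping in a 64-bit width and a suitable 64-bit generator polynomial yields a CRC-64 from the same function, with no structural changes.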