Unexpected Places You Can And Can't Use Null Bytes

February 1, 2017

The traditional way of representing strings in C is using null-terminated character arrays. Common C library functions like strcpy, printf, etc. detect how long strings are by sequentially scanning memory until a null byte is found. This complicates situations where the string data itself should contain literal null bytes. Generally this doesn't mean that it's impossible to use strings with embedded null characters, just that you have to be more careful when doing so. Typically this is done by using functions that take the length of the string as an explicit parameter.

For instance, one may run into problems using printf(3) to write a string containing null bytes to stdout; but the limitation can be worked around using fwrite(3) which accepts a parameter describing how large the data buffer is:

// INCORRECT: won't work as expected if str contains null bytes
printf("%s", str);

// OK: no problems with embedded null bytes here
fwrite(str, sizeof(char), nbytes, stdout);
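
To make the difference concrete, here is a minimal sketch. The names write_bytes and sample are hypothetical, made up for this illustration. The point is that strlen(3) stops at the first null byte, while fwrite(3) writes exactly as many bytes as it is told to.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sample data: seven data bytes ("foo", a null byte, "bar")
 * followed by the implicit terminator the compiler appends. */
static const char sample[] = "foo\0bar";

/* Hypothetical helper: write exactly nbytes from buf to out, embedded
 * null bytes included. Returns the number of bytes actually written. */
static size_t write_bytes(FILE *out, const char *buf, size_t nbytes)
{
    assert(out != NULL && buf != NULL);
    return fwrite(buf, sizeof(char), nbytes, out);
}
```

Calling write_bytes(stdout, sample, sizeof(sample) - 1) emits all seven bytes, whereas strlen(sample) reports only three, which is all that printf("%s", sample) would print.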

However, there are a few places where you don't have this option, and embedded null bytes just won't work. In these situations you really, truly can't use strings that contain null bytes. In this article I'm going to give a few examples of places where you cannot use null bytes; and one surprising place where you can.

Filesystem Paths

On Unix, a path is defined to be a null-terminated C string. This means you can't have a file whose name is foo\0bar or any other string containing null bytes.

To open a file on Unix you use the open(2) syscall. That function has the following signature:

int open(const char *pathname, int flags, mode_t mode);

As you can see, there's no parameter describing the length of the path; the kernel treats the path parameter as a regular null-terminated C string. If you were to supply foo\0bar as the path, there's no way the kernel would be able to distinguish it from the string foo. You can confirm this by looking at fs/open.c in the Linux kernel, which is the file that defines open(2) and most of the other file-oriented system calls. Look for the line that starts with SYSCALL_DEFINE3(open, and you'll see there's no trickery involved here. Quite a few other system calls that operate on file names are defined in this file, and all of them take paths as const char * parameters.
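
Since the kernel will silently stop at the first null byte, a program that receives paths as length-prefixed data may want to reject such input before calling open(2). The following is a small sketch of that check; path_has_embedded_nul is a hypothetical helper name, not part of any standard API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if the first len bytes of buf contain
 * a null byte, meaning open(2) would see a shorter path than intended. */
static bool path_has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, '\0', len) != NULL;
}
```

For the seven-byte buffer foo\0bar this returns true, and passing that buffer to open(2) anyway would operate on a file named foo.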

While we're on the topic, it's worth noting that the only other restriction on filenames is that they cannot contain a /, which is the character used to separate directories in a path. Filenames can contain arbitrary other binary data, including spaces and newlines, and there's no defined character encoding.

Command Arguments

C programs define an entry point called main() with the following function prototype:

int main(int argc, char **argv);

The parameter argc contains the number of arguments, and argv is an array of null-terminated strings representing the command line arguments for the program (with a final null element). Suppose you were to try to encode a null byte in one of the argv parameters. The issue is that there is no parameter specifying the lengths of the strings in argv. Therefore there's no way for the invoked program to know if the argv parameters have embedded null bytes or not.

You might wonder if this is just a limitation of the C API. For instance, what if you write a program in assembly? Is there another way to access the argument parameters and get their size?

Actually, the answer is no. Here's one way to think about it. The prototype for execve(2) is like this:

int execve(const char *filename, char *const argv[], char *const envp[]);

The parameters to execve need to be passed through to the new process. To do this, the kernel must copy this data into the memory of the new process. Since the kernel is taking regular char * types (or in this case, arrays of them), it has to assume they're null-terminated when copying them, so anything after the first null byte in an argument is never copied.

If you have a program that needs to be able to work with embedded null bytes for parameters, you should have a way to specify such parameters either on stdin, or via a file; or preferably, both.
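
As a rough sketch of the stdin or file approach, the hypothetical read_all helper below reads an entire stream into a buffer and reports its length separately, so null bytes in the input survive. The name, the initial 4096-byte capacity, and the doubling strategy are all choices made for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything from `in` into a malloc'd buffer.
 * The length is returned through out_len, so embedded null bytes are kept.
 * Returns NULL on allocation or read failure; the caller frees the result. */
static char *read_all(FILE *in, size_t *out_len)
{
    assert(in != NULL && out_len != NULL);
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        len += fread(buf + len, 1, cap - len, in);
        if (len < cap)
            break;                      /* short read: EOF or error */
        cap *= 2;                       /* buffer full: grow and keep reading */
        char *tmp = realloc(buf, cap);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

A program could call read_all(stdin, &n) when no file is given and read_all on an fopen(3) stream otherwise, which covers both input routes with the same code.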

Environment Variables

If you were paying close attention above, you'll recall that execve(2) actually takes three parameters. The third parameter is a list of environment variables for the new process. In fact, most C compilers on Unix systems will allow you to define main() with the prototype:

int main(int argc, char **argv, char **envp);

Under the hood, library calls like getenv(3) and setenv(3) are implemented by accessing this environment array (or a copy of it).
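
To illustrate, here is a simplified sketch of such a lookup. It is not the actual libc implementation; lookup_env is a hypothetical name, and it assumes the conventional NAME=value layout of each entry. Note that both the name comparison and the returned value rely on null termination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan a null-terminated array of "NAME=value"
 * strings and return a pointer to the value for name, or NULL if absent. */
static const char *lookup_env(char *const envp[], const char *name)
{
    assert(envp != NULL && name != NULL);
    size_t name_len = strlen(name);
    for (size_t i = 0; envp[i] != NULL; i++) {
        /* Match the whole name, then require the '=' separator. */
        if (strncmp(envp[i], name, name_len) == 0 && envp[i][name_len] == '=')
            return envp[i] + name_len + 1;
    }
    return NULL;
}
```

The returned pointer is just another null-terminated string, so a value is cut off at its first null byte no matter how it was stored.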

You can't have null bytes in environment variables (or their values) for exactly the same reason that you can't have them in argv parameters.

Bonus: "Abstract" Unix Domain Sockets

In this "bonus" section I'm going to describe an unexpected place where you can use embedded null bytes. Linux implements an esoteric, non-standard extension for AF_UNIX sockets that allows you to use null bytes in a surprising way. This feature is documented in the Linux man page for unix(7).

Here's how it works. The system calls bind(2) and connect(2) accept a sockaddr struct describing the address to connect to or bind on, as well as another parameter called addrlen that describes the size of the sockaddr struct. Normally the value for the addrlen parameter is simply the sizeof the concrete address struct you pass in:

int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

This unusual addrlen parameter is required because the socket address needs to be polymorphic across different types of socket structures. There are multiple socket families that all have different addressing schemes. For instance, the address of an IPv4 socket uses a 32-bit IPv4 address, the address of an IPv6 socket uses a 128-bit IPv6 address, and the address of a Unix socket normally uses a filesystem path. These are all represented with different underlying struct types: an IPv4 socket uses a struct sockaddr_in, an IPv6 socket uses a struct sockaddr_in6, and a Unix socket uses a struct sockaddr_un.

Since C doesn't have real polymorphism, what you do is declare a concrete type like sockaddr_in or sockaddr_un and then supply bind(2) and connect(2) with a pointer to that struct, cast as a sockaddr *. The true length of the underlying socket address is given via addrlen. In a more modern language you'd implement this polymorphism using inheritance or some type of abstract interface, but C doesn't have these capabilities. The addrlen parameter can be thought of as a clever hack to work around this limitation of the C language.
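
The casting pattern can be sketched as follows. The helper family_of is a hypothetical name used only for this illustration. It reads the family field through the generic struct sockaddr pointer, which works because every concrete address type begins with that field.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: read the address family through the generic
 * pointer type, the way a callee dispatching on the family would. */
static sa_family_t family_of(const struct sockaddr *addr)
{
    assert(addr != NULL);
    return addr->sa_family;
}
```

A caller would write family_of((const struct sockaddr *)&in4) for a struct sockaddr_in named in4, using the same cast as when handing the struct to bind(2) together with sizeof(in4) as addrlen.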

For Unix sockets on Linux, the sockaddr_un type is defined like this:

struct sockaddr_un {
    sa_family_t sun_family;               /* AF_UNIX */
    char        sun_path[108];            /* pathname */
};

Normally you'd put a regular null-terminated filesystem path in the sun_path field.

The "abstract" socket feature on Linux instead works like this: you set the first byte in sun_path to be \0, and then put up to 107 additional bytes after it. Then the addrlen parameter to bind(2) or connect(2) is set to be sizeof(sa_family_t), which is two, plus the number of bytes you put into sun_path, including the initial null byte.

The kernel looks at the first two bytes of the struct that addr points to, which always hold an sa_family_t representing the socket family type. If it sees AF_UNIX, it then computes the size of the value in sun_path using addrlen - 2. In this way the kernel can explicitly tell how large the value stored in sun_path is, which is why using an initial null byte is possible. If the first byte in sun_path is zero then the kernel considers the socket name to be "abstract". Such a socket will exist in memory in the kernel, but does not correspond to a filesystem path.
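
Here is a minimal, Linux-only sketch of building such an address. The helper make_abstract_addr is a hypothetical name, and it uses offsetof(struct sockaddr_un, sun_path) rather than a hard-coded 2 for the size of the family field, which is an assumption about good practice rather than something the kernel requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: fill addr with an abstract socket name of name_len
 * bytes (which may themselves contain null bytes). Returns the addrlen to
 * pass to bind(2) or connect(2), or 0 if the name does not fit. */
static socklen_t make_abstract_addr(struct sockaddr_un *addr,
                                    const char *name, size_t name_len)
{
    assert(addr != NULL && name != NULL);
    if (name_len > sizeof(addr->sun_path) - 1)
        return 0;                       /* at most 107 bytes after the null */
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    /* sun_path[0] stays '\0', which marks the address as abstract. */
    memcpy(addr->sun_path + 1, name, name_len);
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + name_len);
}
```

The result would then be used as bind(fd, (struct sockaddr *)&addr, len), where fd is an AF_UNIX socket and len is the value returned by the helper. Passing sizeof(addr) instead would make the trailing zero padding part of the name.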

Abstract sockets have a few interesting uses. One useful property is that they're reference counted by the kernel. For regular Unix sockets defined in the filesystem, you may need to take care to remove stale sockets from the filesystem after use. Abstract sockets don't have this problem: once an abstract socket is no longer in use by any process, the kernel automatically cleans it up.