The traditional way of representing strings in C is using null-terminated
character arrays. Common C library functions like strcpy, printf, etc. detect
how long strings are by sequentially scanning memory until a null byte is found.
This complicates situations where the string data itself should contain literal
null bytes. Generally this doesn't mean that it's impossible to use strings with
embedded null characters, just that you have to be more careful when doing so.
Typically this is done by using functions that explicitly specify the length of
strings.
For instance, one may run into problems using printf(3) to write a string containing null bytes to stdout; but the limitation can be worked around using fwrite(3) which accepts a parameter describing how large the data buffer is:
// INCORRECT: won't work as expected if str contains null bytes
printf("%s", str);
// OK: no problems with embedded null bytes here
fwrite(str, sizeof(char), nbytes, stdout);
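To make the difference concrete, here is a minimal sketch. The helper names `visible_length` and `write_all` are made up for this illustration. The first shows where a length-scanning function like strlen(3) stops, and the second writes a buffer using an explicit byte count.

```c
#include <stdio.h>
#include <string.h>

/* Returns the length strlen() reports for a buffer holding the seven
 * data bytes "foo", '\0', "bar". */
size_t visible_length(void)
{
    const char data[] = "foo\0bar";  /* sizeof(data) is 8: 7 bytes + terminator */
    return strlen(data);             /* scanning stops at the first null byte */
}

/* Writes exactly nbytes from buf to out, null bytes included.
 * Returns the number of bytes fwrite() reports as written. */
size_t write_all(const char *buf, size_t nbytes, FILE *out)
{
    return fwrite(buf, 1, nbytes, out);
}
```

Here `visible_length()` returns 3, while `write_all("foo\0bar", 7, stdout)` emits all seven bytes because the length is passed in rather than discovered by scanning.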
However, there are a few places where you don't have this option, and embedded null bytes just won't work. In these situations you really, truly can't use strings that contain null bytes. In this article I'm going to give a few examples of places where you cannot use null bytes; and one surprising place where you can.
Filesystem Paths
On Unix, a path is defined to be a null-terminated C string. This means you
can't have a file whose name is foo\0bar or any other string containing
null bytes.
To open a file on Unix you use the open(2) syscall. That function has the
following signature:
int open(const char *pathname, int flags, mode_t mode);
As you can see, there's no parameter describing the length of the path: the
kernel treats the pathname parameter as a regular null-terminated C string. If
you were to supply foo\0bar as the path, there's no way the kernel would be able
to disambiguate this from the string foo. You can confirm this by looking at
fs/open.c in the Linux kernel, which is the file that defines open(2) and most
of the other file-oriented system calls. Look for the line that starts with
SYSCALL_DEFINE3(open, and you'll see there's no trickery involved here. Again,
there are quite a few other system calls operating on file names defined in this
file, and all of them define paths using const char * parameters.
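The truncation is easy to observe from user space. The following is a small sketch, assuming /tmp is writable; the function name `path_truncated_at_null` and the file name /tmp/nulldemo are hypothetical choices for this demo. It passes open(2) a buffer with an embedded null byte and then checks which name actually showed up on disk.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a file through a path buffer with an embedded null byte, then
 * checks whether the file exists under the name before the null byte.
 * Returns 1 if it does, 0 if it does not, and -1 if open() failed. */
int path_truncated_at_null(void)
{
    const char path[] = "/tmp/nulldemo\0ignored";  /* demo name, an assumption */
    struct stat st;

    unlink("/tmp/nulldemo");  /* start from a clean slate; errors are ignored */

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* The kernel only ever saw the bytes up to the first null byte. */
    int found = (stat("/tmp/nulldemo", &st) == 0);
    unlink("/tmp/nulldemo");  /* remove the demo file again */
    return found;
}
```

The function returns 1 when the file was created as /tmp/nulldemo, which is what you'd expect given that everything after the null byte never reaches the kernel's path lookup.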
While we're on the topic, it's worth noting that the only other restriction on
filenames is that they cannot contain a /, which is the character used to
denote directories. Filenames can contain arbitrary other binary data, including
spaces and newlines, and there's no defined character encoding.
Command Arguments
C programs define an entry point called main() with the following function
prototype:
int main(int argc, char **argv);
The parameter argc contains the number of arguments, and argv is an array of
null-terminated strings representing the command line arguments for the program
(with a final null element). Suppose you were to try to encode a null byte in
one of the argv parameters. The issue is that there is no parameter specifying
the lengths of the strings in argv. Therefore there's no way for the invoked
program to know if the argv parameters have embedded null bytes or not.
You might wonder if this is just a limitation of the C API. For instance, what if you write a program in assembly? Is there another way to access the argument parameters and get their size?
Actually, the answer is no. Here's one way to think about it. The prototype for
execve(2) is like this:
int execve(const char *filename, char *const argv[], char *const envp[]);
The parameters to execve need to be passed through to the new process. To do
this, the kernel must copy this data into the memory of the new process. Since
the kernel is taking regular char * types (or in this case, arrays of them),
it has to assume they're null-delimited when copying them.
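One way to see this in practice is to exec a program with an argument that has an embedded null byte and capture what the child prints. The sketch below assumes /bin/echo exists on the system; the function name `echo_arg_with_null` is made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs /bin/echo with a single argument taken from a buffer containing an
 * embedded null byte, and copies the child's output into out (at most
 * outsz - 1 bytes, always null-terminated).
 * Returns the number of bytes captured, or -1 on error. */
ssize_t echo_arg_with_null(char *out, size_t outsz)
{
    char arg[] = "foo\0bar";
    char *child_argv[] = { "echo", arg, NULL };
    char *child_envp[] = { NULL };
    int fds[2];

    if (outsz == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: route stdout into the pipe, then exec echo. */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execve("/bin/echo", child_argv, child_envp);
        _exit(127);  /* only reached if execve failed */
    }

    /* Parent: read whatever the child wrote until end of file. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < outsz - 1 &&
           (n = read(fds[0], out + total, outsz - 1 - total)) > 0)
        total += (size_t)n;
    close(fds[0]);
    waitpid(pid, NULL, 0);

    out[total] = '\0';
    return (ssize_t)total;
}
```

With a standard echo, the captured output is "foo" followed by a newline: the "bar" part of the buffer is never copied into the new process.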
If you have a program that needs to be able to work with embedded null bytes for parameters, you should have a way to specify such parameters either on stdin, or via a file; or preferably, both.
Environment Variables
If you were paying close attention above, you'll recall that execve(2)
actually takes three parameters. The third parameter is a list of environment
variables for the new process. In fact, most C compilers on Unix systems will
allow you to define main() with the prototype:
int main(int argc, char **argv, char **envp);
Under the hood, library calls like getenv(3) and setenv(3) are implemented by accessing this environment array (or a copy of it).
You can't have null bytes in environment variables (or their values) for exactly
the same reason that you can't have them in argv parameters.
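The same limitation already shows up in the library API, since setenv(3) and getenv(3) only deal in null-terminated strings. A minimal sketch follows; `NULLDEMO` is a hypothetical variable name and `env_value_length` a made-up helper for this demo.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Stores a value written with an embedded null byte in the environment
 * and returns the length of what getenv() hands back, or -1 on error. */
int env_value_length(void)
{
    /* setenv() takes a const char *, so it only sees "foo". */
    if (setenv("NULLDEMO", "foo\0bar", 1) != 0)
        return -1;

    const char *value = getenv("NULLDEMO");
    if (value == NULL)
        return -1;
    return (int)strlen(value);
}
```

The function returns 3: the bytes after the null byte never make it into the environment in the first place.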
Bonus: "Abstract" Unix Domain Sockets
In this "bonus" section I'm going to describe an unexpected place where you
can use embedded null bytes. Linux implements an esoteric, non-standard
extension for AF_UNIX sockets that allows you to use null bytes in a
surprising way. This feature is documented in the Linux man page for unix(7).
Here's how it works. The system calls bind(2) and connect(2) accept a sockaddr
struct describing the address to connect to or bind on, as well as another
parameter called addrlen that describes the size of the sockaddr struct. The
value for the addrlen parameter is normally just the sizeof the concrete
address struct that addr points to:
int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);
The reason this unusual addrlen parameter is required is that the socket
address needs to be polymorphic across different types of socket structures.
There are multiple socket families that all have different addressing schemes.
For instance, the address of an IPv4 socket uses a 32-bit IPv4 address, the
address of an IPv6 socket uses a 128-bit IPv6 address, and the address of a Unix
socket normally uses a filesystem path. These are all represented with different
underlying struct types: an IPv4 socket uses a struct sockaddr_in, an IPv6
socket uses a struct sockaddr_in6, and a Unix socket uses a struct sockaddr_un.
Since C doesn't have real polymorphism, what you do is declare a concrete type
like sockaddr_in or sockaddr_un and then supply bind(2) and connect(2) with a
pointer to that struct, cast as a sockaddr *. The true length of the underlying
socket address is given via addrlen. In a more modern language you'd implement
this polymorphism using inheritance or some type of abstract interface, but C
doesn't have these capabilities. The addrlen parameter can be thought of as a
clever hack to work around this limitation of the C language.
For Unix sockets on Linux, the sockaddr_un type is defined like this:
struct sockaddr_un {
    sa_family_t sun_family;  /* AF_UNIX */
    char sun_path[108];      /* pathname */
};
Normally you'd put a regular null-terminated filesystem path in the sun_path
field.
The "abstract" socket feature on Linux instead works like this: you set the
first byte in sun_path to \0, and then put up to 107 additional bytes after it.
Then the addrlen parameter to bind(2) or connect(2) is set to
sizeof(sa_family_t), which is two, plus the number of bytes you put into
sun_path, including the initial null byte.
The kernel looks at the first two bytes that addr points to, which always hold
a sa_family_t representing the socket family type. If it sees AF_UNIX, it
then computes the size of the value in sun_path using addrlen - 2. In this
way the kernel can explicitly tell how large the value stored in sun_path is,
which is why using an initial null byte is possible. If the first byte in
sun_path is zero then the kernel considers the socket name to be "abstract".
Such a socket will exist in memory in the kernel, but does not correspond to a
filesystem path.
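Putting the steps above together, here is a rough, Linux-only sketch. The helper names `make_abstract_addr` and `abstract_roundtrip` are invented for this example. It uses offsetof(struct sockaddr_un, sun_path) in place of the literal two, which is the form the unix(7) man page uses and amounts to the same value on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fills addr with an abstract address: a leading null byte followed by
 * name_len bytes of name, which may themselves contain null bytes.
 * Returns the addrlen to pass to bind(2)/connect(2), or 0 if the name
 * does not fit in sun_path. */
socklen_t make_abstract_addr(struct sockaddr_un *addr,
                             const char *name, size_t name_len)
{
    if (name_len > sizeof(addr->sun_path) - 1)
        return 0;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    addr->sun_path[0] = '\0';                    /* marks the name as abstract */
    memcpy(addr->sun_path + 1, name, name_len);  /* copied by length, not by scanning */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + name_len);
}

/* Binds a listening socket to an abstract name and connects a client to
 * the same name. Returns 0 on success, -1 on any failure. */
int abstract_roundtrip(const char *name, size_t name_len)
{
    struct sockaddr_un addr;
    socklen_t len = make_abstract_addr(&addr, name, name_len);
    if (len == 0)
        return -1;

    int result = -1;
    int server = socket(AF_UNIX, SOCK_STREAM, 0);
    int client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (server >= 0 && client >= 0 &&
        bind(server, (struct sockaddr *)&addr, len) == 0 &&
        listen(server, 1) == 0 &&
        connect(client, (struct sockaddr *)&addr, len) == 0)
        result = 0;

    /* Closing both sockets releases the abstract name; nothing to unlink. */
    if (client >= 0)
        close(client);
    if (server >= 0)
        close(server);
    return result;
}
```

A call such as `abstract_roundtrip("demo\0name", 9)` uses a nine-byte name with a null byte in the middle, so the addrlen passed to the kernel is 2 + 1 + 9 = 12.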
Abstract sockets have a few interesting uses. One interesting thing about them is that they're reference counted by the kernel. For regular Unix sockets defined in the filesystem, you may need to take care to remove stale sockets from the filesystem after use. Abstract sockets don't have this problem: once an abstract socket is no longer in use by any process, the kernel automatically cleans it up.