PATH_MAX Is Tricky

April 25, 2017

One of the defining features of Unix is its hierarchical filesystem: directories on Unix systems can contain other directories, without a limit to the depth of the nesting. This isn't a big deal nowadays, but Unix was one of the first operating systems to feature a hierarchical filesystem. And believe it or not, developers today are still writing code that misuses PATH_MAX and doesn't handle long file paths correctly.

This constant is defined by POSIX, and is supposed to be the largest possible size for a filesystem path. There are a few compelling reasons to define such a limit:

Fixing the size of paths makes it easier to declare file paths inline in structs or on the stack, simplifying manual memory management.
In practice most filesystems have a limit on the length of filenames, so it makes sense to expose this limit somehow.

Linux defines PATH_MAX as 4096 bytes.

The problem is that you can't meaningfully define a constant like this in a header file. The maximum path size ought to be something like a filesystem limitation, or at the very least a kernel parameter. This means that it's a dynamic value, not something preordained. The <limits.h> header file doesn't know what filesystems you're trying to use, or what kernel you're running; it just exports a static value. For this reason alone we know that the value of PATH_MAX is at best a lower bound.

File Paths Can Be Arbitrarily Long

Most filesystems will have some limits on files, such as a maximum length on components in a file path. The limit on path components (also known as the file name limit) is defined as NAME_MAX, generally 255 bytes. But a file path can have many components, and thus a full path can be much longer. Unix filesystems have directory inodes that map relative file names to file inodes, and file inodes do not actually contain file names at all.

Unix also allows filesystems to be mounted hierarchically. Even if a hypothetical filesystem had a size limitation on the length of full path names (and none of the mainstream ones do), that filesystem could be mounted at a mount point other than /. The mounted filenames would all be exposed with the mount point as a prefix, and thus their full file names would become longer than what the underlying filesystem supported!

Exercise for the reader: Consider how one could implement hard links on a Unix system, and why hard links preclude storing full file paths in inodes.

System Calls and `ENAMETOOLONG`

As a practical consideration, the kernel must enforce a limit on the length of all strings supplied via system calls. There are a couple of reasons for this, but the most important is that the kernel must actually do a memory copy of non-value parameters like strings from userspace into kernel memory, e.g. using copy_from_user(). For system call arguments that are file names, the kernel will return ENAMETOOLONG if the supplied file name is too long. On Linux, system calls like open(2) perform this check via getname_flags(). The check is performed using the 4096 byte PATH_MAX value, which, as I just finished explaining, is not really a file path limit!

PATH_MAX is actually defined as the maximum permitted size of file paths supplied via system calls. If you try to open a path whose length equals or exceeds 4096 bytes, you'll get an error. But that doesn't mean it's impossible to open such a file: it just means that you need to use a shorter (relative) file path when opening the file.
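The system call limit is easy to observe directly. The following is a small sketch, not a definitive test: errno_for_path_length() is a hypothetical helper that builds a path of the requested length out of one-letter components and reports the errno that open(2) produces for it. The path is not expected to exist; the point is only to see which error comes back.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: try to open a path of exactly `len` bytes of the
 * form "a/a/a/..." and return the resulting errno, or 0 if open()
 * unexpectedly succeeded. */
int errno_for_path_length(size_t len)
{
    char *path = malloc(len + 1);
    if (path == NULL)
        return ENOMEM;

    /* Alternate 'a' and '/' so that every component is one byte long and
     * NAME_MAX never comes into play. */
    for (size_t i = 0; i < len; i++)
        path[i] = (i % 2 == 0) ? 'a' : '/';
    path[len] = '\0';

    int fd = open(path, O_RDONLY);
    int err = (fd < 0) ? errno : 0;
    if (fd >= 0)
        close(fd);
    free(path);
    return err;
}
```

On Linux, a call such as errno_for_path_length(8192) should report ENAMETOOLONG, while a short path of the same shape typically fails with ENOENT instead. The kernel rejects the long string before it ever looks at the filesystem.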

Many functions defined in libc can accept or return file names, and those file names are not necessarily limited by the size of PATH_MAX.

Path Metadata With `pathconf()`

To make things more sane, POSIX defines a less well-known function called pathconf(), analogous to sysconf(). This function lets you get at low-level information about kernel limits related to things like path lengths:

// POSIX method for getting file path metadata.
long pathconf(const char *path, int name);

The maximum length of a relative path name can be fetched for a path by supplying the value _PC_PATH_MAX as the second argument. There are a few important usage caveats here.

The first thing you'll notice when using this API is that pathconf() takes a file path as its argument. Thus you can't use pathconf() to get the maximum file path for arbitrary files, because there is no single limit that applies to every file. You can only use pathconf() with a file whose name you already know.

When you do know the filename, the return value for pathconf() with _PC_PATH_MAX is the maximum relative path size, since Unix files don't really have absolute paths. Therefore the data returned with _PC_PATH_MAX is not as general as what you might think at first: most code will need the ability to handle longer paths than what pathconf() returns.
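A minimal sketch of querying this value follows. The wrapper relative_path_limit() is a hypothetical name. The one subtle part is that pathconf() returns -1 both on error and when there is no determinate limit, so errno has to be cleared first to tell the two cases apart.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper around pathconf(dir, _PC_PATH_MAX).
 * Returns  1 and stores the limit in *limit if a limit is reported,
 *          0 if the system reports no determinate limit,
 *         -1 on error (errno is set by pathconf()). */
int relative_path_limit(const char *dir, long *limit)
{
    errno = 0; /* lets us distinguish "no limit" from a real error */
    long value = pathconf(dir, _PC_PATH_MAX);
    if (value != -1) {
        *limit = value;
        return 1;
    }
    return (errno == 0) ? 0 : -1;
}
```

Calling relative_path_limit("/", &limit) is a reasonable smoke test. Whatever number comes back describes relative paths starting from that directory, not the length of absolute paths in general.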

What Directory Am I In?

The evolution of the Unix filesystem APIs is illustrative of how to properly deal with long file paths. I'm going to use accessing the current directory as an example of how things have changed. Back in ancient times, you would have used getwd() to get the current directory name:

// Deprecated old-school Unix way of getting the current working directory.
char *getwd(char *buf);

The buffer supplied to getwd() is supposed to be at least PATH_MAX bytes in length. This will fail in a bunch of cases, since PATH_MAX isn't a reliable way to tell the maximum length of a directory name. This was fixed by introducing a new, more general method, called getcwd(). The major difference is that it accepts another parameter indicating the buffer size:

// Current POSIX way of getting the current directory; does not allocate.
char *getcwd(char *buf, size_t size);

If the buffer you supply is too small, getcwd() will return a null pointer and set errno to ERANGE. Since paths can be of arbitrary size, to correctly use getcwd() you actually need a loop that resizes the underlying buffer and retries when this happens.

The POSIX specification for getcwd() says the behavior is unspecified if buf is a null pointer. GNU libc takes advantage of this by turning getcwd() into an allocating version when a null pointer is supplied as the buffer. To simplify this further, GNU libc defines an extension called get_current_dir_name() that takes no parameters, and just returns a newly-allocated directory name for you:

// GNU extension, caller must call free() after.
char *get_current_dir_name(void);

GNU libc implements get_current_dir_name() on top of getcwd() called with a null pointer. Portable code can check to see if get_current_dir_name() is available, and then fall back to a loop that uses getcwd() if necessary.

Canonicalizing File Names

Not everything in POSIX has been updated for compatibility with long file names. POSIX defines a function called realpath(), which can be used to get a "canonical" path for a file, i.e. an absolute path with symbolic links resolved and without extra slashes or . and .. components. It takes a path to canonicalize, and an output buffer to store the canonical path in:

// POSIX way to get the "canonical" path for a file.
char *realpath(const char *path, char *resolved_path);

The caller is supposed to supply an output buffer to realpath() whose size is at least PATH_MAX bytes. As we know, this isn't sufficient. Unlike the previous example of getting the current directory, POSIX doesn't define a version of realpath() that specifies the size of the buffer the return value should be copied into.

As in the previous example, older versions of POSIX (POSIX.1-2001) left the behavior up to the implementation when a null pointer is used as the output parameter. As before, GNU libc took advantage of this to turn realpath() into an allocating version when the second parameter is null. When using this interface, you supply a null value for resolved_path, and then later you're expected to call free() on the returned pointer. POSIX.1-2008 has since standardized this allocating behavior. To make things even easier, GNU libc exports a non-standard function called canonicalize_file_name() which is like realpath(), but only takes one argument, the file path to be resolved. Under the hood, canonicalize_file_name() just calls realpath() with the supplied path and a null pointer for the resolved path. This is exactly analogous to the previous example, and is a good example of how intentional ambiguity in the POSIX specification allows vendor extensions.
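Here is a short sketch of the allocating form in use. It assumes a libc in which realpath() accepts a null resolved_path and returns a malloc()ed string, as GNU libc does. The helper same_canonical_path() is a hypothetical name chosen for this example.

```c
#define _XOPEN_SOURCE 700 /* request the declaration of realpath() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare two paths by their canonical forms.
 * Returns 1 if they resolve to the same path, 0 if they differ, and -1 if
 * either one cannot be resolved (for example because it does not exist). */
int same_canonical_path(const char *a, const char *b)
{
    /* Passing NULL asks realpath() to allocate a buffer of the right size,
     * so no PATH_MAX-sized array is needed. */
    char *ra = realpath(a, NULL);
    char *rb = realpath(b, NULL);
    int result = -1;

    if (ra != NULL && rb != NULL)
        result = (strcmp(ra, rb) == 0);

    free(ra); /* free(NULL) is a no-op, so this is safe on failure too */
    free(rb);
    return result;
}
```

For instance, same_canonical_path("/", "/./..//") should return 1, because both arguments canonicalize to the root directory.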

The GNU libc man page for realpath() has some interesting notes about these issues and the vagaries of PATH_MAX and pathconf(). The source code is also interesting: the GNU libc source code for realpath() is nearly 200 lines long, which includes the logic for computing the right buffer size, and a lot of very careful error handling. This code also demonstrates the correct use of pathconf() with _PC_PATH_MAX. Take a gander if you're interested in seeing very portable, correct C file handling code.