One of the defining features of Unix is its hierarchical filesystem: directories
on Unix systems can contain other directories, without a limit to the depth of
the nesting. This isn't a big deal nowadays, but Unix was one of the first
operating systems to feature a hierarchical filesystem. And believe it or not,
developers are still writing code today that misuses PATH_MAX and doesn't
handle long file paths correctly.
This constant is defined by POSIX, and is supposed to be the largest possible size for a filesystem path. There are a few compelling reasons to define such a limit:
- Fixing the size of paths makes it easier to declare file paths inline in structs or on the stack, simplifying manual memory management.
- In practice most filesystems have a limit on the length of filenames, so it makes sense to expose this limit somehow.
Linux defines PATH_MAX as 4096 bytes.
The problem is that you can't meaningfully define a constant like this in a
header file. The maximum path size is actually something like a filesystem
limitation, or at the very least a kernel parameter. This means that it's a
dynamic value, not something preordained. The <limits.h> header file doesn't
know what filesystems you're trying to use, or what kernel you're running; it
just exports a static value. For this reason alone we know that the value of
PATH_MAX is at best a lower bound.
File Paths Can Be Arbitrarily Long
Most filesystems will have some limits on files, such as a maximum length on
components in a file path. The limit on path components (also known as the file
name limit) is defined as NAME_MAX, generally 255 bytes. But a file path can
have many components, and thus a full path can be much longer. Unix filesystems
have directory inodes that map relative file names to file inodes, and file
inodes do not actually contain file names at all.
Unix also allows filesystems to be mounted hierarchically. Even if a
hypothetical filesystem had a size limitation on the length of full path names
(and none of the mainstream ones do), that filesystem could be mounted at a
mount point other than /. The mounted filenames would all be exposed with the
mount point as a prefix, and thus their full file names would become longer than
what the underlying filesystem supported!
Exercise for the reader: Consider how one could implement hard links on a Unix system, and why hard links preclude storing full file paths in inodes.
System Calls and ENAMETOOLONG
As a practical consideration, the kernel must enforce a limit on the length of
all strings supplied via system calls. There are a couple of reasons for this,
but the most important is that the kernel must actually do a memory copy of
non-value parameters like strings from userspace into kernel memory, e.g.
using copy_from_user().
For system call arguments that are file names, the kernel will return
ENAMETOOLONG if the supplied file name is too long. On Linux, system calls
like open(2) perform this check via getname_flags().
The check is performed using the 4096 byte PATH_MAX value, which, as I just
finished explaining, is not really a file path limit!
PATH_MAX is actually defined as the maximum permitted size of file paths
supplied via system calls. If you try to open a path whose length equals or
exceeds 4096 bytes, you'll get an error. But that doesn't mean it's impossible
to open such a file: it just means that you need to use a shorter (relative)
file path when opening the file, for example a path relative to an intermediate
directory that you have already opened or changed into.
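As a rough illustration of that workaround, the sketch below walks a path one component at a time with openat(), so that no single system call is handed the whole string. The helper names open_by_components() and make_dot_path() are invented for this example, and the code assumes a POSIX.1-2008 system that provides openat() and O_DIRECTORY. Treat it as a minimal sketch under those assumptions, not a hardened implementation.

```c
#define _POSIX_C_SOURCE 200809L  /* for openat(), O_DIRECTORY and strdup() */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build "./././." containing `repeats` copies of "./".
 * The caller must free() the result. */
char *make_dot_path(size_t repeats) {
    char *path = malloc(repeats * 2 + 2);
    if (path == NULL) return NULL;
    for (size_t i = 0; i < repeats; i++) memcpy(path + i * 2, "./", 2);
    path[repeats * 2] = '.';
    path[repeats * 2 + 1] = '\0';
    return path;
}

/* Hypothetical helper: open `path` by walking it one component at a time,
 * so each openat() call only sees a single short name. `flags` applies to
 * the final component; O_CREAT is not supported because no mode is passed.
 * Returns a file descriptor, or -1 with errno set. */
int open_by_components(const char *path, int flags) {
    if (path[0] == '\0') { errno = ENOENT; return -1; }

    char *copy = strdup(path);           /* scratch copy we can cut up */
    if (copy == NULL) return -1;

    /* Start at "/" for absolute paths, else at the current directory. */
    int dirfd = open(path[0] == '/' ? "/" : ".", O_RDONLY | O_DIRECTORY);
    char *p = copy;
    while (dirfd != -1) {
        p += strspn(p, "/");             /* skip separators */
        if (*p == '\0') break;           /* the path was only slashes */
        char *next = p + strcspn(p, "/");
        if (*next != '\0') *next++ = '\0';
        next += strspn(next, "/");
        int last = (*next == '\0');      /* no more components follow */

        int fd = openat(dirfd, p, last ? flags : (O_RDONLY | O_DIRECTORY));
        int saved = errno;
        close(dirfd);                    /* done with the parent directory */
        errno = saved;
        dirfd = fd;                      /* -1 here ends the loop */
        if (last) break;
        p = next;
    }
    int saved = errno;
    free(copy);
    errno = saved;
    return dirfd;
}
```

On Linux, one would expect open() to reject the 8193-byte string returned by make_dot_path(4096) with ENAMETOOLONG, while open_by_components() should succeed on the same string, because each openat() call only sees the single component ".".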
Many functions defined in libc can accept or return file names, and those file
names are not necessarily limited by the size of PATH_MAX.
Path Metadata With pathconf()
To make things more sane, POSIX defines a less well-known function called
pathconf(), analogous to sysconf().
This function lets you get at low-level information about kernel limits related
to things like path lengths:
// POSIX method for getting file path metadata.
long pathconf(const char *path, int name);
The maximum relative path name can be fetched for a path by supplying the value
_PC_PATH_MAX as the second argument. There are a few important usage caveats
here.
The first thing you'll notice when using this API is that pathconf() takes a
file path as its argument. Thus you can't use pathconf() to get the maximum
file path for arbitrary files, because there isn't a single system-wide limit.
You can only use pathconf() with a file whose name you already know.
When you do know the filename, the return value for pathconf() with
_PC_PATH_MAX is the maximum relative path size, since Unix files don't
really have absolute paths. Therefore the data returned with _PC_PATH_MAX is
not as general as what you might think at first: most code will need the ability
to handle longer paths than what pathconf() returns.
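A small sketch of calling pathconf() may make the caveats more concrete. POSIX specifies that a return value of -1 with errno left unchanged means the limit is indeterminate, while -1 with errno set means a real error, so the wrapper below keeps those cases apart. The names query_path_limit() and print_limits() are hypothetical helpers written for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical wrapper around pathconf() that separates its three outcomes.
 * Returns 1 and stores the value in *limit if a limit exists,
 * 0 if the limit is indeterminate (-1 returned, errno left untouched),
 * and -1 on a real error (errno is set). */
int query_path_limit(const char *path, int name, long *limit) {
    errno = 0;
    long value = pathconf(path, name);
    if (value == -1) return errno == 0 ? 0 : -1;
    *limit = value;
    return 1;
}

/* Print the relative-path and component limits reported for one directory. */
void print_limits(const char *dir) {
    const int names[] = { _PC_PATH_MAX, _PC_NAME_MAX };
    const char *labels[] = { "_PC_PATH_MAX", "_PC_NAME_MAX" };

    for (int i = 0; i < 2; i++) {
        long limit;
        int rc = query_path_limit(dir, names[i], &limit);
        if (rc == 1)
            printf("%s: %s = %ld\n", dir, labels[i], limit);
        else if (rc == 0)
            printf("%s: %s has no fixed limit\n", dir, labels[i]);
        else
            perror(dir);
    }
}
```

Calling print_limits("/") and print_limits() on a mount point backed by a different filesystem is one way to see whether the reported values differ on a given system. Even then, the _PC_PATH_MAX value only describes relative paths passed to system calls from that directory.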
What Directory Am I In?
The evolution of the Unix filesystem APIs is illustrative of how to properly
deal with long file paths. I'm going to use accessing the current directory as
an example of how things have changed. Back in ancient times, you would have
used getwd() to get the current directory name:
// Deprecated old-school Unix way of getting the current working directory.
char *getwd(char *buf);
The buffer supplied to getwd() is supposed to be at least PATH_MAX bytes in
length. This will fail in a bunch of cases, since PATH_MAX isn't a reliable
way to tell the maximum length of a directory name. This was fixed by
introducing a new, more general function, called getcwd(). The major difference
is that it accepts another parameter indicating the buffer size:
// Current POSIX way of getting the current directory; does not allocate.
char *getcwd(char *buf, size_t size);
If the buffer you supply is too small, getcwd() will return a null pointer and
set errno to ERANGE. Since paths can be of arbitrary size, to correctly use
getcwd() you actually need a loop that resizes the underlying buffer and
retries when this happens.
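A minimal sketch of such a retry loop follows. It relies on the POSIX behavior that getcwd() fails with ERANGE when the buffer is too small. The function name current_dir_alloc() and the doubling growth policy are choices made for this example.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: return the current directory in a malloc()ed buffer,
 * doubling the buffer until getcwd() stops reporting ERANGE.
 * Returns NULL with errno set on failure; the caller must free() the result. */
char *current_dir_alloc(size_t initial_size) {
    /* getcwd() rejects a size of 0 with EINVAL, so pick a default instead. */
    size_t size = initial_size > 0 ? initial_size : 64;
    char *buf = NULL;

    for (;;) {
        char *bigger = realloc(buf, size);
        if (bigger == NULL) {
            free(buf);
            errno = ENOMEM;
            return NULL;
        }
        buf = bigger;

        if (getcwd(buf, size) != NULL)
            return buf;                  /* the path fit */

        int saved = errno;
        if (saved != ERANGE || size > SIZE_MAX / 2) {
            free(buf);                   /* real error, or cannot grow further */
            errno = saved;
            return NULL;
        }
        size *= 2;                       /* too small: grow and retry */
    }
}
```

Starting from a deliberately tiny buffer, such as current_dir_alloc(1), is a cheap way to exercise the resize path, since even "/" needs two bytes.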
The POSIX specification for getcwd() says the behavior is unspecified if buf
is a null pointer.
GNU libc takes advantage of this by turning getcwd() into an allocating
version when a null pointer is supplied as the buffer. To simplify this further,
GNU libc defines an extension called get_current_dir_name() that takes no
parameters, and just returns a newly-allocated directory name for you:
// GNU extension, caller must call free() after.
char *get_current_dir_name(void);
The GNU libc version of get_current_dir_name() is actually implemented by
calling getcwd() with a null pointer.
Portable code can check to see if get_current_dir_name() is available, and
then fall back to a loop that uses getcwd() if necessary.
Canonicalizing File Names
Not everything in POSIX has been updated for compatibility with long file names.
POSIX defines a function called realpath(), which can be used to get
a "canonical" path for a file, i.e. one that doesn't include extra slashes or
dots. It takes a path to canonicalize, and an output buffer to store the
canonical path in:
// POSIX way to get the "canonical" path for a file.
char *realpath(const char *path, char *resolved_path);
The caller is supposed to supply an output buffer to realpath() whose size is
at least PATH_MAX bytes. As we know, this isn't sufficient. Unlike the
previous example of getting the current directory, POSIX doesn't define a
version of realpath() that specifies the size of the buffer the return value
should be copied into.
As in the previous example, older versions of POSIX left the behavior up to the
implementation when a null pointer is used as the output parameter (POSIX.1-2008
has since standardized the allocating behavior).
As before, GNU libc takes advantage of this to turn realpath() into an
allocating version when the second parameter is null. When using this interface,
you supply a null value for resolved_path, and then later you're expected to
call free() on the returned pointer. To make things even easier, GNU libc
exports a non-standard function called canonicalize_file_name() which is like
realpath(), but only takes one argument, the file path to be resolved. Under
the hood, canonicalize_file_name() is implemented by just calling realpath()
with the supplied path, and a null parameter for the resolved path. This is
exactly analogous to the previous example, and is a good example of how
intentional ambiguity in the POSIX specification allows vendor extensions.
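The allocating form can be sketched as follows. The example assumes a libc where realpath() accepts a null second argument and returns a malloc()ed string, which holds for GNU libc and for POSIX.1-2008 systems. The helper same_canonical_path() is a hypothetical function written for this illustration.

```c
#define _XOPEN_SOURCE 700  /* request the POSIX.1-2008 declaration of realpath() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare two paths by their canonical names.
 * Returns 1 if they resolve to the same name, 0 if they differ,
 * and -1 if either path cannot be resolved (errno is set by realpath()). */
int same_canonical_path(const char *a, const char *b) {
    /* Passing NULL asks realpath() to allocate a buffer of the right size,
     * so no PATH_MAX-sized array is needed here. */
    char *resolved_a = realpath(a, NULL);
    if (resolved_a == NULL) return -1;

    char *resolved_b = realpath(b, NULL);
    if (resolved_b == NULL) {
        free(resolved_a);
        return -1;
    }

    int same = strcmp(resolved_a, resolved_b) == 0;
    free(resolved_a);
    free(resolved_b);
    return same;
}
```

For instance, same_canonical_path("/", "/./") should report a match, since both spellings canonicalize to the root directory.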
The GNU libc man page for realpath() has some interesting notes here about
these issues and the vagaries of PATH_MAX and pathconf(). The source code is
also interesting: the GNU libc source code for realpath() is nearly 200 lines
long, which includes the logic for computing the right buffer size, and a lot of
very careful error handling. This code also demonstrates the correct usage of
pathconf() with _PC_PATH_MAX. Take a gander if you're interested in seeing
very portable, correct C file handling code.