Recently I was writing some code where I wanted to wait for a child process, and I wanted the wait call to have a timeout. The use case is something like this: you spawn a subprocess, and you expect the subprocess to complete within ten seconds. If it doesn't complete in that time, you want to treat it as an error (and perhaps kill the child).
There are a lot of "wait" system calls on Linux. In section 2 of the Linux man pages you get all of the following:
pid_t wait(int *wstatus);
pid_t waitpid(pid_t pid, int *wstatus, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
pid_t wait3(int *wstatus, int options, struct rusage *rusage);
pid_t wait4(pid_t pid, int *wstatus, int options, struct rusage *rusage);
Wow, look at all of those ways to wait for a process! In reality, a lot of these
are wrapper functions required by POSIX or provided by glibc. Only two of these
are true system calls on Linux (on x86-64, at least): waitid() and wait4(). But
that's still a lot of ways to wait for things.
As you can see, none of these accept a timeout. I Googled the question and
found a Stack Overflow post titled "Waitpid equivalent with timeout?". The top
answer suggests using alarm(), which apparently is wrong. Of course it's wrong:
if you've done a lot of Unix systems programming, you'll know that alarm() is
always the wrong answer. Then there are numerous other answers that go into
crazy gymnastics to solve the problem.
Modern Linux systems have a system call called signalfd(), which allows you to
register a file descriptor to receive signal events. With this technique, you
can register a signalfd for SIGCHLD events, and then put it into an epoll or
select loop with a timeout. This is a lot simpler than the other Stack Overflow
answers, but is still kind of complicated. Furthermore, signalfd() wasn't added
to Linux until 2007, with kernel 2.6.22. This is certainly old enough for pretty
much all real-world Linux applications, but it's not a standard Unix feature and
therefore isn't portable. On classic Unix systems you need to resort to the kind
of tricks in the Stack Overflow post.
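As a rough sketch of the signalfd() technique described above (Linux only), something like the following could work. The helper names open_sigchld_fd() and wait_child_signalfd() are made up for illustration, poll() stands in for the select or epoll loop, and the code assumes a single-threaded program that blocks SIGCHLD before calling fork().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block SIGCHLD and return a signalfd that reports it.
 * Call this before fork() so an early child exit cannot be missed. */
int open_sigchld_fd(void)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;
    return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/* Hypothetical helper: wait up to timeout_ms for child `pid`.
 * Returns 1 if the child was reaped (status in *wstatus), 0 on timeout,
 * -1 on error. Simplification: the timeout restarts whenever poll() wakes
 * up for a different child or is interrupted by a signal. */
int wait_child_signalfd(int sfd, pid_t pid, int *wstatus, int timeout_ms)
{
    struct pollfd pfd = { .fd = sfd, .events = POLLIN };

    assert(wstatus != NULL);
    for (;;) {
        /* Check first: the child may already have exited. */
        pid_t r = waitpid(pid, wstatus, WNOHANG);
        if (r == pid)
            return 1;
        if (r < 0)
            return -1;

        int n = poll(&pfd, 1, timeout_ms);
        if (n == 0)
            return 0; /* timed out */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        /* Drain the pending SIGCHLD so the fd stops being readable. */
        struct signalfd_siginfo si;
        if (read(sfd, &si, sizeof si) < 0 && errno != EAGAIN)
            return -1;
    }
}
```

A caller would invoke open_sigchld_fd() once, fork the child, and then call wait_child_signalfd() with the ten second budget from the opening example (10000 ms), killing the child if the result is 0.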
These poor API decisions come up more frequently in Unix than people like to admit. In fact, the reason there are so many "wait" calls available is that the original APIs were poorly designed and had to be extended to be more flexible.
Things get even worse with a lot of the blocking I/O system calls. For instance, suppose you want to create a directory. You get two choices:
int mkdir(const char *pathname, mode_t mode);
int mkdirat(int dirfd, const char *pathname, mode_t mode);
Neither of these takes a timeout, and neither of them exposes an interface that
can be used with select or epoll. If you've ever had the displeasure of reading
the libuv source code, you'll know that there's a trick to turning these kinds
of I/O operations into something you can put into an event loop: you run the
desired operation (mkdir() in this case) in another thread, and then wait for
the thread to finish with a timeout. I've been told it's common practice for
vendors of things like NFS hardware appliances to patch the kernel (most likely
BSD in this case) to add new system calls to make it possible to implement
operations like this natively.
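To make the thread trick concrete, here is a minimal sketch using POSIX threads. It is not libuv's actual code: the names mkdir_timeout() and struct mkdir_job are invented for this example, and a real event loop would use a thread pool rather than one thread per call. Note that a timeout only stops the caller from waiting; the mkdir() itself keeps running in the detached worker, which frees the job when it eventually finishes. Compile with -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical job record shared between the caller and the worker. */
struct mkdir_job {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    char *path;
    mode_t mode;
    int done;      /* worker has finished */
    int abandoned; /* caller gave up; worker must free the job */
    int result;    /* 0 on success, otherwise an errno value */
};

static void job_free(struct mkdir_job *job)
{
    pthread_cond_destroy(&job->cv);
    pthread_mutex_destroy(&job->mu);
    free(job->path);
    free(job);
}

static void *mkdir_worker(void *arg)
{
    struct mkdir_job *job = arg;
    /* This call may block for a long time, e.g. on a hung network mount. */
    int rc = mkdir(job->path, job->mode) == 0 ? 0 : errno;

    pthread_mutex_lock(&job->mu);
    job->result = rc;
    job->done = 1;
    int abandoned = job->abandoned;
    pthread_cond_signal(&job->cv);
    pthread_mutex_unlock(&job->mu);
    if (abandoned)
        job_free(job); /* nobody is waiting any more */
    return NULL;
}

/* Returns 0 on success, ETIMEDOUT on timeout, or another errno value. */
int mkdir_timeout(const char *path, mode_t mode, int timeout_ms)
{
    assert(timeout_ms >= 0);
    struct mkdir_job *job = calloc(1, sizeof *job);
    if (job == NULL)
        return ENOMEM;
    job->path = strdup(path);
    if (job->path == NULL) {
        free(job);
        return ENOMEM;
    }
    job->mode = mode;
    pthread_mutex_init(&job->mu, NULL);
    pthread_cond_init(&job->cv, NULL);

    /* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, mkdir_worker, job);
    if (rc != 0) {
        job_free(job);
        return rc;
    }
    pthread_detach(tid);

    pthread_mutex_lock(&job->mu);
    while (!job->done && rc == 0)
        rc = pthread_cond_timedwait(&job->cv, &job->mu, &deadline);
    int done = job->done;
    int result = job->result;
    if (!done)
        job->abandoned = 1; /* hand ownership of the job to the worker */
    pthread_mutex_unlock(&job->mu);

    if (!done)
        return ETIMEDOUT;
    job_free(job);
    return result;
}
```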
Hopefully one day we can redo all of the I/O stuff in Unix. But I'm not holding my breath.
Update: Henrique Almeida sent me the following email telling me about some system calls I was not familiar with:
Hello, if you need to wait with a timeout for a child to exit I think you should use sigprocmask with either pselect or sigtimedwait and wait for SIGCHLD. You don't need signalfd or alarm.
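The sigtimedwait() variant of that suggestion might look roughly like the sketch below. The helper names block_sigchld() and wait_child_sigtimedwait() are made up for this example, and the code assumes a single-threaded program that blocks SIGCHLD before calling fork().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: block SIGCHLD so it stays pending until we ask for it.
 * Call this before fork(). */
int block_sigchld(void)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Hypothetical helper: wait up to timeout_ms for child `pid`.
 * Returns 1 if the child was reaped (status in *wstatus), 0 on timeout,
 * -1 on error. Simplification: the timeout restarts if a SIGCHLD arrives
 * for a different child or the wait is interrupted. */
int wait_child_sigtimedwait(pid_t pid, int *wstatus, int timeout_ms)
{
    sigset_t set;
    struct timespec timeout = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (long)(timeout_ms % 1000) * 1000000L,
    };

    assert(wstatus != NULL);
    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);

    for (;;) {
        /* Check first: the child may already have exited. */
        pid_t r = waitpid(pid, wstatus, WNOHANG);
        if (r == pid)
            return 1;
        if (r < 0)
            return -1;

        /* Sleep until SIGCHLD becomes pending or the timeout expires. */
        if (sigtimedwait(&set, NULL, &timeout) < 0) {
            if (errno == EAGAIN)
                return 0; /* timed out */
            if (errno == EINTR)
                continue;
            return -1;
        }
    }
}
```

Unlike the signalfd() approach, this uses only POSIX calls, so it should not be tied to Linux.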
After looking into this I see there's also an epoll_pwait(), which is the
equivalent call for the epoll family. So it looks like there are a lot of
options for waiting with a timeout. Things are still a mess from the
asynchronous I/O side of things, however.