I was talking to a coworker recently about an issue where
npm was opening too many files and was then failing with an error related to EMFILE. For quick reference, this is the error that system calls that create file descriptors (e.g. open(2) or socket(2)) return when the calling process has hit its file descriptor limit. I actually don't
know the details of this problem (other than that it was "solved" by increasing
a ulimit), but it reminded me of an interesting but not-well-understood topic
related to the management of resources like file descriptors in garbage
collected languages.
I plan to use this post to talk about the issue, and the ways that different languages work around it.
The Problem
In garbage collected languages there's generally an idea of running a
"finalizer" or "destructor" when an object is collected. In the simplest case,
when the garbage collector collects an object it reclaims that object's memory
somehow. When a finalizer or destructor is associated with an object, in
addition to reclaiming the object's memory the GC will execute the code in the
destructor. For instance, in Python you can do this (with a few important caveats about what happens when the process is exiting!) by implementing a __del__() method on your class. Not all languages support this natively; JavaScript, for example, has no native equivalent.
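To make the Python case concrete, here's a minimal sketch of a class with a finalizer; the class name and the resource it cleans up are made up for illustration:
import os

class TempFile:
    """Hypothetical wrapper, just to show the finalizer hook."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)

    def __del__(self):
        # Runs when the object is collected (modulo the caveats about
        # interpreter shutdown mentioned above).
        os.close(self.fd)

t = TempFile("/etc/hosts")
del t  # with no other references, CPython runs __del__ here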
While the JavaScript language doesn't let you write destructors, real-world JavaScript VMs do have a concept of destructors. Why? Because real JavaScript programs need to do things like perform I/O, and have to interact with real-world resources like browser assets, images, files, and so forth. These constructs, which are provided by an embedding context like the DOM or Node but are not part of the JavaScript language itself, are typically implemented within these VMs by binding against native C or C++ code.
Here's an example. Suppose that you have a Node.js program that opens a file and
does some stuff with it. Under the hood, what is happening is that the Node.js
program has created some file wrapper object that does the open(2)
system call
for you and manages the associated file descriptor, does async read/write system
calls for you, and so on. When the file object is garbage collected, the file descriptor associated with the object needs to be released to the operating system using close(2). There's a mechanism in V8 to run a callback when an object is collected, and the C++ class implementing the file object type uses this to add a destructor callback that handles invoking the close(2) system call when the file object is GC'ed.
A similar thing happens in pretty much every garbage collected language. For
instance, in Python if you have a file
object you can choose to manually
invoke the .close()
method to close the object. But if you don't do that,
that's OK too---if the garbage collector determines that the object should be
collected, it will automatically close the file if necessary in addition to
actually reclaiming the memory used by the object. This works in a similar way
to Node.js, except that instead of this logic being implemented with a V8 C++
binding it's implemented in the Python C API.
So far so good. Here's where the interesting issue is that I want to discuss. Suppose you're opening and closing lots of files really fast in Node.js or Python or some other garbage collected language. This will generate a lot of objects that need to be GC'ed. Despite there being a lot of these objects, the objects themselves are probably pretty small---just the actual object overhead plus a few bytes for the file descriptor and maybe a file name.
The garbage collector determines when it should run and actually collect objects based on a bunch of magic heuristics, but these heuristics are all related to memory pressure---e.g. how much memory it thinks the program is using, how many objects it thinks are collectable, how long it's been since a collection, or some other metric along these lines. The garbage collector itself knows how to count objects and track memory usage, but it doesn't know about extraneous resources like "file descriptors". So what happens is you can easily have hundreds or thousands of file descriptors ready to be closed, but the GC thinks that the amount of reclaimable memory is very small and thinks it doesn't need to run yet. In other words, despite being close to running out of file descriptors, the GC doesn't realize that it can help the situation by reclaiming these file objects since it's only considering memory pressure.
This can lead to situations where you get errors like EMFILE
when
instantiating new file objects, because despite your program doing the "right
thing", the GC is doing something weird.
This gets a lot more insidious with other resources. Here's a classic example.
Suppose you're writing a program in Python or Ruby or whatever else, and that
program is using some bindings to a fancy C library that does some heavy
processing for some task like linear algebra, computer vision, machine learning,
or so forth. To be concrete, let's pretend like it's using bindings to a C
library that does really optimized linear algebra on huge matrices. The bindings
will make some calls into the C library to allocate a matrix when an object is
instantiated, and likewise will have a destructor callback to deallocate the
matrix when the object is GC'ed. Well, since these are huge matrices, your
matrices could easily be hundreds of megabytes, or even many gigabytes in size,
and all of that data will actually be page faulted in and resident in memory. So
what happens is the Python GC is humming along, and it sees this PyObject
that
it thinks is really small, e.g. it might think that the object is only 100
bytes. But the reality is that object has an opaque handle to a 500 MB matrix
that was allocated by the C library, and the Python GC has no way of knowing
that, or even of knowing that there was 500 MB allocated anywhere at all! This happens because the C library is probably using malloc(3) or its own allocator, while the Python VM uses its own memory allocator. So you
can easily have a situation where the machine is low on memory, and the Python
GC has gigabytes of these matrices ready to be garbage collected, but it thinks
it's just managing a few small objects and doesn't GC them in a timely manner.
This example is a bit counterintuitive because it can appear like the language
is leaking memory when it's actually just an impedance mismatch between how the
kernel tracks memory for your process and how the VM's GC tracks memory.
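Here's a hedged sketch of what such bindings often look like, using ctypes against an imaginary libfancymatrix; the library name and its functions are invented for illustration, but the shape is typical:
import ctypes

# Hypothetical native library: these names are made up.
lib = ctypes.CDLL("libfancymatrix.so")
lib.matrix_alloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
lib.matrix_alloc.restype = ctypes.c_void_p
lib.matrix_free.argtypes = [ctypes.c_void_p]

class Matrix:
    def __init__(self, rows, cols):
        # The C library allocates the (possibly huge) buffer with its own
        # allocator; Python only ever sees an opaque pointer.
        self._handle = lib.matrix_alloc(rows, cols)

    def __del__(self):
        # The GC sees a tiny wrapper object; it has no idea this call
        # may release hundreds of megabytes.
        lib.matrix_free(self._handle)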
Again, I don't know if this was the exact problem that my coworker had with npm, but it's an interesting thought experiment---if npm is opening and closing too many files really quickly, it's totally possible that the issue isn't actually a resource leak, but just this GC impedance mismatch.
Generally there's not a way to magically make the GC for a language know about
these issues, because there's no way that it can know about every type of
foreign resource, how important that resource is to collect, and so on.
Typically what you should do in a language like Python or Ruby or JavaScript is
to make sure that objects have an explicit close()
method or similar, and that
method will finalize the resource. Then if the developer really cares about when
the resources are released they can manually call that method. If the developer
forgets to call the close method then you can opt to do it automatically in the
destructor.
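A sketch of that pattern in Python might look like this (the class is hypothetical, but the built-in file object behaves roughly this way):
import os

class Resource:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)

    def close(self):
        # Explicit, deterministic release for callers who care about timing.
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None

    def __del__(self):
        # Fallback: if the caller forgot, release the fd at collection time.
        self.close()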
C++ Solution
C++ has a really elegant solution to this problem that I want to talk about, since it inspired a similar solution in Python that I'm also going to talk about. The solution has the cryptic name RAII---Resource Acquisition Is Initialization. In my opinion this is a really confusing name, since most people really care more about finalization than initialization, but that's what it's called.
Here's how it works. C++ has a special kind of scoping called
block scoping.
This is something that a lot of people think that syntactically similar
languages like JavaScript have, since those languages also have curly braces,
but the block scoping concept is actually totally different. Block scoping means
that variables are bound to the curly brace block they're declared in. When the
context of the closing curly brace is exited, the variable is out of scope. This
is different from JavaScript (or Python) for instance, because in JavaScript
variables are scoped by the function call they're in. So a variable declared in
a for
loop inside of a function is out of scope when the loop exits in C++,
but it is not out of scope until the function exits in JavaScript.
In addition to the block scoping concept, C++ also has a rule that says that when an object goes out of scope its destructor is called immediately. It goes even further to specify the rules about what order the destructors are invoked. If multiple objects go out of scope at once, the destructors are called in reverse order. So suppose you have a block like:
hello();
{
  Foo a;
  Bar b;
}
goodbye();
The following will happen, in order:

- hello() is called
- the constructor for a is called
- the constructor for b is called
- the destructor for b is called
- the destructor for a is called
- goodbye() is called
This is guaranteed by the language. So here's how it works in the context of managing things like file resources. Suppose I want to have a simple wrapper for a file. I might implement code like this:
#include <fcntl.h>   // for open(2)
#include <unistd.h>  // for close(2)

class File {
 public:
  // Acquire the resource in the constructor...
  File(const char *filename) : fd_(open(filename, O_RDONLY)) {}
  // ...and release it in the destructor.
  ~File() { close(fd_); }

 private:
  int fd_;
};

int foo() {
  File my_file("hello.txt");
  bar(my_file);
  return baz();
}
(Note: I'm glossing over some details like handling errors returned by open(2)
and close(2)
, but that's not important for now.)
In this example, we open the file hello.txt, do some things with it, and then it automatically gets closed after the call to baz(). So what? This doesn't seem that much better than explicitly opening the file and closing it. In fact, it would certainly have been a lot less code to just call open(2) at the start of foo(), and then have one extra line at the end to close the file before calling baz().
Well, besides being error prone, there is another problem with that approach.
What if bar()
throws an exception? If we had an explicit call to close(2)
at
the end of the function, then an exception would mean that line of code would
never be run. And that would leak the resource. The C++ RAII pattern ensures
that the file is closed when the block scope exits, so it properly handles the
case of the function ending normally, and also the case where some exception is
thrown somewhere else to cause the function to exit without a return.
The C++ solution is elegant because once we've done the work to write the class wrapping the resource, we generally never need to explicitly close things, and we also get the guarantee that the resource is always finalized, and finalized in a timely manner. And it's automatically exception safe in all cases. Of course, this only works with resources that can be scoped to the stack, but that is true a lot more often than you might suspect.
This pattern is particularly useful with mutexes and other process/thread exclusion constructs where failure to release the mutex won't just cause a resource leak but can cause your program to deadlock.
Python Context Managers
Python has a related concept called "context managers" and an associated syntax
feature called a with
statement.
I don't want to get too deep into the details of the context manager protocol,
but the basic idea is that an object can be used in a with
statement if it
implements two magic methods called __enter__()
and __exit__()
which have a
particular interface. Then when the with
statement is entered the
__enter__()
method is invoked, and when the with
statement is exited for any
reason (an exception is thrown, a return statement is encountered, or the last
line of code in the block is executed) the __exit__()
method is invoked.
Again, there are some details I'm eliding here related to exception handling,
but for the purpose of resource management what's interesting is that this
provides a similar solution to the C++ RAII pattern. When the with
statement
is used we can ensure that objects are automatically and safely finalized by
making sure that finalization happens in the __exit__()
method.
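As a sketch (the wrapper class is made up; in real code you'd just use the built-in open(), which is already a context manager):
import os

class ManagedFile:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)

    def __enter__(self):
        # Called when the with statement is entered.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with statement exits for any reason:
        # normal fall-through, return, or a propagating exception.
        os.close(self.fd)
        return False  # don't suppress exceptions

with ManagedFile("hello.txt") as f:
    print(os.read(f.fd, 16))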
The most important difference here compared to the C++ RAII approach is that you
must remember to use the with
statement with the object to get the context
manager semantics. With C++, RAII typically implies that the object is allocated on the stack, which means that it is automatically reclaimed and there's no chance for you to forget to release the resource.
Go Defer
Go has a syntax feature called defer that lets you ensure that some code is run when the surrounding function exits. This is rather similar to the Python context manager approach, although syntactically it works a lot differently.
The thing that is really nice about this feature is that it lets you run arbitrary code when the function exits, i.e. you can defer any call you like. This makes the feature incredibly flexible---in fact, it is a lot more flexible than the approach that Python and C++ have.
There are a few downsides to this approach in my opinion.
The first downside is that like with Python, you have to actually remember to do it. Unlike C++, it will never happen automatically.
The second downside is that because it's so flexible, it has more potential to
be abused or used in a non-idiomatic way. In Python, if you see an object being
used in a with
statement you know that the semantics are that the object is
going to be finalized when the with
statement is exited. In Go the defer
statement probably occurs close to object initialization, but doesn't
necessarily have to.
The third downside is that the defer statement isn't run until the function it's defined in exits. This is less powerful than C++ (because C++ blocks don't have to be function-scoped) and also less powerful than Python (because a with statement can exit before the enclosing function returns).
I don't necessarily think this construct is worse than C++ or Python, but it is important to understand how the semantics differ.
JavaScript
JavaScript doesn't really have a true analog of the C++/Python/Go approaches, as far as I know. What you can do in JavaScript is to use a try
statement with
a finally
clause. Then in the finally
clause you can put your call to
fileObj.close()
or whatever the actual interface is. Actually, you can also
use this approach in Python if you wish, since Python also has the try/finally
construct.
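Sketched in Python, since the construct is the same there, the pattern looks like this:
f = open("hello.txt")
try:
    data = f.read()
finally:
    # Runs whether the try block finishes normally or raises,
    # so the file is closed either way.
    f.close()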
Like with Go defer statements, it is the caller's responsibility to remember to do this in every case, and if you forget to do it in one place you can have resource leaks. In a lot of ways this is less elegant than Go because the finalization semantics are separated from the initialization code, and this makes the code harder to follow in my opinion.