Resource Management in Garbage Collected Languages

I was talking to a coworker recently about an issue where npm was opening too many files and then failing with EMFILE. For quick reference, EMFILE is the error that system calls which create file descriptors (e.g. open(2) or socket(2)) return when the calling process has hit its file descriptor limit. I don't know the details of this particular problem (other than that it was "solved" by increasing a ulimit), but it reminded me of an interesting and not-well-understood topic: how resources like file descriptors are managed in garbage collected languages.

I plan to use this post to talk about the issue, and the ways that different languages work around it.

The Problem

In garbage collected languages there's generally an idea of running a "finalizer" or "destructor" when an object is collected. In the simplest case, when the garbage collector collects an object it reclaims that object's memory somehow. When a finalizer or destructor is associated with an object, in addition to reclaiming the object's memory the GC will execute the code in the destructor. For instance, in Python you can do this (with a few important caveats about what happens when the process is exiting!) by implementing a __del__() method on your class. Not all languages natively support this. For instance, JavaScript does not have a native equivalent.
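
For example, here's a tiny sketch of a Python finalizer (the class is made up just to show the mechanism):

class Noisy:
    def __del__(self):
        # Runs when the garbage collector reclaims the object
        # (with the usual caveats about interpreter shutdown).
        print("collected!")

n = Noisy()
del n    # in CPython the refcount hits zero and __del__ runs immediately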

While the JavaScript language doesn't let you write destructors, real world JavaScript VMs do have a concept of destructors. Why? Because real JavaScript programs need to do things like perform I/O and interact with real world resources like browser assets, images, files, and so forth. These constructs, which are provided by the host environment (e.g. the DOM in a browser, or the standard library in Node) rather than by the JavaScript language itself, are typically implemented within these VMs by binding against native C or C++ code.

Here's an example. Suppose that you have a Node.js program that opens a file and does some stuff with it. Under the hood, what is happening is that the Node.js runtime has created some file wrapper object that does the open(2) system call for you, manages the associated file descriptor, does async read/write system calls for you, and so on. When the file object is garbage collected, the file descriptor associated with the object needs to be released to the operating system using close(2). There's a mechanism in V8 to run a callback when an object is collected, and the C++ class implementing the file object type uses this to add a destructor callback that handles invoking the close(2) system call when the file object is GC'ed.

A similar thing happens in pretty much every garbage collected language. For instance, in Python if you have a file object you can choose to manually invoke the .close() method to close the object. But if you don't do that, that's OK too---if the garbage collector determines that the object should be collected, it will automatically close the file if necessary in addition to actually reclaiming the memory used by the object. This works in a similar way to Node.js, except that instead of this logic being implemented with a V8 C++ binding it's implemented in the Python C API.
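
As a quick illustration (assuming CPython, where reference counting finalizes the object as soon as the last reference disappears; a purely tracing GC like PyPy's would wait for a collection):

import os

f = open("hello.txt")    # hypothetical file; open(2) happens under the hood
fd = f.fileno()
del f                    # last reference gone, so the file object is finalized...
os.fstat(fd)             # ...and this raises OSError: the descriptor was closed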

So far so good. Here's where the interesting issue is that I want to discuss. Suppose you're opening and closing lots of files really fast in Node.js or Python or some other garbage collected language. This will generate a lot of objects that need to be GC'ed. Despite there being a lot of these objects, the objects themselves are probably pretty small---just the actual object overhead plus a few bytes for the file descriptor and maybe a file name.

The garbage collector determines when it should run and actually collect objects based on a bunch of magic heuristics, but these heuristics are all related to memory pressure---e.g. how much memory it thinks the program is using, how many objects it thinks are collectable, how long it's been since a collection, or some other metric along these lines. The garbage collector knows how to count objects and track memory usage, but it doesn't know anything about external resources like file descriptors. So you can easily have hundreds or thousands of file descriptors ready to be closed while the GC thinks the amount of reclaimable memory is very small and decides it doesn't need to run yet. In other words, despite being close to running out of file descriptors, the GC doesn't realize that it can help the situation by reclaiming these file objects, because it only considers memory pressure.

This can lead to situations where you get errors like EMFILE when instantiating new file objects, because despite your program doing the "right thing", the GC is doing something weird.
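
Here's a contrived CPython sketch of the failure mode (the file name and loop count are made up). Objects caught in a reference cycle can't be freed by reference counting alone, so their descriptors stay open until the cyclic collector decides to run, and that decision is driven by allocation counts, not by how close the process is to its descriptor limit:

import gc

class Handle:
    def __init__(self, path):
        self.f = open(path)
        self.owner = self          # reference cycle: refcounting alone can't free this

for _ in range(1000):
    Handle("hello.txt")            # each iteration strands an open file descriptor

# The GC's decision about when to collect these is based on object counts and
# memory, not on open descriptors. Forcing a collection releases them immediately:
gc.collect()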

This gets a lot more insidious with other resources. Here's a classic example. Suppose you're writing a program in Python or Ruby or whatever else, and that program uses bindings to a fancy C library that does heavy processing for some task like linear algebra, computer vision, or machine learning. To be concrete, let's say it's using bindings to a C library that does really optimized linear algebra on huge matrices. The bindings make some calls into the C library to allocate a matrix when an object is instantiated, and likewise have a destructor callback to deallocate the matrix when the object is GC'ed. Since these are huge matrices, each one could easily be hundreds of megabytes, or even many gigabytes, in size, and all of that data will actually be faulted in and resident in memory.

So what happens is the Python GC is humming along, and it sees a PyObject that it thinks is really small, e.g. it might think that the object is only 100 bytes. But the reality is that the object holds an opaque handle to a 500 MB matrix allocated by the C library, and the Python GC has no way of knowing that, or even of knowing that 500 MB was allocated anywhere at all (the C library is probably using malloc(3) or its own allocator, while the Python VM uses its own memory allocator). So you can easily have a situation where the machine is low on memory, and the Python GC has gigabytes of these matrices ready to be collected, but it thinks it's just managing a few small objects and doesn't collect them in a timely manner. This example is a bit counterintuitive because it can appear like the language is leaking memory when it's actually just an impedance mismatch between how the kernel tracks memory for your process and how the VM's GC tracks memory.
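
To make the mismatch concrete, here's a minimal sketch (assuming a Linux system with glibc; the Matrix class is a made-up stand-in for a real binding) where the bulk of the memory lives on the C heap, completely outside the Python GC's view:

import ctypes, sys

libc = ctypes.CDLL("libc.so.6")               # assumes Linux/glibc
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

class Matrix:
    """Made-up stand-in for a binding whose storage lives in a C library."""
    def __init__(self, n_bytes):
        self._ptr = libc.malloc(n_bytes)      # invisible to the Python allocator and GC
    def __del__(self):
        libc.free(self._ptr)

m = Matrix(500 * 1024 * 1024)                  # ~500 MB on the C heap
print(sys.getsizeof(m))                        # the GC's view of m: a few dozen bytes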

Again, I don't know if this was the exact problem that my coworker had with npm, but it's an interesting thought experiment---if npm is opening and closing too many files really quickly, it's totally possible that the issue isn't a resource leak at all, but just this GC impedance mismatch.

Generally there's not a way to magically make the GC for a language know about these issues, because there's no way that it can know about every type of foreign resource, how important that resource is to collect, and so on. Typically what you should do in a language like Python or Ruby or JavaScript is to make sure that objects have an explicit close() method or similar, and that method will finalize the resource. Then if the developer really cares about when the resources are released they can manually call that method. If the developer forgets to call the close method then you can opt to do it automatically in the destructor.
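
A minimal sketch of that pattern (the class and file name here are invented for illustration):

import os

class ManagedFile:
    """Invented example: explicit close(), with the finalizer as a safety net."""
    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)

    def close(self):
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None

    def __del__(self):
        self.close()   # fallback if the caller forgot; the timing is up to the GC

Callers who care about timely release call close() themselves; everyone else eventually gets cleaned up by the GC, just not necessarily soon enough.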

C++ Solution

C++ has a really elegant solution to this problem that I want to talk about, since it inspired a similar solution in Python that I'm also going to talk about. The solution has the cryptic name RAII---Resource Acquisition Is Initialization. In my opinion this is a really confusing name, since most people really care more about finalization than initialization, but that's what it's called.

Here's how it works. C++ has a special kind of scoping called block scoping. This is something that a lot of people assume syntactically similar languages like JavaScript have, since those languages also have curly braces, but the scoping rules are actually quite different. Block scoping means that a variable is bound to the curly brace block it's declared in: as soon as execution leaves the enclosing braces, the variable is out of scope. This is different from JavaScript (or Python), where variables declared with var are scoped to the enclosing function (JavaScript's let and const are block scoped, but even then nothing special happens when the block exits). So a variable declared in a for loop inside of a function goes out of scope when the loop exits in C++, but a var declared the same way stays in scope until the function exits in JavaScript.

In addition to the block scoping concept, C++ also has a rule that says that when an object goes out of scope its destructor is called immediately. It goes even further and specifies the order in which destructors are invoked: if multiple objects go out of scope at once, their destructors are called in the reverse order of construction. So suppose you have a block like:

hello();
{
  Foo a;
  Bar b;
}
goodbye();

The following will happen, in order:

1. hello() is called.
2. a is constructed by running Foo's constructor.
3. b is constructed by running Bar's constructor.
4. The block is exited, so b's destructor runs, then a's destructor (reverse order of construction).
5. goodbye() is called.

This is guaranteed by the language. So here's how it works in the context of managing things like file resources. Suppose I want to have a simple wrapper for a file. I might implement code like this:

#include <fcntl.h>   // for open(2)
#include <unistd.h>  // for close(2)

class File {
 public:
  explicit File(const char *filename) : fd_(open(filename, O_RDONLY)) {}
  ~File() { close(fd_); }

 private:
  int fd_;
};

int foo() {
  File my_file("hello.txt");
  bar(my_file);
  return baz();
}

(Note: I'm glossing over some details like handling errors returned by open(2) and close(2), but that's not important for now.)

In this example, we open the file hello.txt, do some things with it, and then it automatically gets closed after the call to baz(). So what? This doesn't seem much better than explicitly opening the file and closing it. In fact, it would certainly have been less code to just call open(2) at the start of foo(), and then add one extra line at the end to close the file before calling baz().

Well, besides being error prone, there is another problem with that approach. What if bar() throws an exception? If we had an explicit call to close(2) at the end of the function, then an exception would mean that line of code would never be run. And that would leak the resource. The C++ RAII pattern ensures that the file is closed when the block scope exits, so it properly handles the case of the function ending normally, and also the case where some exception is thrown somewhere else to cause the function to exit without a return.

The C++ solution is elegant because once we've done the work to write the class wrapping the resource, we generally never need to explicitly close things, and we also get the guarantee that the resource is always finalized, and finalized in a timely manner. And it's automatically exception safe in all cases. Of course, this only works with resources that can be scoped to the stack, but that is true a lot more often than you might suspect.

This pattern is particularly useful with mutexes and other process/thread exclusion constructs where failure to release the mutex won't just cause a resource leak but can cause your program to deadlock.

Python Context Managers

Python has a related concept called "context managers" and an associated syntax feature called a with statement.

I don't want to get too deep into the details of the context manager protocol, but the basic idea is that an object can be used in a with statement if it implements two magic methods called __enter__() and __exit__() which have a particular interface. When the with statement is entered the __enter__() method is invoked, and when the with statement is exited for any reason (an exception is thrown, a return statement is encountered, or the last line of code in the block is executed) the __exit__() method is invoked. Again, there are some details I'm eliding here related to exception handling, but for the purpose of resource management what's interesting is that this provides a similar solution to the C++ RAII pattern: when the with statement is used we can ensure that objects are automatically and safely finalized by making the finalization happen in the __exit__() method.
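
Here's a minimal sketch of the protocol, using an invented wrapper class and file name:

import os

class OpenFile:
    """Invented example of the context manager protocol."""
    def __init__(self, path):
        self.path = path

    def __enter__(self):
        self.fd = os.open(self.path, os.O_RDONLY)
        return self.fd

    def __exit__(self, exc_type, exc_value, traceback):
        os.close(self.fd)        # runs on normal exit, return, or exception
        return False             # don't suppress any exception that was raised

with OpenFile("hello.txt") as fd:
    data = os.read(fd, 1024)
# by this point the descriptor has been closed, no matter how the block was left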

The most important difference here compared to the C++ RAII approach is that you must remember to use the with statement to get the context manager semantics. With C++, RAII typically implies that the object is allocated on the stack, which means it is automatically reclaimed and there's no chance for you to forget to release the resource.

Go Defer

Go has a syntax feature called defer that lets you ensure that some code runs when the surrounding function returns. This is rather similar to the Python context manager approach, although the syntax works quite differently.

The thing that is really nice about this feature is that you can defer any function call you like, not just a predefined finalizer method. This makes the feature incredibly flexible---in fact, it is a lot more flexible than the approach that Python and C++ have.

There are a few downsides to this approach in my opinion.

The first downside is that like with Python, you have to actually remember to do it. Unlike C++, it will never happen automatically.

The second downside is that because it's so flexible, it has more potential to be abused or used in a non-idiomatic way. In Python, if you see an object being used in a with statement you know that the semantics are that the object is going to be finalized when the with statement is exited. In Go the defer statement probably occurs close to object initialization, but doesn't necessarily have to.

The third downside is that a defer statement isn't run until the function it's defined in exits. This is less powerful than C++ (because C++ destructors can be scoped to any block, not just a function body) and also less powerful than Python (because a with block can end before the enclosing function does).

I don't necessarily think this construct is worse than C++ or Python, but it is important to understand how the semantics differ.

JavaScript

JavaScript doesn't really have a true analog of the C++/Python/Go approaches, as far as I know. What you can do in JavaScript is use a try statement with a finally clause, and put your call to fileObj.close() (or whatever the actual interface is) in the finally clause. You can also use this approach in Python if you wish, since Python has the try/finally construct too.
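
For example, here's what the try/finally version looks like in Python (the JavaScript version has exactly the same shape; the file name is just for illustration):

f = open("hello.txt")
try:
    data = f.read()
finally:
    f.close()    # runs whether the try body finishes normally, returns, or raises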

As with Go's defer statement, it is the caller's responsibility to remember to do this in every case, and if you forget to do it in one place you can have resource leaks. In a lot of ways this is less elegant than Go because the finalization code is separated from the initialization code, which makes the code harder to follow in my opinion.