Large Data Files

Dealing with large data files is something of a more advanced programming topic, and one that I've found a lot of my younger coworkers don't know much about. I'd like to share a bit of wisdom on the topic in this file.

Let's say you have a file that is some reasonable fraction of the size of the memory on a computer. As a concrete example, let's say you have a 4GB JSON file on a machine with 8GB of RAM.

You might think it would be safe to do something like this with it in Python:

import json

with open('very_large_file.json') as large_file:
    data = json.load(large_file)
do_stuff_with(data)

This is a valid Python program, but it's probably going to end disastrously. That JSON file is 4GB of text, but when it's loaded into Python it's going to turn into a lot of integer objects and string objects and list objects and dictionary objects and who knows what, and some of those objects might be pretty big. In C an int is frequently 4 bytes, but in Python that innocuous int becomes a full heap-allocated PyObject that is quite a bit bigger. In my experience, if you do something like this you'll end up using some small integer multiple of the size of the original file in actual memory, so loading this file might require 8-12GB of memory after all is said and done. Some other languages like JavaScript might have slightly better memory efficiency since JSON is part of their raison d'être, but even there the single character '4' in the file is much smaller than the in-memory representation of the number 4. It all depends on exactly what data types you're storing, how much of the JSON file is whitespace and field delimiters, and a few other things, but a safe bet is that the loaded representation will be much larger than 4GB, and will likely exceed the total system memory and cause swapping.
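
To get a feel for where the overhead comes from, you can compare the bytes a value occupies in the JSON text with the size of the Python object it becomes. This is just a rough illustration; the exact numbers vary by interpreter version and platform, but the pattern holds:

import sys

# A handful of bytes of JSON text becomes a full heap object with headers
# and padding once loaded. Numbers below are approximate for 64-bit CPython.
print(len('4'), sys.getsizeof(4))           # 1 byte of text vs. ~28 bytes in memory
print(len('"abc"'), sys.getsizeof('abc'))   # 5 bytes of text vs. ~50 bytes in memory
print(sys.getsizeof([]), sys.getsizeof({})) # even empty containers cost tens of bytes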

Things will obviously be even worse if you have a data file that's much bigger than your RAM. For instance, say you still have your 8GB of memory and have to process a data file that's 100GB. The json.load or json.loads methods you get out of the box clearly aren't going to work at all in this case.

Also, I'm using JSON as an example here only because it's in vogue right now, but the same class of problems applies to XML and most other data interchange formats (YAML, ASN.1, Protobuf, Thrift, etc.).

This is also a problem when generating huge files. Suppose, hypothetically, that you have a SQL database and you want to dump all of it, or some subset of it, into JSON or XML or something like that. Relational databases are designed to handle data sets much larger than resident memory, so it's common to end up in a situation where a naive program will exceed all of the memory on the system and cause swapping or an OOM kill while generating a large data file.
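
Here's a minimal sketch of the contrast on the generation side, using sqlite3 and a made-up database and table purely for illustration (the real dump format comes later; this just writes one JSON record per line):

import json
import sqlite3

# Hypothetical database and table names, purely for illustration.
conn = sqlite3.connect('big_database.db')

# Naive version: fetchall() materializes every row in memory at once, and
# building one giant string with json.dumps doubles the pain.
#   rows = conn.execute('SELECT id, name FROM events').fetchall()
#   open('dump.json', 'w').write(json.dumps(rows))

# Streaming version: the cursor yields one row at a time, and each row is
# serialized and written immediately, so memory use stays flat.
with open('dump.jsonl', 'w') as out:
    for row in conn.execute('SELECT id, name FROM events'):
        out.write(json.dumps({'id': row[0], 'name': row[1]}) + '\n')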

Fortunately, a long time ago smart people thought of a way to handle this: streaming, event-driven parsing and generation. People generally use the Java/XML terminology here because parsing and generating big XML feeds used to be a big thing (and still is), but the model applies to any programming language.

The two dominant paradigms for parsing or generating a big XML/JSON/whatever are the DOM or ElementTree model, and the SAX model. SAX ("Simple API for XML") was originally the name of a particular event-driven XML parsing API for Java, but the name has stuck as shorthand for the whole style.

In the DOM or ElementTree model you get an object that represents the whole document and you just do what you want with it. This is analogous to working with the DOM in browser-side JavaScript. It's also what you get when you use json.loads or json.dumps in Python with JSON, or what you'd get with lxml.etree in Python with XML. It's definitely the easiest model to work with, because you just have a big object and access methods on it the usual way. This model is simple, but it only works well when your data set is small relative to memory.
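
For XML, the same model looks like this with the standard library's ElementTree (the file name and tag names are made up for illustration; lxml.etree has essentially the same interface):

import xml.etree.ElementTree as ET

# DOM/ElementTree style: the whole document is parsed into one in-memory tree.
tree = ET.parse('catalog.xml')
root = tree.getroot()

for item in root.findall('item'):
    print(item.get('id'), item.findtext('name'))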

The SAX model is totally different. The basic idea is that when parsing, you get a callback for each token encountered in the document, and keeping state is up to you. Generating a document is the mirror image: you call a bunch of methods that emit tokens, and the library handles turning that into valid JSON/XML/whatever. If the library you're using is in an object-oriented language like C++, Python, or Java you'll often subclass a handler class and override its methods, whereas in C you'll generally just register callbacks with function pointers. In either case the idea is pretty much the same.
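
Python's standard library happens to ship a classic SAX parser for XML, so here's a minimal sketch of the subclass-and-override style (the file and element names are the same made-up ones as above; a real handler would track whatever state it needs across callbacks):

import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements without ever holding the whole document in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; keeping state across callbacks is up to us.
        if name == 'item':
            self.count += 1

handler = ItemCounter()
xml.sax.parse('catalog.xml', handler)  # streams the file through the handler
print(handler.count)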

I ran into this recently when I had to generate a huge data feed that's currently about 1GB per data file, but might be much larger in the future (up to 1TB, for now). The hosts this script runs on actually have a fair amount of free memory, so I could have just done things the simple way; but as they say, "every byte is sacred," and I wanted my program to be able to generate and parse huge data files efficiently.

The first version of the program I wrote emitted all of the data into a CSV file. CSV is great because you can just emit one line at a time, so it's not memory hungry if you write your code iteratively. Same with parsing: you can read one line at a time, so if you write your code correctly it will be efficient. However, I ran into a bunch of problems with CSV, the biggest being that it's a flat format and the data I needed to represent is hierarchical.
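
For reference, the line-at-a-time CSV pattern I'm talking about looks roughly like this (the file name and fields are made up); only one row is ever held in memory on either the writing or the reading side:

import csv

# Writing: emit one row at a time.
with open('events.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['id', 'name', 'value'])  # header row
    for i in range(1000000):
        writer.writerow([i, 'name-%d' % i, i * 2])

# Parsing: read one row at a time.
with open('events.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        pass  # process each row as it streams by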

So I decided to encode the dump as a hierarchical JSON file. The format is basically like:

{
  "timestamp": 1443404160,
  "hostname": "foo",
  "databases": {
    "foo": {
      "tables": {
        "bar": [...]
      }
    },
    "quux": {
      "tables": {
        "quux1": [...],
        "quux2": [...]
      }
    }
  }
}

As you can see, it's easy to extend this to add more metadata, handle multiple databases, handle multiple tables, etc.
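This isn't how the real program does it (that's what YAJL is for, below), but to make the streaming idea concrete, here's a hand-rolled sketch that emits the structure above while only ever holding one row in memory. The rows_for() generator, the layout dict, and the output path are all made up for illustration:

import json

def rows_for(db, table):
    """Hypothetical row source; in the real program this would be a database cursor."""
    for i in range(3):
        yield {'id': i, 'db': db, 'table': table}

# Hypothetical mapping of databases to their tables, mirroring the format above.
layout = {'foo': ['bar'], 'quux': ['quux1', 'quux2']}

with open('dump.json', 'w') as out:
    out.write('{"timestamp": 1443404160, "hostname": "foo", "databases": {')
    for di, (db, tables) in enumerate(layout.items()):
        if di:
            out.write(',')
        out.write('%s: {"tables": {' % json.dumps(db))
        for ti, table in enumerate(tables):
            if ti:
                out.write(',')
            out.write('%s: [' % json.dumps(table))
            for ri, row in enumerate(rows_for(db, table)):
                if ri:
                    out.write(',')
                out.write(json.dumps(row))  # one row serialized at a time
            out.write(']')
        out.write('}}')
    out.write('}}')

All of that comma and nesting bookkeeping is exactly the kind of thing a SAX-style generator library takes off your hands, which is where YAJL comes in.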

In C I've previously used YAJL. YAJL is awesome. It's really fast, and it's really easy to use. It also gives you a lot of knobs; for instance, you can trade off memory usage against encoding/decoding efficiency.

In Python I didn't really find any good solutions until I stumbled across yajl-py. Yajl-py is a really simple ctypes wrapper around YAJL. If you've already used YAJL, or if you're already used to SAX-style parsers and generators, it's a great solution, and you'll pick it up in minutes.

I have nothing but positive words to say about the yajl-py project, and the maintainer has been friendly and accepting as I've contributed a few small fixes and patches. I've also found the speed to be very good.

Note: I could have also used lxml for this project and generated XML instead, since lxml already has a SAX-style interface and is very fast (it's built on libxml2, which is itself very performant). I mostly decided to use JSON because I prefer it as a file format (at least for the type of data I needed), a lot of younger programmers are more familiar with JSON, and I liked the control YAJL gave me over how buffering and flushing work (there may be an option for this in lxml too, but I didn't check too hard).
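
For completeness, here's roughly what the incremental route looks like on the lxml side using its etree.xmlfile incremental writer (a different interface from its SAX parser, but the same streaming idea on the generation side). This is just a sketch; the element names and row source are made up, and you'd want to check the lxml docs for the buffering details mentioned above:

from lxml import etree

def rows():
    """Hypothetical row source standing in for a database cursor."""
    for i in range(3):
        yield {'id': str(i), 'name': 'row-%d' % i}

# etree.xmlfile writes elements to disk as they are produced instead of
# building the whole tree in memory first.
with etree.xmlfile('dump.xml', encoding='utf-8') as xf:
    with xf.element('dump'):
        for row in rows():
            xf.write(etree.Element('row', **row))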