Dealing with large data files is something of a more advanced programming topic, and one that I've found a lot of my younger coworkers don't know much about. I'd like to share a bit of wisdom on the topic here.
Let's say you have a file that is some reasonable fraction of the size of the memory on a computer. As a concrete example, let's say you have a 4GB JSON file on a machine with 8GB of RAM.
You might think it would be safe to do something like this with it in Python:
import json

with open('very_large_file.json') as large_file:
    data = json.load(large_file)  # parses the whole file into one in-memory object
    do_stuff_with(data)
This is a valid Python program, but it's probably going to end disastrously. That JSON file is 4GB of text, but when it's loaded into Python it's going to turn into a lot of integer objects and string objects and list objects and dictionary objects and who knows what. Some of those objects might be pretty big. In C an int is frequently 4 bytes, but in Python that innocuous int object is stored in a PyObject * container that is quite a bit bigger. In my experience, if you do something like this you'll end up with some small integer multiple of the size of the original file in actual memory, so something like this might require 8-12GB of memory to load in Python after all is said and done. Some other languages like JavaScript might have slightly better memory efficiency since JSON is part of their raison d'être, but even then the single byte of JSON text representing the number 4 is much smaller than the in-memory object it gets parsed into. It all depends on exactly what data types you're storing, how much of the JSON file is whitespace and field delimiters, and a few other things, but a safe bet here is that the loaded JSON representation is going to be much larger than 4GB, and likely will exceed the total system memory and cause swapping.
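To make the overhead concrete, here's a quick check you can run yourself; the exact numbers vary by CPython version and platform, so treat them as illustrative:
import sys, json
value = json.loads('4')        # one byte of JSON text becomes a full int object
print(sys.getsizeof(value))    # typically 28 bytes on 64-bit CPython
print(sys.getsizeof('spam'))   # even a tiny str carries ~50 bytes of overhead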
Clearly things will be even worse if you have a data file that's much bigger
than your resident RAM. For instance, say you still have your 8GB of memory and
have to process a data file that's 100GB. The json.load
or json.loads
methods you'd get out of the box clearly aren't going to work at all in this
case.
Also, I'm using JSON as an example here only because it's in vogue right now, but the same class of problems is true of XML and most other data interchange formats (YAML, ASN.1, Protobuf, Thrift, etc.).
This is also a problem when generating huge files. Suppose, hypothetically, that you have a SQL database and you want to dump all of it, or some subset of it, into JSON or XML or something like that. Relational databases are designed to handle data sets much larger than resident memory, so it's common to end up in a situation where a naive program will exceed all of the memory on the system and cause swapping or an OOM situation when generating a large data file.
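The database side of this is usually the easy part, since you can stream rows in batches instead of pulling the whole result set at once. Here's a minimal sketch, assuming a DB-API-style connection and a made-up table name; the part that's actually hard is making the JSON/XML writer stream as well:
def dump_table(conn, emit):
    # Stream rows in fixed-size batches with fetchmany() instead of
    # materializing everything with fetchall(), so memory use stays bounded.
    cur = conn.cursor()
    cur.execute('SELECT * FROM bar')   # 'bar' is a hypothetical table name
    while True:
        batch = cur.fetchmany(1000)
        if not batch:
            break
        for row in batch:
            emit(row)                  # the caller decides how to serialize each row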
Fortunately, a long time ago smart people thought of a way to handle this---generative parsing. People generally use the Java/XML terminology here because parsing and generating big XML feeds used to be a big thing (and still is), but the model here applies to any programming language.
The two dominant paradigms for parsing or generating a big XML/JSON/whatever are called the DOM or ElementTree model, and the SAX model. SAX was, I believe, originally the name of a particular implementation of an event-driven XML parser/generator library, but the name has stuck for the model in general.
In the DOM or ElementTree model you get an object that represents the whole document and you just do what you want with it. This is analogous to working with the DOM in browser-side JavaScript. It's also what you get when you use json.loads or json.dumps in Python with JSON, or what you'd get with lxml.etree in Python with XML. It's definitely the easiest model to work with, because you just have a big object and access methods on it the usual way. This model is simple, but only works well when you have relatively small data sets.
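For example, this is roughly what the tree model looks like with the standard library's xml.etree.ElementTree (the file and element names here are made up):
import xml.etree.ElementTree as ET
tree = ET.parse('dump.xml')      # the whole document gets parsed into memory at once
root = tree.getroot()
for row in root.iter('row'):     # then you can walk the tree any way you like
    print(row.attrib)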
The SAX model is totally different. The basic idea is that when parsing, you get a callback for each token encountered in the document; keeping state is up to you. Same with generating a document: you call a bunch of methods that emit tokens, and the library handles turning that into valid JSON/XML/whatever. If the library you're using is in an object-oriented language like C++, Python, or Java, you'll often subclass a handler class and override methods on it, whereas in C you'll generally just register callbacks with function pointers. In either case the idea is pretty much the same.
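Here's roughly what that looks like for parsing with the standard library's xml.sax module, again with made-up file and element names:
import xml.sax

class RowCounter(xml.sax.ContentHandler):
    # Counts <row> elements without ever building a tree; all state is kept by us.
    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        if name == 'row':
            self.rows += 1

handler = RowCounter()
xml.sax.parse('dump.xml', handler)   # streams the file through the callbacks
print(handler.rows)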
I ran into this recently where I had to generate a huge data feed that is currently going to be about 1GB per data file, but might be much larger in the future (up to 1TB for now). The hosts this script runs on actually have a fair amount of free memory, so I could have just done things the simple way; but as they say, "every byte is sacred", and I wanted my program to be able to generate and parse huge data files efficiently.
The first version of the program I wrote emitted all of the data into a CSV file. CSV is great because you can just emit one line at a time, so it's not memory hungry if you write your code iteratively. Same with parsing---you can parse one line at a time, so if you write your code correctly it will be efficient. However, I ran into a bunch of problems:
- CSV has poor (or at least, non-standard) escaping mechanisms, and some of the data I had to include was binary or had weird characters; so I ended up hex and base64 encoding a lot of text.
- CSV has no notion of types, so when parsing the file back you need to know that even though column 4 contains strings like '12345' and '666', they actually represent the integers 12345 and 666.
- The real kicker for me was that the originally flat data I was generating ended up being somewhat hierarchical. What I was doing was dumping a sampled subset of one database and one table in that database. At first the database and table went in the filename of the .csv file and the column format was already well known. Later, I realized that I was going to have to end up dumping multiple databases and multiple tables within each database. So my options were:
  - Stay with one generated .csv file and add a column for the database name and a column for the table name. This would work, but would result in a huge amount of data size bloat for a tiny amount of metadata, since the cardinality of those columns would be very low but they'd always be repeated.
  - Keep the original file format and create multiple CSV files, one for each database/table dumped, with the names encoded in the filename as before.
- The more I thought about it, the more I realized that I sort of wanted to add some other metadata (e.g. the hostname of the host the dump came from), and trying to encode all of that data in the filename, or adding in all of the data bloat from the extra fields, made me realize that CSV was the wrong answer.
So I decided to encode the dump as a hierarchical JSON file. The format is basically like:
{
  "timestamp": 1443404160,
  "hostname": "foo",
  "databases": {
    "foo": {
      "tables": {
        "bar": [...]
      }
    },
    "quux": {
      "tables": {
        "quux1": [...],
        "quux2": [...]
      }
    }
  }
}
As you can see, it's easy to extend this to add more metadata, handle multiple databases, handle multiple tables, etc.
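The trick when generating is to never hold the whole structure in memory while writing it. As a rough illustration of the idea (this isn't the code I actually ended up with; the generator library I settled on, described below, does this kind of bookkeeping for you across the whole document), here's how you might stream just one table's row array using nothing but the standard json module:
import json

def write_table(out, rows):
    # Emit a JSON array one element at a time instead of json.dump(list(rows), out),
    # so only one row is in memory at a time; `rows` can be any iterator, e.g. a DB cursor.
    out.write('[')
    first = True
    for row in rows:
        if not first:
            out.write(',')
        json.dump(row, out)   # serialize just this row
        first = False
    out.write(']')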
In C I've previously used YAJL. YAJL is awesome. It's really fast, and it's really easy to use. It also gives you a lot of knobs; for instance, you can trade memory usage against encoding/decoding efficiency.
In Python I didn't really find any good solutions, until I stumbled across yajl-py. Yajl-py is a really simple wrapper around YAJL that uses ctypes. If you've already used YAJL, or if you're already used to SAX-style parsers and generators, it's a great solution, and you'll pick it up in minutes.
I have nothing but positive words to say about the yajl-py project, and the maintainer has been friendly and accepting as I've contributed a few small fixes and patches. I've also found the speed to be very good.
Note: I could have also used lxml for this project to generate XML instead, since lxml already has a SAX-style interface and is very fast (it uses libxml2, which is itself very performant). I mostly decided to use JSON because I prefer it as a file format (at least for the type of data that I needed), a lot of younger programmers are more familiar with JSON, and I liked the control I got over how buffering and flushing work with YAJL (although there may be an option for this in lxml, I didn't check too hard).
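If you do go the XML route, the standard library (and lxml, which exposes the same interface) also has iterparse, which is a nice incremental middle ground: you get elements as they complete, and you clear them as you go so memory stays bounded. A quick sketch with made-up names:
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse('dump.xml', events=('end',)):
    if elem.tag == 'row':
        print(elem.attrib)   # do whatever per-row processing you need here
        elem.clear()         # drop the element's contents so the tree doesn't grow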