Dealing with large data files is a somewhat advanced programming topic, and one that I’ve found many of my younger coworkers don’t know much about. I’d like to share a bit of wisdom on the topic in this post.
Let’s say you have a file that is some reasonable fraction of the size of the
memory on a computer. As a concrete example, let’s say you have a 4GB JSON file
on a machine with 8GB of RAM.
You might think it would be safe to do something like this with it in Python:

import json

with open('very_large_file.json') as large_file:
    data = json.load(large_file)
This is a valid Python program, but it’s probably going to end disastrously.
That JSON file is 4GB of text, but when it’s loaded into Python it’s going to
turn into a lot of integer objects and string objects and list objects and
dictionary objects and who knows what. Some of those objects might be pretty
big. In C an int is frequently 4 bytes, but in Python that innocuous object is
stored in a PyObject * container that is quite a bit bigger. In my experience,
if you do something like this you’ll end up with some small integer multiple of
the size of the original file in actual memory, so a file like this might
require 8-12GB of memory to load in Python after all is said and done.
JSON libraries do try to be smart about memory
efficiency, since JSON is part of their raison d’être, but even then the string
‘4’ in the file is probably much smaller than the parsed representation of the
JSON number 4. It all
depends on exactly what data types you’re storing, how much of the JSON file is
whitespace and field delimiters, and a few other things, but a safe bet here is
that the loaded JSON representation is going to be much larger than 4GB, and
likely will exceed the total system memory and cause swapping.
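You can see this overhead directly with sys.getsizeof. This is just a rough illustration (exact numbers vary by Python version and platform), but the shape of the problem is clear:

```python
import sys

# A C int is often 4 bytes; a CPython int is a full heap object with a
# reference count and type pointer, so even tiny numbers cost far more.
print(sys.getsizeof(4))          # typically 28 bytes on 64-bit CPython

# Short strings pay a fixed per-object header cost too.
print(sys.getsizeof("4"))

# Containers add a pointer per element on top of the elements themselves.
print(sys.getsizeof([1, 2, 3]))
```

Multiply that per-object overhead across billions of small values in a 4GB file and the blow-up described above follows naturally.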
Clearly things will be even worse if you have a data file that’s much bigger
than your resident RAM. For instance, say you still have your 8GB of memory and
have to process a data file that’s 100GB. The parsing
methods you’d get out of the box clearly aren’t going to work at all in this
situation.
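The way out is to process the file incrementally rather than loading it whole. Here’s a minimal sketch of the idea (the chunk size and the throwaway demo file are my own choices, not anything from a real feed):

```python
import os
import tempfile

def process_in_chunks(path, chunk_size=1 << 20):
    """Yield a file in fixed-size chunks so memory use stays bounded
    no matter how big the file is."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Demo on a small throwaway file; the same loop works unchanged on 100GB.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * (3 * 1024 * 1024))

total = sum(len(chunk) for chunk in process_in_chunks(tmp.name))
print(total)  # 3145728: every byte seen, never more than 1 MiB resident
os.unlink(tmp.name)
```

Of course, raw byte chunks don’t respect record boundaries in a structured format like JSON, which is exactly why the event-driven parsers discussed below exist.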
Also, I’m using JSON as an example here only because it’s in vogue right now,
but the same class of problems is true of XML and most other data interchange
formats (YAML, ASN.1, Protobuf, Thrift, etc.).
This is also a problem when generating huge files. Suppose, hypothetically,
that you have a SQL database and you want to dump all of it, or some subset of
it, into a JSON or XML file or something like that. Relational databases are
designed to handle data sets much larger than resident memory, so it’s common
to be in a situation where a naive program will exceed all of the memory on the
system and cause swapping or an OOM situation when generating a large data
file.
Fortunately, a long time ago smart people thought of a way to handle this:
event-driven parsing. People generally use the Java/XML terminology here
because parsing and generating big XML feeds used to be a big thing (and still
is), but the model applies to any programming language.
The two dominant paradigms for parsing or generating a big XML/JSON/whatever
file are called the DOM or ElementTree model, and the SAX model. SAX was, I
believe, originally the name of a particular implementation of an event-driven
XML parser, but the term is now used generically for this style of API.
In the DOM or ElementTree model you get an object that represents the whole
document and you just do what you want with it. This is analogous to working
with json.load and json.dumps in Python with JSON, or what you’d get with
lxml.etree in Python with XML. It’s definitely the easiest model to work with,
because you just have a big object and access methods on it the usual way. This
model is simple, but only works well when you have relatively small data sets.
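For concreteness, here’s what the DOM/ElementTree style looks like with the standard library’s xml.etree.ElementTree (the document here is a made-up example): the whole document becomes one in-memory tree, and then access is trivial.

```python
import xml.etree.ElementTree as ET

xml_doc = "<users><user id='1'>alice</user><user id='2'>bob</user></users>"

# The entire document is parsed into one in-memory tree up front...
root = ET.fromstring(xml_doc)

# ...after which you just walk the object like any other Python data.
names = [user.text for user in root.findall('user')]
print(names)  # ['alice', 'bob']
```

The convenience is real, but so is the cost: the tree for a 4GB document has to exist in memory all at once.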
The SAX model is totally different. The basic idea is that when parsing, you get
a callback for each token encountered in the document. Keeping state is up to
you. The same goes for generating a document: you call a bunch of methods that
emit tokens, and the library handles turning that into valid
JSON/XML/whatever. If the library you’re using is in an object-oriented
language like C++, Python, or Java you’ll often subclass a handler class and
override methods on it, whereas in C you’ll generally just register callbacks
with function pointers. In either case the idea is pretty much the same.
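The standard library’s xml.sax module shows the subclass-and-override flavor of this. This sketch counts elements in a made-up document; note that all the state (the counter) lives in the handler, exactly as described above:

```python
import xml.sax
from io import BytesIO

class UserCounter(xml.sax.ContentHandler):
    """Count <user> elements without ever building a full tree."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; keeping state is up to us.
        if name == 'user':
            self.count += 1

handler = UserCounter()
xml.sax.parse(BytesIO(b"<users><user/><user/><user/></users>"), handler)
print(handler.count)  # 3
```

Because the parser only ever holds the current token, memory use stays flat whether the input is 3 elements or 3 billion.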
I ran into this recently when I had to generate a huge data feed that is
currently about 1GB per data file, but might potentially be much larger in the
future (up to 1TB, for now). The hosts this script runs on actually
have a fair amount of free memory, so I could have just done things the simple
way; but as they say, “every byte is sacred”, and I wanted my program to be able
to generate and parse huge data files efficiently.
The first version of the program I wrote emitted all of the data into a CSV
file. CSV is great because you can just emit one line at a time, so it’s not
memory hungry if you write your code iteratively. Same with parsing—you can
parse one line at a time, so if you write your code correctly it will be
efficient. However, I ran into a bunch of problems:
- CSV has poor (or at least, non-standard) escaping mechanisms, and some of the
data I had to include was binary or had weird characters; so I ended up hex
and base64 encoding a lot of text.
- CSV has no notion of types, so when parsing the data back out you need to
  know that at column 4, even though you see strings like ‘12345’ and ‘666’,
  they actually represent the integers 12345 and 666.
- The real kicker for me was that the originally flat data that I was
  generating ended up being somewhat hierarchical. What I was doing was dumping
  a sampled subset of one database and one table in that database. At first the
  database and table went in the filename of the .csv file, and the column
  format was already well known. Later, I realized that I was going to have to
  end up dumping multiple databases and multiple tables within each database.
  So my options were:
  - Stay with one generated .csv file and add a column for the database name
    and a column for the table name. This would work, but would result in a
    huge amount of data size bloat for a tiny amount of metadata, since the
    cardinality of those columns would be very low but they’d always be
    present in every row.
  - Keep the original file format and create multiple CSV files, one for each
    database/table dumped, with the names encoded in the filename as before.
- The more I thought about it, the more I realized that I sort of wanted to
  add some other metadata (e.g. the hostname of the host the dump came from),
  and trying to encode all of that data in the filename, or adding in all of
  the data bloat from the extra fields, made me realize that CSV was the wrong
  format for the job.
So I decided to encode the dump as a hierarchical JSON file.
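The original example of the format isn’t reproduced here, but a hypothetical sketch of such a hierarchical dump might look like the following (the field names, hostname, and table data are all illustrative, not the actual feed format):

```python
import json

# Illustrative structure: metadata at the top level, then one entry per
# database, each containing one entry per table with its columns and rows.
dump = {
    "hostname": "db1.example.com",   # hypothetical metadata field
    "databases": {
        "accounts": {
            "users": {
                "columns": ["id", "name"],
                "rows": [[1, "alice"], [2, "bob"]],
            },
        },
    },
}

print(json.dumps(dump, indent=2))
```

With a SAX-style generator, the same structure can be emitted token by token instead of being built in memory first.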
As you can see, it’s easy to extend this to add more metadata, handle multiple
databases, handle multiple tables, etc.
In C I’ve previously used YAJL. YAJL is awesome. It’s really fast, and it’s
really easy to use. It also gives you a lot of knobs; for instance, you can
trade off memory usage against encoding/decoding speed.
In Python I didn’t really find any good solutions until I stumbled across
yajl-py. Yajl-py is a really simple wrapper around YAJL that uses ctypes. If
you’ve already used YAJL, or if you’re already used to SAX style parsers and
generators, it’s a great solution, and you’ll pick it up in minutes.
I have nothing but positive words to say about the yajl-py project, and the
maintainer has been friendly and accepting as I’ve contributed a few small fixes
and patches. I’ve also found the speed to be very good.
Note: I could have also used lxml for this project and generated XML instead,
since lxml already has a SAX style interface and is very fast (it uses the
libxml2 library, and libxml2 is itself very performant). I mostly decided to
use JSON because I prefer it as a file format (at least for the type of data
that I needed), a lot of younger programmers are more familiar with JSON, and
I liked the control I got over how buffering and flushing would work with YAJL
(although there may be an option for this in lxml, I didn’t check too hard).