One of the really great things about Tensorflow is how easy it makes offloading computations to the GPU. Tensorflow can do this more or less automatically if you have an Nvidia GPU and the CUDA tools and libraries installed. But just because Tensorflow offloads computations to the GPU doesn't mean you'll get good performance. In fact, it's not uncommon to get significantly worse performance when using a GPU than you would if you ran your compute graphs on the CPU.
There are two main reasons that using a GPU can be slower than the CPU:
- Launching a CUDA kernel has a higher baseline overhead than launching a CPU kernel, by about 5x.
- Naïve programs may end up transferring a large amount of data back and forth between main memory and GPU memory during each training epoch.
Writing big, expensive network models is easy, so usually the first point isn't the problem. It's much more common to run into problems where data is unnecessarily being copied back and forth between main memory and GPU memory. This is the same problem that OpenGL programmers have faced for years: copying vertex data between main memory and the GPU is expensive, so a big part of writing high performance OpenGL code is figuring out how to keep vertex data on the GPU.
There are two ways to copy NumPy arrays from main memory into GPU memory:
- You can pass the array to a Tensorflow session using a `feed_dict`.
- You can use `tf.constant()` to load the array into a `tf.Tensor`.
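To make the difference concrete, here's a minimal sketch of both approaches (the names `x_np`, `x_ph`, and `x_const` are just illustrative and aren't part of the example later in this post):
import numpy as np
import tensorflow as tf
x_np = np.random.rand(100, 3).astype(np.float32)
# Approach 1: feed_dict. The array is copied from main memory to the device
# on every call to sess.run().
x_ph = tf.placeholder(tf.float32, shape=(100, 3))
sum_ph = tf.reduce_sum(x_ph)
# Approach 2: tf.constant(). The array is embedded in the graph and copied
# to the device once.
x_const = tf.constant(x_np)
sum_const = tf.reduce_sum(x_const)
with tf.Session() as sess:
    print(sess.run(sum_ph, feed_dict={x_ph: x_np}))  # copies x_np on each call
    print(sess.run(sum_const))                       # no per-call copy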
Most of the models and tutorials you'll find online use the first approach, copying the data using a `feed_dict`. This always copies the data from main memory to the GPU. For huge datasets that can't entirely fit onto the GPU, this is often fine. For instance, if you have hundreds of gigabytes of image or video data, your dataset will vastly exceed the available space in the GPU, so it's easy to fill the GPU with each mini-batch. Furthermore, contemporary CNNs are quite deep and fairly expensive to run, so the memory transfer overhead is low compared to how long the compute graph takes to run. However, if you're dealing with smaller datasets that can fit entirely in GPU memory (e.g. with text or numeric datasets), you can get much better performance by using `tf.constant()` to pin your dataset into GPU memory. The problem with doing this is that neural networks tend to overfit the training data unless it's split up into mini-batches, so reusing the same `tf.constant()` for each training epoch will lead to poor generalization.
After a lot of internet sleuthing, I found a cryptic StackOverflow answer suggesting a clever solution to this problem: load the entire dataset using `tf.constant()`, and then use `tf.slice()` to grab mini-batches from the constant. For instance, let's say you have an Nvidia GPU with 8 GB of memory, and your dataset is smaller than 8 GB. During training, you want to split the dataset into 100 mini-batches. The idea is that in each training epoch you would pass the slice indexes into the session via a `feed_dict`, and then the compute graph you've written would use `tf.slice()` to generate the mini-batch. With this approach, the only data sent via the `feed_dict` is the slice indexes, which are small scalar values. The idea is really elegant, but I found actually figuring out how to implement it to be kind of tricky.
Example Code
I'm going to demonstrate this technique with a small Python 3 program that generates mini-batches from a `tf.constant()` using the `tf.slice()` operator. I've also created a GitHub repo with the full code for this example, if you want something you can download and actually run locally.
To keep the code simple, we're going to write a Tensorflow compute graph that applies a simple numeric operation to a small 10⨯3 matrix:
import numpy as np
# Height of our input data.
HEIGHT = 10
# The size of each mini-batch.
BATCH_SIZE = 2
# Create a 10x3 matrix in numpy; this lives in main memory (*not* the GPU).
np_data = np.array(range(30), dtype=np.float32).reshape(10, 3)
The code above will create a NumPy array called `np_data` that looks like this:
# Contents of np_data.
array([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.],
[ 9., 10., 11.],
[ 12., 13., 14.],
[ 15., 16., 17.],
[ 18., 19., 20.],
[ 21., 22., 23.],
[ 24., 25., 26.],
[ 27., 28., 29.]], dtype=float32)
For this demo our mini-batches will have size 2, meaning that they will be 2⨯3 matrices. The first mini-batch would be equivalent to `np_data[:2]`, the second mini-batch would be equivalent to `np_data[2:4]`, and so on. Of course, we won't actually be using NumPy slicing; instead we'll be using Tensorflow operators.
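Just for intuition, here's what those first two NumPy slices contain (this snippet isn't part of the compute graph we're about to build):
# For illustration only: the first two mini-batches, sliced with NumPy.
print(np_data[:2])   # rows 0 and 1: [[0, 1, 2], [3, 4, 5]]
print(np_data[2:4])  # rows 2 and 3: [[6, 7, 8], [9, 10, 11]]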
The next step is to copy `np_data` into Tensorflow's data graph. Tensorflow will automatically use a GPU if available, but you can also use a `tf.device()` context to force the location.
import tensorflow as tf
# Copy the numpy data into TF memory as a constant var; this will be copied
# exactly one time into the GPU (if one is available).
tf_data = tf.constant(np_data, dtype=tf.float32)
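If you want to be explicit about placement, a sketch of forcing it looks like this (the device string `'/gpu:0'` assumes you have at least one GPU visible to Tensorflow):
# Alternative: explicitly pin the constant to the first GPU.
with tf.device('/gpu:0'):
    tf_data = tf.constant(np_data, dtype=tf.float32)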
Generating a mini-batch is done by supplying a batch index via a placeholder called `ix`, and then the mini-batch is generated using `tf.slice()` with the batch index:
# The index to use when generating our mini-batch.
ix = tf.placeholder(shape=(), dtype=tf.int32)
# The mini-batch of data we'll work on.
batch = tf.slice(tf_data, [BATCH_SIZE * ix, 0], [BATCH_SIZE, -1])
I found the documentation for `tf.slice()` to be pretty confusing, so I'll explain here in plain English how it works. The `begin` argument, which is `[BATCH_SIZE * ix, 0]` in the code above, is the index of the upper-left corner of the slice we're creating. The index is multiplied by `BATCH_SIZE` because the `ix` values are in the range 0 to 4, so they need to be scaled to get the true offset into the matrix. The `size` argument, which is `[BATCH_SIZE, -1]` in the code above, says how many rows to go down and how many columns to go right. The special value -1 means "all columns"; I could have also used 3 here, since that's the width of the matrix.
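For example, with `ix = 1` the `begin` argument works out to `[2, 0]` and `size` is `[2, -1]`, so `tf.slice()` returns rows 2 and 3 of `tf_data`, i.e. the same values as `np_data[2:4]`.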
The value we're going to calculate with our compute graph is the sum of the squares of the values in our mini-batch:
# The output of the Tensorflow graph.
outp = tf.reduce_sum(tf.square(batch))
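For the first mini-batch, for instance, the output would be 0² + 1² + 2² + 3² + 4² + 5² = 55.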
For this demonstration, we'll run the compute graph 100 times. We'll also shuffle the batch order. Since this example doesn't actually train anything, the shuffling isn't strictly necessary. However, in a real neural network shuffling the mini-batch order is helpful, since it helps fight any locality patterns in the input data (e.g. if earlier batches tend to have small numeric values, and later batches tend to have larger numeric values, as is the case with our sequential data here). Shuffling the data this way can help combat overfitting:
import random
# Number of epochs to train for.
EPOCHS = 100
# Shuffle the indexes of mini-batches, so that the mini-batches are generated
# in a random order. This helps break locality in the structure of the training
# dataset, which can help with overfitting.
INDEXES = list(range(HEIGHT // BATCH_SIZE))
random.shuffle(INDEXES)
The training loop is very simple. All it does is pass the batch index (a single 32-bit integer) into a Tensorflow session:
# Create and initialize a TF session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(EPOCHS):
        for i in INDEXES:
            # Run the computation. The only data in the feed_dict is a single
            # 32-bit integer we supply here. All of the data needed for the
            # mini-batch already lives in GPU memory, and doesn't need to be
            # copied from main memory.
            b, o = sess.run([batch, outp], feed_dict={ix: i})
            print('epoch = {}, ix = {}'.format(epoch, i))
            print('batch: {}'.format(b))
            print('output: {}'.format(o))
There are a lot of variations that you can make on this same theme. Here are a few I thought of while writing this post:
- If the number of input records isn't evenly divisible by the batch size, the final mini-batch will be smaller than the other ones. The easiest way to handle this case is by supplying two index parameters (start and length); there's a sketch of this right after the list.
- For models where the training data set is way too large to fit in GPU memory, but the size of the labels is small, you can pin just the training labels in GPU memory. This is a somewhat common access pattern with image or video classification tasks.
- When the training set is too large for GPU memory, but the mini-batch sizes are relatively small, you could try filling the GPU with multiple mini-batches at once rather than copying one mini-batch each epoch. This might help amortize the transfer time.
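As a minimal sketch of the first variation, reusing the same `tf_data` as above (the placeholder names `start` and `length` are hypothetical, not part of the example code):
# Hypothetical variation: feed an explicit start row and row count, so the
# final mini-batch can be shorter when HEIGHT isn't divisible by BATCH_SIZE.
start = tf.placeholder(shape=(), dtype=tf.int32)
length = tf.placeholder(shape=(), dtype=tf.int32)
ragged_batch = tf.slice(tf_data, [start, 0], [length, -1])
# e.g. sess.run(ragged_batch, feed_dict={start: 8, length: 2})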
Because this technique tends to make designing models more complicated, I would suggest implementing it only after you're satisfied with the basic structure of your model. That's the best time to start looking at optimizing training times, and that's when I would consider employing this technique.