Using TensorFlow Lite to Speed up Predictions

Speed up predictions on individual records or small batches by converting a Keras/TensorFlow 2.1 model to TensorFlow Lite

Michael Wurm
2 min read · Mar 10, 2020

I recently found myself in a situation where I had to run predictions on large numbers of individual data records with Keras/TensorFlow 2.1 models, and creating batches was impractical.

Calling model.predict(input) incurs a large per-call overhead and is very slow unless it is applied to large batches. The recommended alternative of calling the model directly, model(input), improves performance, but it is still far from ideal. It turns out that we can do much better by converting the model to TensorFlow Lite.

For this benchmark I used the following simple model:

import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(384, activation='elu', input_shape=(256,)),
    tf.keras.layers.Dense(384, activation='elu'),
    tf.keras.layers.Dense(256, activation='elu'),
    tf.keras.layers.Dense(128, activation='elu'),
    tf.keras.layers.Dense(32, activation='tanh')
])
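
Converting the model to TensorFlow Lite is a one-off step. A minimal sketch using the standard TF 2.x converter API (the variable names are mine):

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()  # serialized TFLite flatbuffer (bytes)

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()  # must be called before the first invoke()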

Then I invoked it in several different ways (each variant is sketched in code after the list):

  1. kmodel-predict-batch: Call model.predict(input) on a complete batch.
  2. kmodel-predict-single: Call model.predict(input) in a loop on single records.
  3. kmodel-direct-single: Call model(input) in a loop on single records.
  4. lmodel-single: Use TensorFlow Lite in a loop on single records.
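
Roughly, the four variants look like this. This is a sketch rather than the exact benchmark code: the records array is illustrative, and interpreter is the TensorFlow Lite interpreter created above.

import numpy as np

records = np.random.rand(1000, 256).astype(np.float32)  # illustrative input data

# 1. kmodel-predict-batch: one predict() call on the whole batch
batch_out = model.predict(records)

# 2. kmodel-predict-single: predict() once per record
single_out = [model.predict(r[None, :]) for r in records]

# 3. kmodel-direct-single: call the model object directly once per record
direct_out = [model(r[None, :]).numpy() for r in records]

# 4. lmodel-single: TensorFlow Lite interpreter once per record
input_index = interpreter.get_input_details()[0]['index']
output_index = interpreter.get_output_details()[0]['index']
lite_out = []
for r in records:
    interpreter.set_tensor(input_index, r[None, :])
    interpreter.invoke()
    lite_out.append(interpreter.get_tensor(output_index).copy())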

Here are the results on a 6-core Intel i7-8750H CPU @ 2.20 GHz (Windows 10, Python 3.7, TensorFlow 2.1.0):

(Chart: execution time as a function of the record count for the different prediction methods)

The overhead of a call to model.predict(input) is 18ms, while a call to model(input) takes 1.3ms (a 14x speedup). A call to the TensorFlow Lite model takes 43us (an additional 30x speedup). However, the conversion of the Keras model to TensorFlow Lite takes approximately 2 seconds.

I did not see any difference with or without GPU support (presumably the GPU only comes into play during training). There was also only a negligible difference between the Keras model before and after compilation.

Below is the code used for this benchmark. The class LiteModel encapsulates the TensorFlow Lite model and provides a predict() method that works like the Keras equivalent. (Note that it would need to be modified for use with models that don’t just have a single 1-D input and a single 1-D output.)

Benchmark Code
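
A minimal sketch of such a wrapper, assuming a single float32 1-D input and a single float32 1-D output tensor (the class name and predict() follow the description above; the other names and details are illustrative, not necessarily the original gist):

import numpy as np
import tensorflow as tf


class LiteModel:

    @classmethod
    def from_keras_model(cls, kmodel):
        # Convert a Keras model and wrap the resulting interpreter.
        converter = tf.lite.TFLiteConverter.from_keras_model(kmodel)
        tflite_model = converter.convert()
        return cls(tf.lite.Interpreter(model_content=tflite_model))

    def __init__(self, interpreter):
        self.interpreter = interpreter
        self.interpreter.allocate_tensors()
        input_det = self.interpreter.get_input_details()[0]
        output_det = self.interpreter.get_output_details()[0]
        self.input_index = input_det['index']
        self.output_index = output_det['index']
        self.input_dtype = input_det['dtype']
        self.output_shape = output_det['shape']
        self.output_dtype = output_det['dtype']

    def predict(self, inp):
        # Keras-like batch interface: one invoke() per record.
        inp = inp.astype(self.input_dtype)
        count = inp.shape[0]
        out = np.zeros((count, self.output_shape[1]), dtype=self.output_dtype)
        for i in range(count):
            self.interpreter.set_tensor(self.input_index, inp[i:i + 1])
            self.interpreter.invoke()
            out[i] = self.interpreter.get_tensor(self.output_index)[0]
        return out

    def predict_single(self, inp):
        # Single record (1-D array or list) in, 1-D array out.
        inp = np.array([inp], dtype=self.input_dtype)
        self.interpreter.set_tensor(self.input_index, inp)
        self.interpreter.invoke()
        return self.interpreter.get_tensor(self.output_index)[0]

Usage then mirrors the Keras model:

lmodel = LiteModel.from_keras_model(model)
prediction = lmodel.predict_single(records[0])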

