End to End Guide: Converting and Benchmarking a Model

These steps will walk you through deploying a BNN with LCE. The guide starts by downloading, converting and benchmarking a model from Larq Zoo, and will then discuss the process for a custom model.

Picking a model from Larq Zoo

This example uses the QuickNet model from the sota submodule of larq-zoo. First, install the Larq Ecosystem pip packages:

pip install larq larq-zoo larq-compute-engine

Then, create a Python script that downloads QuickNet and prints the model summary:

import larq as lq
import larq_compute_engine as lce
import larq_zoo as lqz


# Load the QuickNet architecture and download the weights for ImageNet
model = lqz.sota.QuickNet(weights="imagenet")
lq.models.summary(model)
model.save("quicknet.h5")
+quicknet stats----------------------------------------------------------------------------------------------------+
| Layer                   Input prec.                 Outputs  # 1-bit  # 32-bit   Memory  1-bit MACs  32-bit MACs |
|                               (bit)                              x 1       x 1     (kB)                          |
+------------------------------------------------------------------------------------------------------------------+
| input_1                           -  ((None, 224, 224, 3),)        0         0        0           ?            ? |
| conv2d                            -       (-1, 112, 112, 8)        0       216     0.84           0      2709504 |
| depthwise_conv2d                  -        (-1, 56, 56, 64)        0       576     2.25           0      1806336 |
| batch_normalization               -        (-1, 56, 56, 64)        0       128     0.50           0            0 |
| quant_conv2d                      1        (-1, 56, 56, 64)    36864         0     4.50   115605504            0 |
| batch_normalization_1             -        (-1, 56, 56, 64)        0       128     0.50           0            0 |
| add                               -        (-1, 56, 56, 64)        0         0        0           ?            ? |
| quant_conv2d_1                    1        (-1, 56, 56, 64)    36864         0     4.50   115605504            0 |
| batch_normalization_2             -        (-1, 56, 56, 64)        0       128     0.50           0            0 |
| add_1                             -        (-1, 56, 56, 64)        0         0        0           ?            ? |
| max_pooling2d                     -        (-1, 28, 28, 64)        0         0        0           0            0 |
| quant_conv2d_2                    1        (-1, 28, 28, 64)    36864         0     4.50    28901376            0 |
| batch_normalization_3             -        (-1, 28, 28, 64)        0       128     0.50           0            0 |
| batch_normalization_4             -        (-1, 28, 28, 64)        0       128     0.50           0            0 |
| add_2                             -        (-1, 28, 28, 64)        0         0        0           ?            ? |
| concatenate                       -       (-1, 28, 28, 128)        0         0        0           ?            ? |
| quant_conv2d_3                    1       (-1, 28, 28, 128)   147456         0    18.00   115605504            0 |
| batch_normalization_5             -       (-1, 28, 28, 128)        0       256     1.00           0            0 |
| add_3                             -       (-1, 28, 28, 128)        0         0        0           ?            ? |
| quant_conv2d_4                    1       (-1, 28, 28, 128)   147456         0    18.00   115605504            0 |
| batch_normalization_6             -       (-1, 28, 28, 128)        0       256     1.00           0            0 |
| add_4                             -       (-1, 28, 28, 128)        0         0        0           ?            ? |
| max_pooling2d_1                   -       (-1, 14, 14, 128)        0         0        0           0            0 |
| quant_conv2d_5                    1       (-1, 14, 14, 128)   147456         0    18.00    28901376            0 |
| batch_normalization_7             -       (-1, 14, 14, 128)        0       256     1.00           0            0 |
| batch_normalization_8             -       (-1, 14, 14, 128)        0       256     1.00           0            0 |
| add_5                             -       (-1, 14, 14, 128)        0         0        0           ?            ? |
| concatenate_1                     -       (-1, 14, 14, 256)        0         0        0           ?            ? |
| quant_conv2d_6                    1       (-1, 14, 14, 256)   589824         0    72.00   115605504            0 |
| batch_normalization_9             -       (-1, 14, 14, 256)        0       512     2.00           0            0 |
| add_6                             -       (-1, 14, 14, 256)        0         0        0           ?            ? |
| quant_conv2d_7                    1       (-1, 14, 14, 256)   589824         0    72.00   115605504            0 |
| batch_normalization_10            -       (-1, 14, 14, 256)        0       512     2.00           0            0 |
| add_7                             -       (-1, 14, 14, 256)        0         0        0           ?            ? |
| quant_conv2d_8                    1       (-1, 14, 14, 256)   589824         0    72.00   115605504            0 |
| batch_normalization_11            -       (-1, 14, 14, 256)        0       512     2.00           0            0 |
| add_8                             -       (-1, 14, 14, 256)        0         0        0           ?            ? |
| max_pooling2d_2                   -         (-1, 7, 7, 256)        0         0        0           0            0 |
| quant_conv2d_9                    1         (-1, 7, 7, 256)   589824         0    72.00    28901376            0 |
| batch_normalization_12            -         (-1, 7, 7, 256)        0       512     2.00           0            0 |
| batch_normalization_13            -         (-1, 7, 7, 256)        0       512     2.00           0            0 |
| add_9                             -         (-1, 7, 7, 256)        0         0        0           ?            ? |
| concatenate_2                     -         (-1, 7, 7, 512)        0         0        0           ?            ? |
| quant_conv2d_10                   1         (-1, 7, 7, 512)  2359296         0   288.00   115605504            0 |
| batch_normalization_14            -         (-1, 7, 7, 512)        0      1024     4.00           0            0 |
| add_10                            -         (-1, 7, 7, 512)        0         0        0           ?            ? |
| quant_conv2d_11                   1         (-1, 7, 7, 512)  2359296         0   288.00   115605504            0 |
| batch_normalization_15            -         (-1, 7, 7, 512)        0      1024     4.00           0            0 |
| add_11                            -         (-1, 7, 7, 512)        0         0        0           ?            ? |
| quant_conv2d_12                   1         (-1, 7, 7, 512)  2359296         0   288.00   115605504            0 |
| batch_normalization_16            -         (-1, 7, 7, 512)        0      1024     4.00           0            0 |
| add_12                            -         (-1, 7, 7, 512)        0         0        0           ?            ? |
| activation                        -         (-1, 7, 7, 512)        0         0        0           ?            ? |
| average_pooling2d                 -         (-1, 1, 1, 512)        0         0        0           0            0 |
| flatten                           -               (-1, 512)        0         0        0           0            0 |
| dense                             -              (-1, 1000)        0    513000  2003.91           0       512000 |
| activation_1                      -              (-1, 1000)        0         0        0           ?            ? |
+------------------------------------------------------------------------------------------------------------------+
| Total                                                        9990144    521088  3255.00  1242759168      5027840 |
+------------------------------------------------------------------------------------------------------------------+
+quicknet summary-----------------------------+
| Total params                      10.5 M    |
| Trainable params                  10.5 M    |
| Non-trainable params              7.3 k     |
| Model size                        3.18 MiB  |
| Model size (8-bit FP weights)     1.69 MiB  |
| Float-32 Equivalent               40.10 MiB |
| Compression Ratio of Memory       0.08      |
| Number of MACs                    1.25 B    |
| Ratio of MACs that are binarized  0.9960    |
+---------------------------------------------+

As you can see, the model size is 3.18 MiB, while the float-32 equivalent is 40.10 MiB. Indeed, if you look at the quicknet.h5 file you just saved, you'll see that it is around 42 MiB in size. This is because the saved model is still unoptimized: the weights are stored as 32-bit floats rather than packed binary values, so executing this model on any device would not be fast.
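
To see where these numbers come from, here is a minimal back-of-the-envelope sketch (not part of the guide's scripts) that reproduces them from the parameter counts in the summary above:

# Rough size estimate based on the parameter counts reported by lq.models.summary:
# 9,990,144 binary (1-bit) weights and 521,088 float (32-bit) weights.
n_binary, n_float = 9_990_144, 521_088

float32_mib = (n_binary + n_float) * 4 / 2**20     # every weight stored as a 4-byte float
packed_mib = (n_binary / 8 + n_float * 4) / 2**20  # binary weights packed 8 per byte

print(f"Float-32 equivalent: {float32_mib:.2f} MiB")  # ~40.1 MiB
print(f"Packed model size:   {packed_mib:.2f} MiB")   # ~3.2 MiB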

Converting the model

Larq Compute Engine is built on top of TensorFlow Lite and therefore uses the TensorFlow Lite FlatBuffer format to convert and serialize Larq models for inference. We provide our own LCE Model Converter to convert models from Keras to flatbuffers; it contains additional optimization passes that increase the execution speed of Larq models on the supported target platforms.

Using this converter is very simple: just add the following code to the Python script above:

# Convert our Keras model to a TFLite flatbuffer file
with open("quicknet.tflite", "wb") as flatbuffer_file:
    flatbuffer_bytes = lce.convert_keras_model(model)
    flatbuffer_file.write(flatbuffer_bytes)

This will produce the converted quicknet.tflite file with compressed weights and optimized operations, which is only just over 3 MiB in size!
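
If you want to verify this programmatically, a quick sanity check (a minimal sketch, assuming the file was written to the current working directory) could look like this:

import os

# Print the size of the converted flatbuffer in MiB
size_mib = os.path.getsize("quicknet.tflite") / 2**20
print(f"quicknet.tflite: {size_mib:.2f} MiB")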

Benchmarking

This part of the guide assumes that you'll want to benchmark on a 64-bit ARM-based system such as a Raspberry Pi. For more detailed instructions on benchmarking, and for benchmarking on Android phones, see the Benchmarking guide.

On ARM, benchmarking is as simple as downloading the pre-built benchmarking binary from the latest release to the target device and running it with the converted model:

Warning

The following code should be executed on the target platform, e.g. a Raspberry Pi. The exclamation marks should be removed, but are necessary here to make this valid notebook syntax.

!wget https://github.com/larq/compute-engine/releases/download/v0.4.2/lce_benchmark_model_aarch64 -O lce_benchmark_model
!chmod +x lce_benchmark_model
!./lce_benchmark_model --graph=quicknet.tflite --num_runs=50 --num_threads=1
STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
...
...
Loaded model quicknet.tflite
The input model file size (MB): 3.3512
Initialized session in 1.471ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=16 first=41707 curr=31870 min=31722 max=41707 avg=32491.4 std=2395

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=31929 curr=31278 min=31138 max=31929 avg=31420.6 std=260

Inference timings in us: Init: 1471, First inference: 41707, Warmup (avg): 32491.4, Inference (avg): 31420.6

The number of interest here is Inference (avg), which in this case is 31.4 ms (31420.6 microseconds) on a Raspberry Pi 4B.

To see the other available benchmarking options, add --help to the command above.
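
For example, running the following on the target device prints all options and, as an illustration, repeats the benchmark with four threads (the exact set of available flags may vary between releases):

!./lce_benchmark_model --help
!./lce_benchmark_model --graph=quicknet.tflite --num_runs=50 --num_threads=4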

Create your own Larq model

Instead of using one of our models, you probably want to benchmark a custom model that you trained yourself. For more information on creating and training a BNN with Larq, see our Larq User Guides. For best practices on optimizing Larq models for LCE, also see our Model Optimization Guide.

The code below defines a simple BNN model that takes a 32x32 input image and classifies it into one of 10 classes.

import larq as lq
import larq_compute_engine as lce
import tensorflow as tf

# Define a custom model
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Input((32, 32, 3), name="input"),
        # First layer (float)
        tf.keras.layers.Conv2D(32, kernel_size=(5, 5), padding="same", strides=3),
        tf.keras.layers.BatchNormalization(),
        # Note: we do NOT add a ReLU here, because the subsequent activation quantizer would destroy all information!
        # Second layer (binary)
        lq.layers.QuantConv2D(
            32,
            kernel_size=(3, 3),
            padding="same",
            strides=2,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            use_bias=False  # We don't need a bias, since the BatchNorm already has a learnable offset
        ),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        # Third layer (binary)
        lq.layers.QuantConv2D(
            64,
            kernel_size=(3, 3),
            padding="same",
            strides=2,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            use_bias=False  # We don't need a bias, since the BatchNorm already has a learnable offset
        ),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        # Pooling and final dense layer (float)
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)
lq.models.summary(model)

# Note: Realistically, you would of course want to train your model before converting it!

# Convert our Keras model to a TFLite flatbuffer file
with open("custom_model.tflite", "wb") as flatbuffer_file:
    flatbuffer_bytes = lce.convert_keras_model(model)
    flatbuffer_file.write(flatbuffer_bytes)
+sequential_4 stats---------------------------------------------------------------------------------------------+
| Layer                       Input prec.           Outputs  # 1-bit  # 32-bit  Memory  1-bit MACs  32-bit MACs |
|                                   (bit)                        x 1       x 1    (kB)                          |
+---------------------------------------------------------------------------------------------------------------+
| conv2d_5                              -  (-1, 11, 11, 32)        0      2432    9.50           0       290400 |
| batch_normalization_29                -  (-1, 11, 11, 32)        0        64    0.25           0            0 |
| quant_conv2d_21                       1    (-1, 6, 6, 32)     9216         0    1.12      331776            0 |
| batch_normalization_30                -    (-1, 6, 6, 32)        0        64    0.25           0            0 |
| activation_10                         -    (-1, 6, 6, 32)        0         0       0           ?            ? |
| quant_conv2d_22                       1    (-1, 3, 3, 64)    18432         0    2.25      165888            0 |
| batch_normalization_31                -    (-1, 3, 3, 64)        0       128    0.50           0            0 |
| activation_11                         -    (-1, 3, 3, 64)        0         0       0           ?            ? |
| global_average_pooling2d_4            -          (-1, 64)        0         0       0           ?            ? |
| dense_5                               -          (-1, 10)        0       650    2.54           0          640 |
+---------------------------------------------------------------------------------------------------------------+
| Total                                                        27648      3338   16.41      497664       291040 |
+---------------------------------------------------------------------------------------------------------------+
+sequential_4 summary--------------------------+
| Total params                      31 k       |
| Trainable params                  30.7 k     |
| Non-trainable params              256        |
| Model size                        16.41 KiB  |
| Model size (8-bit FP weights)     6.63 KiB   |
| Float-32 Equivalent               121.04 KiB |
| Compression Ratio of Memory       0.14       |
| Number of MACs                    789 k      |
| Ratio of MACs that are binarized  0.6310     |
+----------------------------------------------+

Now that the model is converted, it is useful to visualize it in Netron to make sure the network looks as expected. Simply go to Netron and select the .tflite file to visualize it (press Ctrl + K to switch to horizontal mode). For the model above, the first part of the flatbuffer looks like this:

There is an unexpected Mul operation between the two binary convolutions, fused with a ReLU. Since BatchNormalization and ReLU can be efficiently fused into the convolution operation, this indicates that something about our model configuration is suboptimal.

There are two culprits here, both explained in the Model Optimization Guide:

  1. The order of BatchNormalization and ReLU is incorrect. Not only does this prevent fusing these operators with the convolution, but since ReLU produces only positive values, the subsequent LCEQuantize operation will turn the entire output into ones, and the network cannot learn anything. This can be easily fixed by reversing the order of these two operations:

    # Example code for a correct ordering of binary convolution, ReLU and BatchNorm.
    lq.layers.QuantConv2D(
        32,
        kernel_size=(3, 3),
        padding="same",
        strides=2,
        input_quantizer="ste_sign",
        kernel_quantizer="ste_sign",
        kernel_constraint="weight_clip",
        use_bias=False   # We don't need a bias, since the BatchNorm already has a learnable offset
    )
    tf.keras.layers.Activation("relu")
    tf.keras.layers.BatchNormalization()
    
  2. However, if you change the model definition above to incorporate these changes, the graph looks like this:

    This is even worse, because there are now two unfused operations (ReLU and Mul) instead of one (Mul with fused ReLU).

    This is because, while the binary convolutions use padding="same", no padding value was specified, so the default value of 0 is used. Since the binarized input can only take the two values -1 and 1, the padding value 0 cannot be represented in the input tensor, which means an additional correction step is necessary and the ReLU cannot be fused. This can be resolved by using pad_values=1 for the binary convolutions:

    # Example code for a fusable configuration of a binary convolution with "same" padding, including ReLU and BatchNorm.
    lq.layers.QuantConv2D(
        32,
        kernel_size=(3, 3),
        padding="same",
        pad_values=1,
        strides=2,
        input_quantizer="ste_sign",
        kernel_quantizer="ste_sign",
        kernel_constraint="weight_clip",
        use_bias=False   # We don't need a bias, since the BatchNorm already has a learnable offset
    )
    tf.keras.layers.Activation("relu")
    tf.keras.layers.BatchNormalization()
    

After making these final changes to the model definition above, the model looks correct at last:

The ReLU and BatchNormalization operations have now successfully been fused into the convolution operation, meaning the inference engine just has to execute a single operation instead of three!
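
For completeness, here is a sketch of the full corrected model definition with both fixes applied (ReLU placed before BatchNormalization, and pad_values=1 on the binary convolutions); all other hyperparameters are identical to the definition above:

import larq as lq
import tensorflow as tf

# Corrected model: ReLU comes before BatchNormalization, and the binary
# convolutions pad with 1 so that padding stays representable in binary.
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Input((32, 32, 3), name="input"),
        # First layer (float)
        tf.keras.layers.Conv2D(32, kernel_size=(5, 5), padding="same", strides=3),
        tf.keras.layers.BatchNormalization(),
        # Second layer (binary)
        lq.layers.QuantConv2D(
            32,
            kernel_size=(3, 3),
            padding="same",
            pad_values=1,
            strides=2,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            use_bias=False,  # We don't need a bias, since the BatchNorm already has a learnable offset
        ),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.BatchNormalization(),
        # Third layer (binary)
        lq.layers.QuantConv2D(
            64,
            kernel_size=(3, 3),
            padding="same",
            pad_values=1,
            strides=2,
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            use_bias=False,  # We don't need a bias, since the BatchNorm already has a learnable offset
        ),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.BatchNormalization(),
        # Pooling and final dense layer (float)
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)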

Next Steps

Now that you've successfully created and benchmarked your own BNN, you probably want to use it for a custom application. For information on using Larq Compute Engine for inference, check out our C++ and Android inference guides.