Project Aim

The main aim of this project is to improve the inference speed of the current deep learning model stack for autonomous driving applications, with minimal loss of precision. The JdeRobot organization has created Behavior Metrics, a tool for comparing deep learning architectures for autonomous driving on different circuits with the support of Gazebo and ROS Noetic. The organization also provides another tool called DeepLearningStudio, which has datasets and model implementations for training deep learning models. I have used available tools and techniques such as TensorRT, quantization, pruning, and their variants to optimize the current model stack in both the PyTorch and TensorFlow frameworks.


Final simulation video

The following video provides an overview of the complete project and a comparison between the final models. All implementations, documentation, and experiments were contributed to the two official repositories of the JdeRobot organization, BehaviorMetrics and DeepLearningStudio, via Issues and Pull Requests (PRs).

TensorFlow Framework

I started my work with the TensorFlow framework. By the Phase 1 evaluation, the majority of the goals for implementing and experimenting with optimized models were completed. The following video provides an overview of the performance.

TF Lite Optimization

All of these optimization strategies are supported by TensorFlow Lite. They fall into two major categories: quantization and pruning. Counting the variants and combinations of techniques, a total of 10 optimization strategies were implemented. Benchmarking was done in offline mode over the test set (calculating mean squared error, model size, and inference time) and in online mode via live simulation on Gazebo.
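To make the idea behind quantization concrete, here is a minimal sketch in plain Python of an affine (scale plus zero-point) mapping from floats to 8-bit integers. It is an illustration only, not the TensorFlow Lite implementation; the function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to int8 using an affine scheme (scale + zero-point)."""
    lo, hi = min(weights), max(weights)
    # Make sure the range contains 0.0 so that zero stays exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [-0.8, -0.1, 0.0, 0.35, 1.2]
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.6f}, zero_point={zero_point}, max_error={max_error:.6f}")
```

Each weight is stored in one byte instead of the four bytes of a 32-bit float, and the round-trip error stays within about half of `scale`. That small error is why the MSE can remain close to the baseline.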

TensorFlow-TensorRT (TF-TRT)

TensorRT is an SDK for high-performance deep learning inference. It utilizes various techniques such as precision calibration, layer fusion, and kernel tuning. It also supports TensorFlow via the TensorFlow-TensorRT package. I have optimized the models in three different precisions (Int8, Float16, and Float32) and achieved excellent results.
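The trade-off between these precisions can be illustrated without TensorRT. The sketch below uses Python's standard `struct` module to round one value to IEEE 754 single and half precision and compare the rounding errors. The helper names and the sample value are hypothetical, and this says nothing about how TensorRT performs its conversion internally.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_float16(x):
    """Round a Python float (64-bit) to the nearest 16-bit (half) float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


value = 0.123456789  # arbitrary sample value
err32 = abs(to_float32(value) - value)
err16 = abs(to_float16(value) - value)

print(f"float32 error: {err32:.3e}")
print(f"float16 error: {err16:.3e}")
```

Half precision halves the storage per value at the cost of a larger rounding error, so it is worth checking the accuracy metric again after lowering the precision.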

PyTorch Framework

PyTorch supports similar optimization strategies. TorchScript is available for inference, and just changing the framework (from TensorFlow to PyTorch) already shows a noticeable performance difference. Most of the optimization strategies only support CPU inference, which could be a major downside. I still obtained comparable and mostly better results with PyTorch. A total of 6 strategies were implemented.

I have performed an extensive range of experiments on each optimization strategy and compiled the combined results in a separate blog page - Results Summary.

Summarizing the improvements

The essential observations and improvements achieved through the project are presented below. For exact numerical figures, please see the results page (mentioned above). For code implementation, please refer to the individual PRs.

TensorFlow framework

  • We achieved a ~12x reduction in the model memory size with Dynamic range quantization.
  • We maintained an MSE similar to the baseline (at best 0.001 better) in offline evaluation.
  • We achieved ~33x faster inference with TensorRT Int8 optimization and ~7.5x faster inference with Dynamic range quantization in offline evaluation.
  • We achieved a ~0.66x lower position deviation MAE and a ~12x higher brain iteration frequency (RT) in simulation.
  • We achieved a ~22x improvement in mean inference time in simulation.
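Offline figures such as MSE and mean inference time could be measured along the following lines. This is a minimal sketch with hypothetical stand-in models (simple Python functions) rather than the project's actual benchmarking code; a real run would call the baseline and optimized networks on the test set.

```python
import time


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def benchmark(predict, inputs, targets):
    """Return (MSE, mean inference time per sample in seconds) for one model."""
    predictions = []
    start = time.perf_counter()
    for x in inputs:
        predictions.append(predict(x))
    elapsed = time.perf_counter() - start
    return mean_squared_error(predictions, targets), elapsed / len(inputs)


# Hypothetical stand-ins for a baseline model and an optimized model.
def baseline_model(x):
    return 2.0 * x + 1.0


def optimized_model(x):
    # Rounding mimics the small accuracy loss of a lower-precision model.
    return round(2.0 * x + 1.0, 2)


inputs = [i / 100 for i in range(1000)]
targets = [2.0 * x + 1.0 for x in inputs]

base_mse, base_time = benchmark(baseline_model, inputs, targets)
opt_mse, opt_time = benchmark(optimized_model, inputs, targets)
speedup = base_time / opt_time  # values above 1 mean the optimized model is faster

print(f"baseline:  mse={base_mse:.6f}, time={base_time:.2e} s")
print(f"optimized: mse={opt_mse:.6f}, time={opt_time:.2e} s, speedup={speedup:.2f}x")
```

An "Nx" improvement in inference time corresponds to the `speedup` ratio here. The stand-in functions are too trivial for their timings to mean anything, so only the structure of the measurement carries over.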

PyTorch framework

  • We achieved, at best, a ~4x memory reduction with the Static Quantization technique.
  • The MSE is improved by 0.011 from baseline using Prune + Quantization. Other techniques also give a slightly better performance.
  • The inference times of all methods are of the same order of magnitude (10^-3). The Local Prune strategy has the best inference speed (CPU).
  • Quantization + Prune gives the best performance overall. However, all other strategies are also very close.
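The difference between the local and global pruning strategies can be sketched in plain Python. Local pruning removes the smallest-magnitude weights within each layer separately, while global pruning ranks the weights of all layers together. The functions and the toy layers below are hypothetical illustrations and do not reproduce the project's code; PyTorch provides comparable functionality in `torch.nn.utils.prune`.

```python
def prune_local(layers, amount):
    """Zero the smallest-magnitude `amount` fraction of weights in each layer."""
    pruned = []
    for weights in layers:
        k = int(len(weights) * amount)
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        drop = set(order[:k])
        pruned.append([0.0 if i in drop else w for i, w in enumerate(weights)])
    return pruned


def prune_global(layers, amount):
    """Zero the smallest-magnitude `amount` fraction of weights across all layers."""
    ranked = sorted(
        (abs(w), li, wi)
        for li, weights in enumerate(layers)
        for wi, w in enumerate(weights)
    )
    k = int(len(ranked) * amount)
    drop = {(li, wi) for _, li, wi in ranked[:k]}
    return [
        [0.0 if (li, wi) in drop else w for wi, w in enumerate(weights)]
        for li, weights in enumerate(layers)
    ]


# Two toy "layers": the second one holds mostly small weights.
layers = [[0.9, -0.8, 0.7, -0.6], [0.05, -0.04, 0.03, 0.5]]
local = prune_local(layers, 0.5)
global_ = prune_global(layers, 0.5)

print(local)
print(global_)
```

With 50% sparsity, local pruning zeroes half of each layer, whereas global pruning removes the second layer entirely because all of its weights are smaller than those of the first. This shows why the two strategies can behave differently on the same model.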


  • PyTorch optimized models are smaller in size and have better inference time. I would recommend the Global/Local Prune strategy for a start because they also support GPU inference. Quantization + Prune has the lowest MSE but only supports CPU inference.
  • TFLite optimized models give better performance than the original models with a much smaller memory footprint. The installation is easy and there are no specific hardware constraints. I would recommend Dynamic range quantization as the first optimization method.
  • TensorRT optimized models have the best performance in both offline evaluation and simulation. However, they have a large memory footprint. If disk space is not a constraint, I would recommend using the Int8 or Float16 precision model.

Future work

TensorRT also supports the PyTorch framework through the Torch-TensorRT compiler. It would be interesting to compare the same TensorRT optimizations between the two popular deep learning frameworks (TensorFlow and PyTorch). We already have everything set up for offline and online benchmarking, which could provide additional insights for our project. I have saved helpful resources for installation and development as follows: