Coding Period: Week 7

5 minute read

Preliminaries

Last week, we cleared a major milestone: the GSoC mid-term evaluation for the project! Everyone is happy with the results and the project pace, and I hope to keep up the same momentum. We also uploaded a video explaining the results and progress made so far - Tweet. This week, I performed verification of the TensorRT optimized models in online simulation, and each run was recorded for analysis and verification. I also combine all the results so far and present them together in this blog post. Moreover, to support TensorRT optimization, the repository code has been updated and PRs have been created to include the changes. I also present a correction to the quantization-aware training optimization and benchmark the corrected model. Since I use my personal computer for the simulations, which has limited disk space, I explored some memory-efficient ways to run all these experiments, and I share my findings so that others can use them in similar situations.

Objectives

  • Complete video demonstration and Mid-term evaluation
  • Experiment applying tflite optimization strategy over TensorRT optimized model
  • Benchmark TensorRT optimized models via simulation
  • Present all the results table in single blog
  • Create PRs to include the changes for TensorRT model inference

Additional

  • Fix Quantization aware training and benchmark
  • Save and share all simulation recording and trained weights online

Related to the BehaviorMetrics repository:

Related to the DeepLearningStudio repository:

The execution

Applying tflite optimization strategies over TensorRT optimized model

Unfortunately, I observed no change from the typical behavior of the tflite optimization strategies. After applying them, the size of the models was reduced as expected, but the performance boost from TensorRT was gone. We can conclude that, for our current PilotNet architecture, the gains from TensorRT are not preserved by the tflite optimization strategies, so there is no benefit in applying them consecutively.
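For reference, the kind of conversion tried here can be sketched as follows. This is a minimal sketch assuming a standard TensorFlow 2.x install; the model path is hypothetical:

```python
import tensorflow as tf

# Hypothetical path to an exported SavedModel (plain or TF-TRT optimized).
SAVED_MODEL_DIR = "models/pilotnet_savedmodel"

def convert_dynamic_range(saved_model_dir: str) -> bytes:
    """Apply TF-Lite dynamic range quantization to a SavedModel."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optimize.DEFAULT enables dynamic range quantization: weights are
    # stored as int8, activations stay in float at runtime.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()

# Example usage (path is illustrative):
# tflite_model = convert_dynamic_range(SAVED_MODEL_DIR)
# with open("pilotnet_dynamic_range.tflite", "wb") as f:
#     f.write(tflite_model)
```

The conversion itself succeeds on a TF-TRT SavedModel; it is the TensorRT-specific runtime speedup that does not survive the round trip.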

Benchmark TensorRT optimized model via Simulation

I used my personal computer with a 4 GB NVIDIA GeForce GTX 1050/PCIe/SSE2 GPU and a batch size of 1 (for inference) on SimpleCircuit. Initially I thought that TensorRT also needed to be installed in the BehaviorMetrics environment to use the optimized models. However, I had very little disk space left on my computer to maintain two conda environments, store trained models and run a Docker container. So I tried some methods to reduce the disk usage - mounting volumes into a running container and changing the Docker data directory (https://blog.adriel.co.nz/2018/01/25/change-docker-data-directory-in-debian-jessie/). I also encountered an issue - TensorFlow 2.4.1 hanging on any operation under conda - which was actually related to the albumentations package and required re-installation. All these failed attempts took me more than half a day, but I learned a lot from them. Fortunately, in the end I figured out that the TensorRT optimized models can be loaded directly, without a separate TensorRT installation in the BehaviorMetrics environment.
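For others in the same situation: the Docker data-directory trick mentioned above amounts to pointing `data-root` at a larger partition in `/etc/docker/daemon.json` (the path below is just an example):

```json
{
  "data-root": "/mnt/bigdisk/docker"
}
```

After editing the file, restart the daemon (`sudo systemctl restart docker`); existing images can be copied to the new location first, e.g. with `rsync`.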

PilotNet (baseline)

TF-TRT (Precision = Float32)

TF-TRT (Precision = Float16)

TF-TRT (Precision = Int8)

We can observe that the performance is better than with all previous optimization methods. The quantitative results are presented together with the other experimental results below. All the simulation videos are shared at the link presented above.
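As a reference for the precision variants above, the TF-TRT conversion step can be sketched like this. This is only a sketch assuming a TensorFlow build with TensorRT support; the paths and function name are hypothetical:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def build_tftrt_converter(saved_model_dir: str, precision: str):
    """Create a TF-TRT converter for precision 'FP32', 'FP16' or 'INT8'."""
    params = trt.TrtConversionParams(precision_mode=precision)
    return trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_dir,
        conversion_params=params)

# Example usage (INT8 additionally needs a calibration_input_fn for convert()):
# converter = build_tftrt_converter("models/pilotnet_savedmodel", "FP16")
# converter.convert()
# converter.save("models/pilotnet_tftrt_fp16")
```

The saved output is a regular SavedModel, which is why it can be loaded for inference without a separate TensorRT environment, as noted above.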

Quantization Aware Training correction and benchmarking

Simulation

The correction is included in PR#67. The simulation results are as follows:

Offline evaluation

The offline results are:

| Method | Model size (MB) | MSE | Inference time (s) |
| --- | --- | --- | --- |
| Baseline | 64.917 | 0.04108 | 0.01012 |
| Q-aware training | 16.251 | 0.04214 | 0.00955 |

These numbers cannot be directly compared with the other results, because performance depends heavily on hardware load and specifications, but we can compare against the baseline measured in the same run: there is a slight improvement in inference time, with a ~4x smaller model.
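For transparency about how per-sample inference times like these are typically measured, here is a minimal timing harness; the `predict` callable is a stand-in for any model:

```python
import time

def mean_inference_time(predict, batch, warmup=10, runs=100):
    """Average per-call latency in seconds, excluding warm-up calls."""
    for _ in range(warmup):       # let caches, JIT and GPU kernels settle
        predict(batch)
    start = time.perf_counter()
    for _ in range(runs):
        predict(batch)
    return (time.perf_counter() - start) / runs
```

The warm-up phase matters especially for TF-TRT models, whose TensorRT engines are built lazily on the first call.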

Combined result table - Simulation

All the simulations are conducted on a 4 GB NVIDIA GeForce GTX 1050/PCIe/SSE2 GPU with a batch size of 1 on SimpleCircuit. Only models that completed one lap are presented here.

| Method | Average speed | Position deviation MAE | Brain iteration frequency (RT) | Mean inference time (s) | Real time factor |
| --- | --- | --- | --- | --- | --- |
| PilotNet (original) | 8.386 | 7.406 | 5.585 | 0.124 | 0.557 |
| Dynamic Range Q | 8.534 | 6.693 | 58.474 | 0.010 | 0.54 |
| TF-TRT Baseline | 8.536 | 5.06 | 73.37 | 0.0063 | 0.47 |
| TF-TRT FP32 | 8.32 | 4.94 | 60.28 | 0.0065 | 0.50 |
| TF-TRT FP16 | 8.14 | 5.39 | 71.90 | 0.0056 | 0.48 |
| TF-TRT Int8 | 8.01 | 6.65 | 59.36 | 0.0067 | 0.51 |

Combined result table - Offline scripting

All the offline experiments are conducted on an NVIDIA V100 GPU with 32 GB memory and a batch size of 1024. All subsets of the new dataset are used for the experiments; the test set covers the SimpleCircuit, Montreal and Montemelo circuits.

| Method | Model size (MB) | MSE | Inference time (s) |
| --- | --- | --- | --- |
| PilotNet (original tf format) | 195 | 0.041 | 0.0364 |
| Baseline (tflite format) | 64.917 | 0.04108 | 0.00791 |
| Dynamic Range Q | 16.243 | 0.04098 | 0.00490 |
| Float16 Q | 32.464 | 0.04107 | 0.00794 |
| Weight pruning | 64.917 | 0.04258 | 0.00773 |
| Weight pruning + Q | 16.243 | 0.04261 | 0.00481 |
| Integer only Q | 16.245 | 28157.72 | 0.00791 |
| Integer (float fallback) Q | 16.245 | 0.04507 | 0.00782 |

| Method | Model size (MB) | MSE | Inference time (s) |
| --- | --- | --- | --- |
| PilotNet (original tf format) | 195 | 0.041 | 0.0364 |
| Baseline (tflite format) | 64.917 | 0.04108 | 0.00791 |
| CQAT | 16.251 | 0.03938 | 0.00768 |
| PQAT | 16.251 | 0.04367 | 0.00795 |
| PCQAT | 16.251 | 0.03924 | 0.00795 |

Here, I used the server with an 8 GB NVIDIA GeForce GTX 1080/PCIe/SSE2 GPU and a batch size of 128 (for inference).

| Method | Model size (MB) | MSE | Inference time (s) |
| --- | --- | --- | --- |
| Baseline | 195 | 0.04103 | 0.00126 |
| Precision fp32 | 260 | 0.04103 | 0.00131 |
| Precision fp16 | 260 | 0.04103 | 0.00218 |
| Precision int8 | 260 | 0.04103 | 0.00118 |

Summarizing the improvements

  • We achieved a ~12x reduction in model memory size with dynamic range quantization.
  • We maintain an MSE similar to the baseline (at best 0.001 better) in offline evaluation.
  • We achieved a ~33x faster inference time with TensorRT Int8 optimization and ~7.5x faster inference with dynamic range quantization in offline evaluation.
  • We achieved a position deviation MAE ~0.66x that of the baseline (about 33% lower) and a ~12x higher brain iteration frequency (RT) in simulation.
  • We achieved a ~22x improvement in mean inference time in simulation.
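These headline ratios can be sanity-checked directly from the tables above (values below are rounded from the tables; the exact multipliers depend on which baseline pair is compared):

```python
# Inference times (s), rounded from the result tables above.
baseline_offline = 0.0364    # PilotNet, original tf format
dynamic_range_q  = 0.0049    # tflite dynamic range quantization, offline
trt_int8         = 0.00118   # TF-TRT int8 on the 1080 server
sim_baseline     = 0.124     # mean inference time in simulation
sim_trt_fp16     = 0.0056    # TF-TRT fp16 in simulation

print(round(baseline_offline / dynamic_range_q, 1))  # ~7.4
print(round(baseline_offline / trt_int8, 1))         # ~30.8
print(round(sim_baseline / sim_trt_fp16, 1))         # ~22.1
```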

Recommendations

  • Tflite optimized models give better performance than the original models with a much smaller memory footprint. Installation is easy and there are no specific hardware constraints. I would recommend dynamic range quantization as the first optimization method to try.
  • TensorRT optimized models show the best performance in both offline evaluation and simulation. However, they have a larger memory footprint. If disk space is not a constraint, I would recommend using the Int8 or Float16 precision model.

References

[1] https://github.com/JdeRobot/BehaviorMetrics
[2] https://github.com/JdeRobot/DeepLearningStudio
[3] https://developer.nvidia.com/tensorrt
[4] https://www.tensorflow.org/lite/performance/model_optimization
[5] https://www.tensorflow.org/model_optimization/guide/install
[6] https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html
[7] https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow
[8] https://www.tensorflow.org/model_optimization/guide