Coding Period: Week 7


Preliminaries

Last week we reached a milestone and passed the GSoC mid-term evaluation for the project! Everyone is happy with the results and the pace of the project, and I hope to keep the same momentum. We also uploaded a video explaining the results and progress made so far - Tweet. This week, I verified the performance of the TensorRT optimized models in online simulation. Each run was recorded for analysis and verification. We also aim to combine all the results so far and present them here in a single blog post. To support TensorRT optimization, the repository code was updated and PRs were created to include the changes. I also present a correction to the Quantization aware training optimization and benchmark the corrected model. Since I use my personal computer for simulation, which has limited space, I explored some space-saving ways to run all these experiments. I will also present my findings so others can use them in similar situations.

Objectives

Complete video demonstration and mid-term evaluation
Experiment with applying tflite optimization strategies over the TensorRT optimized model
Benchmark TensorRT optimized models via simulation
Present all the result tables in a single blog post
Create PRs to include the changes for TensorRT model inference

Additional

Fix Quantization aware training and benchmark
Save and share all simulation recordings and trained weights online

Related to the BehaviorMetrics repository:

Add support for inference with TensorRT optimized (TF) models #395

Related to the DeepLearningStudio repository:

Add support for inference optimization with TensorRT for Tensorflow models #71
QAT correction in latest commit - Add support for baseline evaluation and model optimization #67

Important links

BehaviorMetrics simulations - https://drive.google.com/drive/folders/1ovjuWjSy-ea7YtgnaSsgVsHnbo0HJY1A?usp=sharing
Trained weights - https://drive.google.com/drive/folders/1j2nnmfvRdQF5Ypfv1p3QF2p2dpNbXzkt?usp=sharing

The execution

Applying tflite optimization strategies over TensorRT optimized model

Unfortunately, applying the tflite optimization strategies on top of the TensorRT optimized model gave the same performance as applying them alone. The model size was reduced as expected, but the performance boost from TensorRT was lost. We can conclude that, for our current PilotNet architecture, the TensorRT changes are not carried over by the tflite optimization strategies, so there is no benefit in applying them one after the other.

Benchmark TensorRT optimized model via Simulation

I used my personal computer with a 4 GB NVIDIA GeForce GTX 1050/PCIe/SSE2 GPU and a batch size of 1 (for inference) on SimpleCircuit. Initially, I thought that TensorRT also needed to be installed in the BehaviorMetrics environment to use the optimized models. However, I had very little space left on my computer to keep two conda environments, store trained models and use a docker container. So I tried some methods to reduce the space needed - Mount volumes into a running container and https://blog.adriel.co.nz/2018/01/25/change-docker-data-directory-in-debian-jessie/. I also encountered an issue - Tensorflow 2.4.1 hangs on any operation using Conda - which was actually related to the albumentations package and was resolved by re-installing it. All these failed attempts took me more than half a day, but I learned a lot along the way. Fortunately, in the end I figured out that we can use the TensorRT optimized models directly.
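As a rough illustration of what using a TF-TRT optimized model directly can look like, here is a minimal sketch. It assumes the optimized model was exported as a TensorFlow 2 SavedModel with the default serving signature. The function names, the path and the input name are placeholders of mine, not code from BehaviorMetrics.

```python
def load_serving_fn(saved_model_dir, signature_key="serving_default"):
    """Load a SavedModel and return (model, signature function).

    The model object is returned as well so that it stays referenced
    for as long as the signature function is in use.
    """
    # Imported lazily so the sketch can be read and imported without TensorFlow.
    import tensorflow as tf

    model = tf.saved_model.load(saved_model_dir)
    return model, model.signatures[signature_key]


def predict(serving_fn, input_name, batch):
    """Call a signature function by keyword and return its first output."""
    outputs = serving_fn(**{input_name: batch})  # signatures return a dict of tensors
    return next(iter(outputs.values()))


# Hypothetical usage (path and batch are placeholders):
# model, serving_fn = load_serving_fn("path/to/trt_saved_model")
# input_name = next(iter(serving_fn.structured_input_signature[1]))
# prediction = predict(serving_fn, input_name, tf.constant(image_batch, dtype=tf.float32))
```

Looking up the input name from the signature avoids hard-coding it, since the name depends on how the model was exported.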

PilotNet (baseline)

TF-TRT (Precision = Float32)

TF-TRT (Precision = Float16)

TF-TRT (Precision = Int8)

We can observe that the performance is better than with all previous optimization methods. The quantitative results are presented together with the other experimental results. All the simulation videos are shared in the link presented above.

Quantization Aware Training correction and benchmarking

Simulation

The correction is included in PR #67. The simulation results are as follows:

Offline evaluation

The offline results are:

Method	Model size (MB)	MSE	Inference time (s)
Baseline	64.9173469543457	0.04108056542590139	0.01011916732788086
Q aware training	16.250564575195312	0.042138221871067326	0.009550530910491944

These numbers cannot be directly compared to the other results because performance depends heavily on hardware load and specifications, but we can compare them to the baseline. The model is about 4x smaller and inference is slightly faster, while the MSE is marginally higher.
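The comparison with the baseline can be made explicit with a little arithmetic. The sketch below takes the two rows of the table above (rounded) and reports each metric as a ratio to the baseline. The function and variable names are illustrative only and not part of the repository.

```python
# Rows copied (rounded) from the offline QAT table above.
baseline = {"size_mb": 64.917, "mse": 0.041081, "inference_s": 0.010119}
qat = {"size_mb": 16.251, "mse": 0.042138, "inference_s": 0.009551}


def ratio_to_baseline(candidate, reference):
    """Return candidate/reference for every metric (below 1.0 means smaller)."""
    return {key: candidate[key] / reference[key] for key in reference}


ratios = ratio_to_baseline(qat, baseline)
for metric, value in ratios.items():
    print(f"{metric}: {value:.3f}x of baseline")
```

With these numbers the model is about a quarter of the baseline size, inference takes about 94% of the baseline time and the MSE is about 2.6% higher.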

Combined result table - Simulation

All the simulations were conducted on a 4 GB NVIDIA GeForce GTX 1050/PCIe/SSE2 GPU with a batch size of 1 on SimpleCircuit. Only models that completed one lap are presented here.

Method	Average speed	Position deviation MAE	Brain iteration frequency (RT)	Mean Inference time (s)	Real time factor
PilotNet (original)	8.386	7.406	5.585	0.124	0.557
Dynamic Range Q	8.534	6.693	58.474	0.010	0.54
TF-TRT Baseline	8.536	5.06	73.37	0.0063	0.47
TF-TRT FP32	8.32	4.94	60.28	0.0065	0.50
TF-TRT FP16	8.14	5.39	71.90	0.0056	0.48
TF-TRT Int8	8.01	6.65	59.36	0.0067	0.51

Combined result table - Offline scripting

All the experiments were conducted on an NVIDIA V100 GPU with 32 GB memory. The batch size was 1024. All subsets of the new datasets were used for the experiments; the test set covers the SimpleCircuit, Montreal and Montemelo circuits.

Method	Model size (MB)	MSE	Inference time (s)
PilotNet (original tf format)	195	0.041	0.0364
Baseline (tflite format)	64.9173469543457	0.04108056542969754	0.007913553237915039
Dynamic Range Q	16.242530822753906	0.04098070281274293	0.004902467966079712
Float16 Q	32.464256286621094	0.041072421023905605	0.007940708875656129
Weight pruning	64.9173469543457	0.04257505210072217	0.0077278904914855956
Weight pruning + Q	16.242530822753906	0.042606822364652304	0.004810283422470093
Integer only Q	16.244918823242188	28157.721509850544	0.007908073902130127
Integer (float fallback) Q	16.244888305664062	0.04507085706016211	0.00781548523902893

Method	Model size (MB)	MSE	Inference time (s)
PilotNet (original tf format)	195	0.041	0.0364
Baseline (tflite format)	64.9173469543457	0.04108056542969754	0.007913553237915039
CQAT	16.250564575195312	0.0393811650675438	0.007680371761322021
PQAT	16.250564575195312	0.043669467093106665	0.007949142932891846
PCQAT	16.250564575195312	0.039242053481006144	0.007946955680847167

Here, I used the server with an 8 GB NVIDIA GeForce GTX 1080/PCIe/SSE2 GPU and a batch size of 128 (for inference).

Method	Model size (MB)	MSE	Inference time (s)
Baseline	195	0.041032556329194385	0.0012623071670532227
Precision fp32	260	0.04103255125749467	0.0013057808876037597
Precision fp16	260	0.04103255125749467	0.0021804444789886475
Precision int8	260	0.04103255125749467	0.0011799652576446533

Summarizing the improvements

We achieved a ~12x reduction in the model memory size with Dynamic range quantization.
We maintained an MSE similar to the baseline (at best 0.001 better) in offline evaluation.
We achieved ~33x faster inference with TensorRT Int8 optimization and ~7.5x faster inference with Dynamic range quantization in offline evaluation.
We reduced the Position deviation MAE to ~0.66x of the original and achieved ~12x higher Brain iteration frequency (RT) in simulation.
We achieved a ~22x improvement in Mean inference time in simulation.
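The factors quoted above are ratios of table entries. As a rough check, the sketch below recomputes some of them from the rounded values in the two combined tables. Which rows to pair is an assumption on my part (for example, TF-TRT FP32 for MAE and TF-TRT FP16 for inference time and frequency), and the helper name is illustrative.

```python
def factor(reference, candidate):
    """How many times smaller the candidate value is than the reference."""
    return reference / candidate


# Offline table: original tf-format PilotNet vs Dynamic range Q.
size_reduction = factor(195, 16.24)             # model size (MB)
offline_speedup_drq = factor(0.0364, 0.0049)    # inference time (s)

# Simulation table: original PilotNet vs TF-TRT rows.
mae_ratio = 4.94 / 7.406                        # TF-TRT FP32 MAE relative to original
sim_speedup = factor(0.124, 0.0056)             # TF-TRT FP16 mean inference time (s)
frequency_gain = 71.90 / 5.585                  # TF-TRT FP16 brain iteration frequency

print(f"model size reduction:      {size_reduction:.1f}x")
print(f"offline speedup (DRQ):     {offline_speedup_drq:.1f}x")
print(f"position deviation MAE:    {mae_ratio:.2f}x of original")
print(f"simulation speedup:        {sim_speedup:.1f}x")
print(f"brain iteration frequency: {frequency_gain:.1f}x")
```

The frequency gain depends on which TF-TRT row is chosen, so the ~12x figure is best read as an approximate value across the optimized models.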

Recommendations

Tflite optimized models give better performance than the original models with a much smaller model size. The installation is easy and there are no specific hardware constraints. I would recommend Dynamic range quantization as the first optimization method to try.
TensorRT optimized models have the best performance in both offline evaluation and simulation. However, they have a large memory footprint. If disk space is not a constraint, I would recommend using the Int8 or Float16 precision model.

References

[1] https://github.com/JdeRobot/BehaviorMetrics
[2] https://github.com/JdeRobot/DeepLearningStudio
[3] https://developer.nvidia.com/tensorrt
[4] https://www.tensorflow.org/lite/performance/model_optimization
[5] https://www.tensorflow.org/model_optimization/guide/install
[6] https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html
[7] https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow
[8] https://www.tensorflow.org/model_optimization/guide
