An important part of the exercise is to provide the user with relevant information and guides for training and fine-tuning their models in different frameworks. This covers everything from collecting data and preprocessing it to fine-tuning a pre-existing model architecture with it. Since the process of exporting models to the ONNX format differs between frameworks, it becomes necessary to include it under each framework’s guide.

PyTorch Guide

I have documented a guide for the PyTorch implementation. Please refer to it below for detailed information.

In short this is what the guide contains:

  • Testing a PyTorch-based pre-trained MobileNet-SSD model for detecting humans and converting it into the ONNX format.
  • Fine-tuning the model on a custom dataset for a specific object class from the Open Images dataset.
  • Testing and running inference on the fine-tuned model.
  • Converting the fine-tuned model to the ONNX format.

Challenges and errors faced

  • Collecting all the relevant information to make the process of fine-tuning as easy as possible was a challenge in itself.

  • The ONNX conversion code used in the guide loaded the fine-tuned state dictionary into the model architecture in a CPU-specific format, whereas I was using a GPU environment on Colab. This resulted in the following error:
    expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
    

    This was solved by moving the model and the dummy input to the CUDA device:

    net.to("cuda")
    dummy_input = torch.Variable(1,3,300,300).to("cuda")
    
  • The ONNX conversion code also gave a strange error during conversion:
    IndexError: Input 475 is undefined
    

    Debugging this was strange, as the error did not display any other information. Checking the traceback made me realize that it occurred inside the onnx_graph_to_caffe2_net(model) function. I analyzed the code, realized there was no need for this function in my implementation, and got rid of it; the remaining export step is sketched below. Everything worked fine after that.
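
    For reference, here is a minimal sketch of what the export step could look like after the fixes above, assuming torch.onnx.export is used directly. The variable net (the fine-tuned MobileNet-SSD), the 300x300 input size, and the input/output names are assumptions, so adjust them to your own model:

    import torch

    # net is assumed to be the fine-tuned MobileNet-SSD model loaded earlier in the guide
    net.to("cuda")
    net.eval()

    # Dummy input on the same device as the model, matching SSD's 300x300 input
    dummy_input = torch.randn(1, 3, 300, 300, device="cuda")

    # Export straight to ONNX; no Caffe2 conversion step is needed
    torch.onnx.export(
        net,
        dummy_input,
        "mobilenet_ssd_finetuned.onnx",
        input_names=["input"],
        output_names=["scores", "boxes"],  # assumed names; match your model's outputs
    )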



TensorFlow Guide

Luckily, there exists a well-known object detection API for TensorFlow models called the TensorFlow Object Detection API. It is an open-source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models. Since it is so popular and widely used, I didn’t have to write a personalized guide as I did for the PyTorch models. Apart from their official GitHub page here, which you can refer to for a more detailed look, I went through some good blogs and articles explaining its use for fine-tuning an object detection model. Here’s the one I found most useful:

This guide walks you through using the TensorFlow Object Detection API to train a MobileNet Single Shot Detector (v2) on your own dataset. Here’s the complete Jupyter notebook guide for the above article:

I strongly suggest going through it: apart from covering the usage of the TensorFlow Object Detection API, it also explains and largely automates the entire process of preprocessing the training data using Roboflow, which handles the images, annotations, TFRecord file and label_map generation.

Challenges and errors faced

  • Conversion to the ONNX format from the frozen_inference_graph.pb file - After successfully fine-tuning the model following the above guide, two files of importance were generated, namely frozen_inference_graph.pb and saved_model.pb, along with some checkpoint files, which were not of great importance to me since my model was already fine-tuned. Before stating the issue and the process of debugging it, let me first explain the significance of these files in the context of TensorFlow models.

    • frozen_inference_graph.pb - A frozen graph that cannot be trained any further. It holds the graph definition as a serialized GraphDef and can be loaded with this code:
      import tensorflow as tf

      def load_graph(frozen_graph_filename):
          # Read the serialized GraphDef from the frozen graph file
          with tf.gfile.GFile(frozen_graph_filename, "rb") as f:
              graph_def = tf.GraphDef()
              graph_def.ParseFromString(f.read())
          return graph_def

      tf.import_graph_def(load_graph("frozen_inference_graph.pb"))
    
    • saved_model.pb - The saved model is generated by tf.saved_model.builder and has to be imported into a session. It contains the full graph with all trained weights (just like the frozen graph) but can still be trained upon, and it is not serialized in the same way; it needs to be loaded with this snippet:
      with tf.Session() as sess:
          tf.saved_model.loader.load(sess, [], "foldername to saved_model.pb, only folder")
    


Both of the above files can be used independently to export the model to the ONNX format, which I was initially not aware of. I started exporting from frozen_inference_graph.pb and ran into errors, since conversion from a frozen graph also requires the input and output nodes of the model graph to be specified. Googling the issue was of great help and solved it. In short, this snippet does the work:

$ python -m tf2onnx.convert --graphdef tensorflow-model-graphdef-file --output model.onnx --inputs input0:0,input1:0 --outputs output0:0

If you do not know the input and output nodes of the model, you can use the summarize_graph TensorFlow utility.
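
If building summarize_graph is inconvenient, a rough alternative is to inspect the GraphDef directly. This is only a sketch and a heuristic, assuming a TensorFlow 1.x frozen graph: placeholders are usually the inputs, and nodes whose outputs nothing else consumes are usually the outputs.

    import tensorflow as tf

    # Load the frozen GraphDef
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())

    # Placeholder ops are usually the graph inputs
    inputs = [n.name for n in graph_def.node if n.op == "Placeholder"]

    # Nodes that no other node consumes are likely the graph outputs (heuristic)
    consumed = {i.split(":")[0].lstrip("^") for n in graph_def.node for i in n.input}
    outputs = [n.name for n in graph_def.node if n.name not in consumed]

    print("inputs:", inputs)
    print("outputs:", outputs)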

  • Next, I tried exporting to the ONNX format from the saved_model.pb file. This is what most people usually prefer, since in this case the input and output nodes of the model need not be mentioned. Overall this way is pretty simple, but like I always say, sometimes anything that can go wrong, will go wrong :) I used the following snippet and got a list of ValueErrors:
$ python -m tf2onnx.convert --saved-model tensorflow-model-path --output model.onnx
ValueError: get tensor value: 'Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/stack_5_Concat__1213' must be Const
ValueError: get tensor value: 'Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/stack_4_Concat__1241' must be Const
ValueError: get tensor value: 'Add__1257' must be Const
ValueError: get tensor value: 'Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/stack_1_Concat__1262' must be Const
ValueError: make_sure failure: Cannot find node with output 'Postprocessor/BatchMultiClassNonMaxSuppression/map/while/MultiClassNonMaxSuppression/Slice:0' in graph 'tf2onnx__61'

Debugging this was a headache and took quite some time and trichotillomania (pulling hair out of frustration :) ); to be honest, even the wording of the traceback seemed too alien for me to interpret for a Google search. Since the ONNX developer community is still fairly small, I couldn’t find a way around this error even after hours of searching and trying different suggestions from previously answered online queries about similar issues.
Around this time I read an article that mentioned the limited support for certain operations across ONNX opset versions. Something clicked, as I had seen a lot of Using opset <onnx, 9> in the terminal during conversion before the error appeared. It turns out tf2onnx was using the default ONNX opset version 9 to generate the graph, which did not support some operations specific to my model. Explicitly passing --opset 11 in the conversion command finally solved the issue for me:

$ python3 -m tf2onnx.convert --saved-model saved_model --opset 11 --output model.onnx
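
As a quick sanity check after conversion, the exported model can be loaded and run once. This is a minimal sketch, assuming the onnx and onnxruntime packages are installed; the uint8 1x300x300x3 dummy input matches the image_tensor input of the TF Object Detection API’s SSD models, so adjust it if your model differs.

    import numpy as np
    import onnx
    import onnxruntime as ort

    # Structural validation of the exported graph
    onnx.checker.check_model(onnx.load("model.onnx"))

    # Run a single dummy inference to confirm the graph executes end to end
    sess = ort.InferenceSession("model.onnx")
    input_name = sess.get_inputs()[0].name
    # Assumed uint8 NHWC input; change the shape/dtype to whatever your model expects
    dummy = np.random.randint(0, 255, size=(1, 300, 300, 3), dtype=np.uint8)
    outputs = sess.run(None, {input_name: dummy})
    print([o.shape for o in outputs])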