Questions about ONNX

Hi community,

I have tried the convert_graph_to_onnx.py script to convert a Transformers model from PyTorch to ONNX format. I have a few questions:

  1. I have installed onnxruntime-gpu. Will the model generated with the script work only on GPU, or will it also work with the CPU ONNX Runtime? In other words, do I have to generate one ONNX model per device?

  2. Is the ONNX model dependent on the hardware it was generated on, or do I have to generate the ONNX model on the target hardware where inference will be run?

  3. Are the outputs of the ONNX model identical whatever hardware the inference is run on? In other words, can I use embeddings generated from the same ONNX model on different hardware platforms interchangeably?

  4. How can I apply quantization to an ONNX model for both CPU and GPU devices? It seems that the --quantize flag is deprecated, and I can’t manage to apply dynamic quantization to my ONNX model.

Thanks!


Hi @Matthieu,

Here are a few tentative answers to your questions (I’m somewhat new to using ONNX):

  1. If you are not running optimisations like dynamic quantisation, then the resulting ONNX model should work on both CPU and GPU. You can see in the source code here that convert_graph_to_onnx.py has a convert function that relies on the native ONNX export in PyTorch (torch.onnx); ONNX Runtime is only used in the optimize function here. (A minimal export example is sketched after this list.)
  2. Similar to question 1, if you are not applying any optimisations then my understanding is that the resulting model should be hardware-independent (this is meant to be the whole benefit of having a universal format like ONNX :))
  3. Interesting question. I don’t know the answer, but my naive guess is that it depends on which hardware accelerator you’re using (e.g. ONNX Runtime vs something else). One way to check empirically is sketched after this list.
  4. As far as I know, quantisation for GPU is not supported (see issue here), but it definitely is for CPU. What kind of trouble are you running into? One thing you can try is using ONNX Runtime directly with an exported ONNX model as follows:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to the exported float32 model and to the quantized output file
model_input = "onnx/model.onnx"
model_output = "onnx/model.quant.onnx"

# Dynamically quantize the weights to 8-bit integers (a CPU-oriented optimisation)
quantize_dynamic(model_input, model_output, weight_type=QuantType.QInt8)

There are also some notebooks on the ONNX Runtime repo that I found useful: onnxruntime/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb at master · microsoft/onnxruntime · GitHub
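
To make point 1 above more concrete, here is a minimal sketch of the export step itself, using the convert function mentioned above. I’m using bert-base-uncased and an onnx/ output folder purely as placeholders; swap in your own checkpoint and path:

from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Export a PyTorch model to a plain (unoptimised, unquantised) ONNX graph.
# The same file should then load in both the CPU and GPU builds of ONNX Runtime.
convert(
    framework="pt",                  # use the PyTorch exporter
    model="bert-base-uncased",       # placeholder checkpoint name
    output=Path("onnx/model.onnx"),  # where the ONNX graph is written
    opset=11,                        # ONNX opset version
)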
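
And for point 3, one way to check empirically is to run the same inputs through the exported model on two execution providers and compare the outputs. This sketch assumes onnxruntime-gpu is installed and reuses the hypothetical onnx/model.onnx path from above; you should expect agreement up to small floating point differences, not bit-for-bit equality:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Tokenize a sample sentence and build the feed dict expected by the exported graph
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello, ONNX!", return_tensors="np")
feed = {name: np.asarray(value, dtype=np.int64) for name, value in encoded.items()}

# Run the same model on the CPU and CUDA execution providers
cpu_session = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
gpu_session = ort.InferenceSession("onnx/model.onnx", providers=["CUDAExecutionProvider"])
cpu_embeddings = cpu_session.run(None, feed)[0]
gpu_embeddings = gpu_session.run(None, feed)[0]

# True if the outputs agree within a small numerical tolerance
print(np.allclose(cpu_embeddings, gpu_embeddings, atol=1e-4))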

Hi @lewtun,

Thanks again for the answers.

  1. Are BertOptimizationOptions (e.g. disabling the embedding layer norm optimisation for better model size reduction, as in https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) and the mixed precision optimisation hardware-dependent?
  2. Could you be more precise about what you mean by the different hardware accelerators being used?
  3. So I can export my ONNX model on my machine with onnxruntime-gpu installed, then use the CPU version of onnxruntime with the routine you posted to obtain a quantized ONNX model?

Thanks!

Hi @lewtun, whenever you have time I would be glad to have your feedback :slight_smile:

Is there an equivalent to BertOptimizationOptions for ALBERT?