Questions about ONNX

Hi community,

I have tried the convert_graph_to_onnx.py script to convert a Transformers model from PyTorch to ONNX format. I have a few questions:

  1. I have installed onnxruntime-gpu. Will the model generated with the script work only on GPU, or will it also work with the CPU ONNX Runtime? In other words, do I have to generate one ONNX model per device?

  2. Is the ONNX model dependent on the hardware it was generated on, or do I have to generate the ONNX model on the target hardware where inference will be run?

  3. Are the outputs of the ONNX model identical whatever hardware the inference is run on? In other words, can I use embeddings generated from the same ONNX model on different hardware platforms interchangeably?

  4. How can I apply quantization to an ONNX model for both CPU and GPU devices? It seems that the --quantize flag is deprecated, and I can’t manage to apply dynamic quantization to my ONNX model.

Thanks!


Hi @Matthieu,

Here are a few tentative answers to your questions (I’m somewhat new to using ONNX):

  1. If you are not running optimisations like dynamic quantisation, then the resulting ONNX model should work on both CPU and GPU. You can see in the source code here that convert_graph_to_onnx.py has a convert function that relies on the native ONNX export in PyTorch (torch.onnx); ONNX Runtime is only used in the optimize function here. (A minimal export example is sketched after this list.)
  2. Similar to question 1, if you are not applying any optimisations then my understanding is that the resulting model should be hardware-independent (this is meant to be the whole benefit of having a universal format like ONNX :))
  3. Interesting question. I don’t know the answer, but my naive guess is that it depends on which hardware accelerator you’re using (e.g. ONNX Runtime vs something else). One way to check empirically is sketched after this list.
  4. As far as I know, quantisation for GPU is not supported (see issue here), but it definitely is for CPU. What kind of trouble are you running into? One thing you can try is using ONNX Runtime directly with an exported ONNX model as follows:
from onnxruntime.quantization import quantize_dynamic, QuantType

# Path to the exported float32 model and to the quantized output file
model_input = "onnx/model.onnx"
model_output = "onnx/model.quant.onnx"

# Dynamically quantize the weights to 8-bit integers (a CPU-oriented optimisation)
quantize_dynamic(model_input, model_output, weight_type=QuantType.QInt8)

There are also some notebooks on the ONNX Runtime repo that I found useful: onnxruntime/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb at master · microsoft/onnxruntime · GitHub
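
To make point 1 above more concrete, here is a minimal sketch of the export step itself, using the convert function mentioned above. I’m using bert-base-uncased and an onnx/ output folder purely as placeholders; swap in your own checkpoint and path:

from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Export a PyTorch model to a plain (unoptimised, unquantised) ONNX graph.
# The same file should then load in both the CPU and GPU builds of ONNX Runtime.
convert(
    framework="pt",                  # use the PyTorch exporter
    model="bert-base-uncased",       # placeholder checkpoint name
    output=Path("onnx/model.onnx"),  # where the ONNX graph is written
    opset=11,                        # ONNX opset version
)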
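
And for point 3, one way to check empirically is to run the same inputs through the exported model on two execution providers and compare the outputs. This sketch assumes onnxruntime-gpu is installed and reuses the hypothetical onnx/model.onnx path from above; you should expect agreement up to small floating point differences, not bit-for-bit equality:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Tokenize a sample sentence and build the feed dict expected by the exported graph
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello, ONNX!", return_tensors="np")
feed = {name: np.asarray(value, dtype=np.int64) for name, value in encoded.items()}

# Run the same model on the CPU and CUDA execution providers
cpu_session = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
gpu_session = ort.InferenceSession("onnx/model.onnx", providers=["CUDAExecutionProvider"])
cpu_embeddings = cpu_session.run(None, feed)[0]
gpu_embeddings = gpu_session.run(None, feed)[0]

# True if the outputs agree within a small numerical tolerance
print(np.allclose(cpu_embeddings, gpu_embeddings, atol=1e-4))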

Hi @lewtun,

Thanks again for the answers.

  1. Are BertOptimizationOptions (e.g. disabling the embedding layer norm optimisation for better model size reduction, as in https://github.com/huggingface/transformers/blob/master/notebooks/04-onnx-export.ipynb) and the mixed precision optimisation hardware-dependent?
  2. Could you be more precise about what you mean by the different hardware accelerators being used?
  3. So I can export my ONNX model on my machine with onnxruntime-gpu installed, then use the CPU version of onnxruntime with the routine you posted to obtain a quantized ONNX model?

Thanks!

Hi @lewtun, whenever you have time I would be glad to have your feedback :slight_smile:

Is there an equivalent to BertOptimizationOptions for ALBERT?