I have tried the convert_graph_to_onnx.py script to convert a transformer model from PyTorch to ONNX format. I have a few questions:
1. I have installed onnxruntime-gpu. Will the model generated with the script only work on GPU, or will it also work with the CPU ONNX Runtime? In other words, do I have to generate one ONNX model per device?
2. Is the ONNX model dependent on the hardware it has been generated on, or do I have to generate the ONNX model on the target hardware where the inference will be run?
3. Are the outputs of the ONNX model identical whatever hardware the inference is run on? In other words, can I use the embeddings generated by the ONNX model across different hardware platforms?
4. How can I apply quantization to the ONNX model for both CPU and GPU devices? It seems that the --quantize flag is deprecated, and I can't manage to apply dynamic quantization to my ONNX model.
Here are a few tentative answers to your questions (I’m somewhat new to using ONNX):
If you are not running optimisations like dynamic quantisation, then the resulting ONNX model should work on both CPU and GPU. You can see in the source code here that convert_graph_to_onnx.py has a convert function that, for PyTorch models, relies on the native ONNX export in PyTorch (torch.onnx); ONNX Runtime is only used in the optimize function here.
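For example, a rough sketch of that path (the model name, opset and paths are just placeholders I picked, and the exact `convert()` signature and the `providers` argument can vary a bit across transformers / onnxruntime versions):

```python
from pathlib import Path

from onnxruntime import InferenceSession
from transformers.convert_graph_to_onnx import convert

# Export the PyTorch weights once with the script's convert() helper.
# The output folder must be new or empty, otherwise convert() aborts.
convert(
    framework="pt",                      # "pt" = PyTorch, "tf" = TensorFlow
    model="bert-base-uncased",
    output=Path("onnx/bert-base-uncased.onnx"),
    opset=12,
    pipeline_name="feature-extraction",
)

# The same .onnx file can then be served by either execution provider,
# depending on which onnxruntime package is installed on the machine.
cpu_session = InferenceSession(
    "onnx/bert-base-uncased.onnx", providers=["CPUExecutionProvider"]
)
# Requires the onnxruntime-gpu package and a CUDA-capable device.
gpu_session = InferenceSession(
    "onnx/bert-base-uncased.onnx", providers=["CUDAExecutionProvider"]
)
```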
Similar to question 1, if you are not applying any optimisations then my understanding is that the resulting model should be hardware-independent (this is meant to be the whole benefit of having a universal format like ONNX :))
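As a quick sanity check, the exported file is just a protobuf graph that you can load and validate with the onnx package on any machine, independent of where the export happened (again assuming no hardware-specific optimisations were applied; the path is a placeholder):

```python
import onnx

# Load the exported protobuf and validate the graph structure. This works
# on any machine, not just the one the export was performed on.
model = onnx.load("onnx/bert-base-uncased.onnx")
onnx.checker.check_model(model)

# What the file records is the opset and the graph signature,
# not anything about the exporting hardware.
print(model.opset_import)
print([i.name for i in model.graph.input])
```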
Interesting question. I don’t know the answer, but my naive guess is that it depends on which runtime / hardware accelerator you’re using (e.g. ONNX Runtime vs something else).
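One way to probe it empirically is to run the same exported model under two execution providers and compare the outputs up to a tolerance rather than exact bit equality (model name, paths and tolerance below are placeholders):

```python
import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = dict(tokenizer("A sentence to embed", return_tensors="np"))
# The exported graph expects int64 token ids.
inputs = {k: v.astype(np.int64) for k, v in inputs.items()}

cpu = InferenceSession(
    "onnx/bert-base-uncased.onnx", providers=["CPUExecutionProvider"]
)
gpu = InferenceSession(
    "onnx/bert-base-uncased.onnx", providers=["CUDAExecutionProvider"]
)

cpu_out = cpu.run(None, inputs)[0]
gpu_out = gpu.run(None, inputs)[0]

# Compare the embeddings within a small tolerance and report the worst case.
print(np.allclose(cpu_out, gpu_out, atol=1e-4))
print(np.abs(cpu_out - gpu_out).max())
```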
As far as I know, quantisation for GPU is not supported (see issue here), but it definitely is for CPU. What kind of trouble are you running into? One thing you can try is using ONNX Runtime directly with an exported ONNX model as follows:
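(A rough sketch with onnxruntime.quantization; the paths are placeholders, and the exact arguments of `quantize_dynamic` have shifted a bit across onnxruntime versions.)

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic (weight-only) quantisation of an already-exported ONNX model.
# QInt8 weights target CPU inference.
quantize_dynamic(
    "onnx/bert-base-uncased.onnx",
    "onnx/bert-base-uncased-quantized.onnx",
    weight_type=QuantType.QInt8,
)
```

The quantized file can then be loaded with InferenceSession just like the original model, so you can benchmark the two side by side on CPU.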