Hi @yanagar25, when you say you cannot run the quantized version, what kind of error are you running into?
Here’s a notebook that explains how to export a pretrained model to the ONNX format: transformers/04-onnx-export.ipynb at master · huggingface/transformers · GitHub
You can also find more details here: Exporting transformers models — transformers 4.2.0 documentation
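To give you a starting point, here's a minimal sketch of the export step using the `convert_graph_to_onnx` helper that ships with `transformers` around v4.2 (the checkpoint name, output path, and opset below are placeholders, so adjust them for your model):

```python
from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Export a pretrained checkpoint to ONNX. The output must be a Path,
# and the parent directory should be empty or non-existent.
convert(
    framework="pt",                          # "pt" for PyTorch, "tf" for TensorFlow
    model="bert-base-cased",                 # any Hub checkpoint or local path
    output=Path("onnx/bert-base-cased.onnx"),
    opset=11,
)
```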
I don’t see an obvious reason why the `generate` method should not work after quantization, so as with most things in deep learning, the best advice is to just try it and see if it does 🙂
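For instance, a quick sanity check with PyTorch's dynamic quantization might look like this (a sketch only; I'm assuming a seq2seq checkpoint here, `t5-small` is just a placeholder for whatever model you're actually using):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Dynamic quantization converts the nn.Linear weights to int8;
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# If generate() runs here, quantization didn't break it.
inputs = tokenizer("translate English to German: Hello world", return_tensors="pt")
outputs = quantized.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If that fails for you, posting the full stack trace here would help narrow down what's going wrong.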