GPT-2 inference with ONNX and quantization

Hey guys,
I’ve managed to create a quantized version of GPT-2 using ONNX Runtime, but I don’t seem to be able to run it for some reason. Does anyone have a tutorial for it? Also, how would the model’s “generate” method work with that? Any ideas?

Hi @yanagar25, when you say you cannot run the quantized version, what kind of error are you running into?

Here’s a notebook that explains how to export a pretrained model to the ONNX format: transformers/04-onnx-export.ipynb at master · huggingface/transformers · GitHub

You can also find more details here: Exporting transformers models — transformers 4.2.0 documentation
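For reference, the core of that export notebook boils down to something like this (a minimal sketch; the output path is just an example, and the target folder should be empty):

```python
from pathlib import Path
from transformers.convert_graph_to_onnx import convert

# Export a pretrained GPT-2 checkpoint to ONNX.
# "onnx/gpt2.onnx" is only an example path.
convert(
    framework="pt",        # export from the PyTorch weights
    model="gpt2",
    output=Path("onnx/gpt2.onnx"),
    opset=11,
)
```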

I don’t see an obvious reason why the generate method should not work after quantization, so, as with most things in deep learning, the best advice is to just try it and see if it does :slight_smile:
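In case it helps with debugging, dynamic quantization plus a quick sanity check that the quantized graph actually loads might look something like this (the file names are just placeholders for wherever your exported model lives):

```python
import onnxruntime
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (weight-only) quantization of the exported graph.
quantize_dynamic(
    "onnx/gpt2.onnx",            # input: the float32 ONNX export
    "onnx/gpt2-quantized.onnx",  # output: the int8 weights version
    weight_type=QuantType.QInt8,
)

# Sanity check: can the quantized graph be loaded, and what inputs does it expect?
session = onnxruntime.InferenceSession("onnx/gpt2-quantized.onnx")
print([inp.name for inp in session.get_inputs()])
```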

Hey, @lewtun

I’m sorry if I’m asking something that has been answered before, but in order to run the quantized model I need to run it with onnxruntime.InferenceSession. How can that be combined with using the generate method? From what I understand, I need to copy the entire logic of the generate method and, instead of using self(...), use session.run(None, ort_inputs). Please correct me if I’m wrong.

Ah now I understand better what you’re trying to achieve. Indeed you might have to write your own generate method so that you can integrate the InferenceSession - there’s an example of doing text generation with GPT-2 in the ONNX repo here: onnxruntime/Inference_GPT2_with_OnnxRuntime_on_CPU.ipynb at master · microsoft/onnxruntime · GitHub

You could just adapt their approach to include the generation strategy you need (beam search, sampling, etc.).
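To give a rough idea, a bare-bones greedy decoding loop around the InferenceSession could look like the sketch below. It assumes your exported graph takes input_ids and attention_mask and returns the LM logits as its first output, and it re-feeds the whole sequence at every step instead of reusing past key values the way the notebook does, so it’s simpler but slower:

```python
import numpy as np
import onnxruntime
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Placeholder path; assumed inputs: input_ids, attention_mask; first output: logits.
session = onnxruntime.InferenceSession("onnx/gpt2-quantized.onnx")

input_ids = tokenizer.encode("Hello, my name is", return_tensors="np")
attention_mask = np.ones_like(input_ids)

for _ in range(20):  # generate 20 new tokens greedily
    ort_inputs = {
        "input_ids": input_ids.astype(np.int64),
        "attention_mask": attention_mask.astype(np.int64),
    }
    logits = session.run(None, ort_inputs)[0]       # (batch, seq_len, vocab)
    next_token = logits[:, -1, :].argmax(axis=-1)   # greedy pick for the last position
    input_ids = np.concatenate([input_ids, next_token[:, None]], axis=-1)
    attention_mask = np.ones_like(input_ids)

print(tokenizer.decode(input_ids[0]))
```

Swapping the argmax for sampling or a beam search would give you the other generation strategies.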


Thank you so much for the reply! :slight_smile:


FYI there’s a nice section in the docs that explains the various text generation strategies and how they’re implemented: Utilities for Generation — transformers 4.2.0 documentation


I will definitely look into it! Thank you again :slight_smile:
