I have a GPT-2 style model that I’m trying to quantize using the ONNX quantizer. When exported to ONNX (using the export=True argument to ORTModelForCausalLM.from_pretrained), it produces multiple model files: decoder_model.onnx and decoder_with_past_model.onnx. My understanding is that decoder_with_past_model is used for inference with cached keys and values, which is important for my application.
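For context, my export step looks roughly like this (the checkpoint name is a placeholder for my actual model):

from optimum.onnxruntime import ORTModelForCausalLM

save_directory = "20M-onnx-static"

# Export the GPT-2 style checkpoint to ONNX; this writes both
# decoder_model.onnx and decoder_with_past_model.onnx into save_directory.
model = ORTModelForCausalLM.from_pretrained("my-gpt2-checkpoint", export=True)
model.save_pretrained(save_directory)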
Trying to set up a quantizer for the entire export directory like this fails:
quantizer = ORTQuantizer.from_pretrained(save_directory)
RuntimeError: Found too many ONNX model files in 20M-onnx-static. ORTQuantizer does not support multi-file quantization. Please create separate ORTQuantizer instances for each model/file, by passing the argument `file_name` to ORTQuantizer.from_pretrained().
My understanding is that I can build a quantizer for each constituent model like this:
quantizer = ORTQuantizer.from_pretrained(save_directory, file_name="decoder_model.onnx")
I’m able to get this process to work for the regular decoder_model: I just need to set up a dataset that produces batches of input_ids and attention_mask values, matching the decoder model’s input format, and run calibration. That part works great.
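Roughly, the working setup for decoder_model looks like this (the calibration dataset and quantization config here are just placeholders for what I’m actually using):

from functools import partial

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig

save_directory = "20M-onnx-static"

tokenizer = AutoTokenizer.from_pretrained(save_directory)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default

quantizer = ORTQuantizer.from_pretrained(save_directory, file_name="decoder_model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=True, per_channel=False)

def preprocess_fn(examples, tokenizer):
    # Produces input_ids and attention_mask, matching decoder_model's inputs.
    return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True)

calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)
quantizer.quantize(
    save_dir=save_directory + "-quantized",
    quantization_config=qconfig,
    calibration_tensors_range=ranges,
)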
What I don’t understand is how to do the same process for the decoder_with_past_model. Simply following the same procedure as the decoder_model case produces this error:
ValueError: Model requires 14 inputs. Input Feed contains 2
Presumably, these 14 inputs include the cached keys and values? How can I figure out what the expected names and values of these extra inputs are? Is there a way to do this with the standard ORTQuantizer pipeline, or do I need to manually construct batches of cached activations?
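(I assume I can at least list the expected input names by loading the exported file directly, something like the snippet below, but that still doesn’t tell me how to produce the cached key/value tensors inside the standard calibration pipeline.)

import onnxruntime as ort

# Print the name, shape, and type of every input the model expects.
sess = ort.InferenceSession("20M-onnx-static/decoder_with_past_model.onnx")
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)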