Static quantization of GPT-2-style models with ORTQuantizer

I have a GPT-2-style model that I’m trying to quantize with the ONNX quantizer (ORTQuantizer). When exported to ONNX (using the export=True argument to ORTModelForCausalLM.from_pretrained), it produces multiple model files: decoder_model.onnx and decoder_with_past_model.onnx. My understanding is that decoder_with_past_model is used for inference with cached keys and values, which is important for my application.
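
For reference, the export step itself is straightforward; roughly this (the checkpoint name is just a placeholder for my own model, and the output directory is the one referenced in the error below):

```python
from optimum.onnxruntime import ORTModelForCausalLM

# Export the PyTorch checkpoint to ONNX; this produces decoder_model.onnx
# and decoder_with_past_model.onnx inside the save directory.
model = ORTModelForCausalLM.from_pretrained("my-gpt2-style-checkpoint", export=True)
model.save_pretrained("20M-onnx-static")
```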

Trying to set up a quantizer for the entire export directory like this fails:

quantizer = ORTQuantizer.from_pretrained(save_directory)
RuntimeError: Found too many ONNX model files in 20M-onnx-static. ORTQuantizer does not support multi-file quantization. Please create separate ORTQuantizer instances for each model/file, by passing the argument `file_name` to ORTQuantizer.from_pretrained().

My understanding is that I can build a quantizer for each constituent model like this:

quantizer = ORTQuantizer.from_pretrained(save_directory, file_name="decoder_model.onnx")

I’m able to get this process to work for the regular decoder_model: I just need to set up a dataset that produces batches of input_ids and attention_mask values (matching the decoder model’s input signature) and run calibration. That part works great.
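
In case it’s useful, the working decoder_model setup looks roughly like this (the calibration dataset, column name, and quantization/calibration choices here are just what I happened to use, so treat them as placeholders):

```python
from functools import partial

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig

save_directory = "20M-onnx-static"
tokenizer = AutoTokenizer.from_pretrained(save_directory)  # assumes the tokenizer was saved alongside the export

quantizer = ORTQuantizer.from_pretrained(save_directory, file_name="decoder_model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=True, per_channel=False)

def preprocess_fn(examples, tokenizer):
    # Produces exactly the two inputs decoder_model.onnx expects: input_ids and attention_mask.
    return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True)

calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)

calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)

quantizer.quantize(
    save_dir="20M-onnx-static-quantized",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)
```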

What I don’t understand is how to do the same process for the decoder_with_past_model. Simply following the same procedure as the decoder_model case produces this error:

ValueError: Model requires 14 inputs. Input Feed contains 2

Presumably, these 14 inputs include the cached keys and values? How can I figure out what the expected names and values of these extra inputs are? Is there a way to do this with the standard ORTQuantizer pipeline, or do I need to manually construct batches of cached activations?

Hi @Imnimo, for now, quantizing decoder models with past key values is not trivial. It will become much easier once we have a single exported file.

In the meantime, there are two options:

  1. Use dynamic quantization by setting is_static=False in your quantization config.
  2. If you want to perform static quantization as you tried to do, you can see here how the input names are defined and there how to generate dummy inputs (a rough sketch of inspecting the exported model’s inputs directly is below). But it is not straightforward :frowning:
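
For option 2, a generic way to see what decoder_with_past_model.onnx expects, and to fabricate past key/value tensors by hand, is to read the graph inputs straight from the ONNX file. A minimal sketch that doesn’t use the optimum helpers linked above; the input names and shapes are assumptions about a GPT-2-style export, so check them against what the loop prints:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoConfig

onnx_path = "20M-onnx-static/decoder_with_past_model.onnx"
session = ort.InferenceSession(onnx_path)

# Print every graph input with its (possibly symbolic) shape, e.g. something like
# past_key_values.0.key -> ['batch_size', num_heads, 'past_sequence_length', head_dim]
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Fabricate past tensors from the model config, assuming the usual GPT-2-style layout
# [batch, num_heads, past_seq_len, head_dim]; adjust to the shapes printed above.
config = AutoConfig.from_pretrained("20M-onnx-static")
batch_size, past_len = 1, 8
head_dim = config.hidden_size // config.num_attention_heads

dummy_feed = {
    "input_ids": np.ones((batch_size, 1), dtype=np.int64),
    "attention_mask": np.ones((batch_size, past_len + 1), dtype=np.int64),
}
for inp in session.get_inputs():
    if inp.name.startswith("past_key_values"):
        dummy_feed[inp.name] = np.zeros(
            (batch_size, config.num_attention_heads, past_len, head_dim), dtype=np.float32
        )

outputs = session.run(None, dummy_feed)
```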

Hope that helps!


Thanks for the reply!

I had already gotten dynamic quantization working (very smooth, worked great!). It might turn out that the gains from static quantization are not worth the hassle at the moment, but I’ll poke around and see if I can get something working without too much trouble.
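
For anyone landing here later, the dynamic setup was roughly this (the instruction-set choice and output directory are just what I happened to use):

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

save_directory = "20M-onnx-static"

# Dynamic quantization needs no calibration dataset, so each exported file
# can be quantized independently with its own ORTQuantizer instance.
dqconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
for file_name in ("decoder_model.onnx", "decoder_with_past_model.onnx"):
    quantizer = ORTQuantizer.from_pretrained(save_directory, file_name=file_name)
    quantizer.quantize(save_dir="20M-onnx-dynamic", quantization_config=dqconfig)
```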


Hi, did you manage to get the static quantization to work?
I’m working on something similar right now and having some issues with static quantization too (dynamic was easy indeed).