I have a GPT-2 style model that I’m trying to quantize using the ONNX quantizer. When exported to ONNX (using the export=True argument to ORTModelForCausalLM.from_pretrained), it produces multiple model files: decoder_model.onnx and decoder_with_past_model.onnx. My understanding is that decoder_with_past_model is used for inference with cached keys and values, which is important for my application.
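For context, my export step looks roughly like this (the checkpoint name is a placeholder for my actual model):

from optimum.onnxruntime import ORTModelForCausalLM

save_directory = "20M-onnx-static"

# Export the GPT-2 style checkpoint to ONNX; this writes both
# decoder_model.onnx and decoder_with_past_model.onnx into save_directory.
model = ORTModelForCausalLM.from_pretrained("my-gpt2-checkpoint", export=True)
model.save_pretrained(save_directory)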
Trying to set up a quantizer for the entire export directory like this fails:
quantizer = ORTQuantizer.from_pretrained(save_directory)
RuntimeError: Found too many ONNX model files in 20M-onnx-static. ORTQuantizer does not support multi-file quantization. Please create separate ORTQuantizer instances for each model/file, by passing the argument `file_name` to ORTQuantizer.from_pretrained().
My understanding is that I can build a quantizer for each constituent model like this:
quantizer = ORTQuantizer.from_pretrained(save_directory, file_name="decoder_model.onnx")
I’m able to get this process to work for the regular decoder_model: I just need to set up a dataset that produces batches of input_ids and attention_mask values, matching the decoder model’s input format, and run calibration. That part works great.
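Roughly, the working setup for decoder_model looks like this (the calibration dataset and quantization config here are just placeholders for what I’m actually using):

from functools import partial

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig

save_directory = "20M-onnx-static"

tokenizer = AutoTokenizer.from_pretrained(save_directory)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default

quantizer = ORTQuantizer.from_pretrained(save_directory, file_name="decoder_model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=True, per_channel=False)

def preprocess_fn(examples, tokenizer):
    # Produces input_ids and attention_mask, matching decoder_model's inputs.
    return tokenizer(examples["text"], padding="max_length", max_length=128, truncation=True)

calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)
quantizer.quantize(
    save_dir=save_directory + "-quantized",
    quantization_config=qconfig,
    calibration_tensors_range=ranges,
)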
What I don’t understand is how to do the same process for the decoder_with_past_model. Simply following the same procedure as the decoder_model case produces this error:
ValueError: Model requires 14 inputs. Input Feed contains 2
Presumably, these 14 inputs include the cached keys and values? How can I figure out what the expected names and values of these extra inputs are? Is there a way to do this with the standard ORTQuantizer pipeline, or do I need to manually construct batches of cached activations?
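(I assume I can at least list the expected input names by loading the exported file directly, something like the snippet below, but that still doesn’t tell me how to produce the cached key/value tensors inside the standard calibration pipeline.)

import onnxruntime as ort

# Print the name, shape, and type of every input the model expects.
sess = ort.InferenceSession("20M-onnx-static/decoder_with_past_model.onnx")
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)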