ONNX on GPU memory footprint

I am using ONNX for GPU inference with GPT models. Although the exported models take up less space on disk than their PyTorch counterparts, their GPU memory footprint is larger. Is this normal?

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = 'gpt2'
save_directory = 'gpt2_onnx'  # placeholder paths
save_directory_quantized = 'gpt2_onnx_quantized'

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Export the model to ONNX and load it on the GPU
model = ORTModelForCausalLM.from_pretrained(model_name, from_transformers=True, provider='CUDAExecutionProvider')
# Save the ONNX model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

# Define the quantization methodology
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(save_directory, file_name='decoder_model.onnx')
# Apply dynamic quantization on the model
quantizer.quantize(save_dir=save_directory_quantized, quantization_config=qconfig)
model = ORTModelForCausalLM.from_pretrained(save_directory_quantized, file_name='decoder_model_quantized.onnx', provider='CUDAExecutionProvider')
```

Actually, to work the script above should quantize `decoder_with_past_model.onnx` rather than `decoder_model.onnx`. Either way, the issue is the same.

Hi @ialuronico,

For decoding tasks, Optimum currently loads two decoder models (one without and one with pre-computed key/values) due to a tracing constraint. This means users need to load both models for decoding (if they want to leverage the pre-computed key/value cache), which generates an extra memory footprint.
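A back-of-envelope sketch of why this roughly doubles the weight memory on the GPU. The ~124M parameter count for GPT-2 (small) and the fp32 (4 bytes/parameter) assumption are illustrative, not measured values:

```python
# Rough estimate of decoder weight memory when Optimum loads
# both decoder_model.onnx and decoder_with_past_model.onnx.
# Assumptions (illustrative): ~124M params for GPT-2 small, fp32 weights.

def decoder_memory_mib(num_params: float, bytes_per_param: int, num_copies: int) -> float:
    """Approximate weight memory in MiB for `num_copies` decoder graphs."""
    return num_copies * num_params * bytes_per_param / 2**20

GPT2_PARAMS = 124e6  # approximate GPT-2 (small) parameter count

one_decoder = decoder_memory_mib(GPT2_PARAMS, 4, 1)   # decoder only
two_decoders = decoder_memory_mib(GPT2_PARAMS, 4, 2)  # decoder + decoder-with-past

print(f"one decoder:  ~{one_decoder:.0f} MiB")
print(f"two decoders: ~{two_decoders:.0f} MiB")
```

The second copy holds the same weights, which is exactly the duplication the decoder-merging work removes.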

We are working on reducing the memory footprint by merging the two decoders (#587, merged), and there is a PR in progress to use the merged decoder for inference; you can track the progress here (#647).
