ONNX on GPU memory footprint

I am using ONNX for GPU inference with GPT models. Although the exported models take up less space on disk than their PyTorch counterparts, their GPU memory footprint is larger. Is this normal?

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = 'gpt2'
save_directory = 'gpt2_onnx'  # placeholder paths
save_directory_quantized = 'gpt2_onnx_quantized'

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Export the model to ONNX and load it on the GPU
model = ORTModelForCausalLM.from_pretrained(model_name, from_transformers=True, provider='CUDAExecutionProvider')
# Save the ONNX model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

# Define the quantization methodology
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(save_directory, file_name='decoder_model.onnx')
# Apply dynamic quantization on the model
quantizer.quantize(save_dir=save_directory_quantized, quantization_config=qconfig)
model = ORTModelForCausalLM.from_pretrained(save_directory_quantized, file_name='decoder_model_quantized.onnx', provider='CUDAExecutionProvider')
```

Actually, to work the script above should quantize `decoder_with_past_model.onnx` rather than `decoder_model.onnx`. Either way, the issue is the same.

Hi @ialuronico,

For decoding tasks, Optimum currently loads two decoder models (one without and one with pre-computed key/values) due to a tracing constraint. This means users need to load both models for decoding (if they want to leverage the pre-computed key/value cache), which generates an extra memory footprint.
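A back-of-envelope sketch of why this roughly doubles the weight memory on the GPU. The ~124M parameter count for GPT-2 (small) and the fp32 (4 bytes/parameter) assumption are illustrative, not measured values:

```python
# Rough estimate of decoder weight memory when Optimum loads
# both decoder_model.onnx and decoder_with_past_model.onnx.
# Assumptions (illustrative): ~124M params for GPT-2 small, fp32 weights.

def decoder_memory_mib(num_params: float, bytes_per_param: int, num_copies: int) -> float:
    """Approximate weight memory in MiB for `num_copies` decoder graphs."""
    return num_copies * num_params * bytes_per_param / 2**20

GPT2_PARAMS = 124e6  # approximate GPT-2 (small) parameter count

one_decoder = decoder_memory_mib(GPT2_PARAMS, 4, 1)   # decoder only
two_decoders = decoder_memory_mib(GPT2_PARAMS, 4, 2)  # decoder + decoder-with-past

print(f"one decoder:  ~{one_decoder:.0f} MiB")
print(f"two decoders: ~{two_decoders:.0f} MiB")
```

The second copy holds the same weights, which is exactly the duplication the decoder-merging work removes.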

We are working on reducing the memory footprint by merging the two decoders (#587, merged), and there is a PR in progress to use the merged decoder for inference; you can track the progress here (#647).
