ONNX on GPU memory footprint

I am using ONNX for inference on GPU with GPT models. Even though the ONNX models take up less space on disk than the PyTorch models, their GPU memory footprint is larger. Is this normal?

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = 'gpt2'
save_directory = 'gpt2_onnx'                      # placeholder output directory for the exported ONNX model
save_directory_quantized = 'gpt2_onnx_quantized'  # placeholder output directory for the quantized model

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Export the PyTorch model to ONNX and load it on GPU
model = ORTModelForCausalLM.from_pretrained(model_name, from_transformers=True, provider='CUDAExecutionProvider')
# Save the ONNX model and tokenizer
model.save_pretrained(save_directory)
# tokenizer.save_pretrained(save_directory)

# Define the quantization methodology
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(save_directory, file_name='decoder_model.onnx')
# Apply dynamic quantization on the model
quantizer.quantize(save_dir=save_directory_quantized, quantization_config=qconfig)
model = ORTModelForCausalLM.from_pretrained(save_directory_quantized, file_name='decoder_model_quantized.onnx', provider='CUDAExecutionProvider')
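To quantify the difference, GPU memory usage can be checked before and after loading the ONNX model, for example with pynvml (a minimal sketch; pynvml and the report_gpu_memory helper are illustrative additions, not part of the original script):

import pynvml

def report_gpu_memory(tag):
    # Query total used memory on GPU 0 via NVML (the same counter nvidia-smi reports)
    pynvml.nvmlInit()
    used_mb = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).used / 1024**2
    pynvml.nvmlShutdown()
    print(f'{tag}: {used_mb:.0f} MiB used on GPU 0')

report_gpu_memory('before loading ONNX model')
onnx_model = ORTModelForCausalLM.from_pretrained(save_directory, provider='CUDAExecutionProvider')
report_gpu_memory('after loading ONNX model')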

Actually, the script above needs decoder_with_past_model.onnx rather than decoder_model.onnx in order to work. Either way, the issue stays the same.
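For example, pointing the quantizer at that file would look like this (a sketch; the output name decoder_with_past_model_quantized.onnx assumes the default _quantized suffix that ORTQuantizer adds):

# Quantize the decoder that consumes pre-computed key/values
quantizer = ORTQuantizer.from_pretrained(save_directory, file_name='decoder_with_past_model.onnx')
quantizer.quantize(save_dir=save_directory_quantized, quantization_config=qconfig)
model = ORTModelForCausalLM.from_pretrained(save_directory_quantized, file_name='decoder_with_past_model_quantized.onnx', provider='CUDAExecutionProvider')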

Hi @ialuronico,

For decoding tasks, Optimum currently loads two decoder models (a decoder without past and a decoder with past) due to a constraint of tracing. This means users need to load both models for decoding (if they want to leverage pre-computed keys/values), which generates an extra memory footprint.
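A quick way to see this is to list the exported files and to load the model with the cache disabled, which skips the second decoder (a sketch; the use_cache=False behaviour described in the comments is my reading of this Optimum version, not something stated above):

import os

# The export produces two decoder graphs: one without and one with past key/values
print(sorted(f for f in os.listdir(save_directory) if f.endswith('.onnx')))
# e.g. ['decoder_model.onnx', 'decoder_with_past_model.onnx']

# With use_cache=True (the default) both graphs are loaded as separate ONNX Runtime
# sessions, so the decoder weights are held twice on the GPU. Loading with
# use_cache=False keeps only decoder_model.onnx, trading memory for slower generation.
model = ORTModelForCausalLM.from_pretrained(save_directory, use_cache=False, provider='CUDAExecutionProvider')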

We are working on reducing the memory footprint by merging the two decoders (#587, merged), and there is a PR in progress to use the merged decoder for inference; you can track the progress in #647.
