ONNX Flan-T5 Model OOM on GPU

I have fine-tuned a private variant of Flan-T5 XXL and exported it locally using Optimum:

python -m transformers.onnx --model=model_id --feature=seq2seq-lm-with-past --framework=pt .

Executing this command has generated the following files:

  • config.json
  • decoder_model.onnx_data (25GB)
  • decoder_with_past_model.onnx_data (22GB)
  • encoder_model.onnx_data (18GB)
  • spiece.model
  • tokenizer_config.json
  • decoder_model.onnx
  • decoder_with_past_model.onnx
  • encoder_model.onnx
  • special_tokens_map.json
  • tokenizer.json

So far so good.

I am now trying to load this model onto a GPU using provider='CUDAExecutionProvider' and am running out of memory on an A100 (80GB). I have tried creating inference sessions for encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx.

from onnxruntime import InferenceSession

encoder_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/encoder_model.onnx", providers=['CUDAExecutionProvider'])
decoder_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_model.onnx", providers=['CUDAExecutionProvider'])
decoder_with_past_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx", providers=['CUDAExecutionProvider'])

encoder_sess and decoder_sess are created with no problem. decoder_with_past_sess results in the following error:

Traceback (most recent call last):
  File "generate-mirage-onnx-alt.py", line 58, in <module>
    decoder_with_past_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx", providers=['CUDAExecutionProvider'])
  File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 167772160

Thus far I have not applied any optimizations during the export of the model to ONNX. Nonetheless, the runaway memory usage of decoder_with_past_model.onnx is rather strange.

@echarlaix any tips or suggestions would be greatly appreciated.

In the end I found out that I was not encountering a bug; my GPU was genuinely out of memory.

This was a novice misunderstanding on my part, as I am new to ONNX. For others who are also new to ONNX: the GPU memory requirements can be 4-5x those of running the same model in PyTorch format.

@eusip Hi, part of the issue is that a few months ago we were using two separate .onnx files for the decoder without/with past, effectively doubling the memory requirement for it. If you re-export your model on the latest Optimum version with optimum-cli export onnx, you should find a decoder_model_merged.onnx that is automatically used in ORTModelForCausalLM/ORTModelForSeq2SeqLM. Hopefully it can help a bit with memory usage.
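
Roughly, the re-export would look something like the sketch below. The paths are placeholders for your setup, and the exact --task name can differ across Optimum versions (older releases used seq2seq-lm-with-past; recent ones can usually infer the task automatically):

optimum-cli export onnx --model /path/to/finetuned-flan-t5-xxl --task text2text-generation-with-past /path/to/onnx-output/

The merged export can then be loaded directly on GPU:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

# ORTModelForSeq2SeqLM picks up decoder_model_merged.onnx automatically
# when it is present in the export directory.
model = ORTModelForSeq2SeqLM.from_pretrained(
    "/path/to/onnx-output/",
    provider="CUDAExecutionProvider",
)
tokenizer = AutoTokenizer.from_pretrained("/path/to/onnx-output/")

inputs = tokenizer("Translate English to German: How old are you?", return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))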

But it is true that, overall, ONNX Runtime is a pain when it comes to memory (especially on GPU), see for example: [Performance] Find out why the GPU memory allocated with `CUDAExecutionProvider` is much larger than the ONNX size · Issue #14526 · microsoft/onnxruntime · GitHub
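
As a partial workaround, the CUDA provider's arena behavior can at least be configured when creating a session. A sketch, using your decoder_with_past path from above; the memory limit is an arbitrary example value, and this only changes allocator behavior, not the model's actual footprint:

import onnxruntime as ort

# Grow the BFC arena only as much as each request needs instead of doubling,
# and cap it explicitly; the 60 GiB limit below is an arbitrary example.
cuda_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "gpu_mem_limit": 60 * 1024 * 1024 * 1024,
}

sess = ort.InferenceSession(
    "/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)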
