ONNX Flan-T5 Model OOM on GPU

I have finetuned a private variant of Flan-T5 XXL and exported it locally using Optimum.

python -m transformers.onnx --model=model_id --feature=seq2seq-lm-with-past --framework=pt .

Executing this command has generated the following files:

  • config.json
  • decoder_model.onnx_data (25GB)
  • decoder_with_past_model.onnx_data (22GB)
  • encoder_model.onnx_data (18GB)
  • spiece.model
  • tokenizer_config.json
  • decoder_model.onnx
  • decoder_with_past_model.onnx
  • encoder_model.onnx
  • special_tokens_map.json
  • tokenizer.json

So far so good.

I am now trying to load this model onto a GPU using provider='CUDAExecutionProvider' and I am running out of memory on an A100 (80GB). I have tried creating inference sessions for encoder_model.onnx, decoder_model.onnx, and decoder_with_past_model.onnx.

encoder_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/encoder_model.onnx", providers=['CUDAExecutionProvider'])
decoder_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_model.onnx", providers=['CUDAExecutionProvider'])
decoder_with_past_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx", providers=['CUDAExecutionProvider'])

encoder_sess and decoder_sess are created without issue; creating decoder_with_past_sess results in the following error:

Traceback (most recent call last):
  File "generate-mirage-onnx-alt.py", line 58, in <module>
    decoder_with_past_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx", providers=['CUDAExecutionProvider'])
  File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 167772160

Thus far I have not applied any optimizations during the export of the model to ONNX. Nonetheless, the runaway memory usage of decoder_with_past_model.onnx is rather strange.
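One thing worth trying before digging further (a sketch, not something I have verified on this model): ONNX Runtime's CUDAExecutionProvider accepts documented provider options such as gpu_mem_limit and arena_extend_strategy, which cap and slow the growth of the BFC arena that appears in the traceback. The 40 GB figure below is an arbitrary example, not a recommendation:

```python
# Cap ONNX Runtime's CUDA memory arena so the three sessions do not each
# reserve a large chunk of the 80 GB up front. gpu_mem_limit and
# arena_extend_strategy are documented CUDAExecutionProvider options.
GIB = 1024 ** 3

cuda_provider = (
    "CUDAExecutionProvider",
    {
        "device_id": 0,
        "gpu_mem_limit": 40 * GIB,                    # hard cap, in bytes
        "arena_extend_strategy": "kSameAsRequested",  # grow only as needed
    },
)

# CPUExecutionProvider as a fallback for anything that does not fit on the GPU:
providers = [cuda_provider, "CPUExecutionProvider"]

# Then pass the list when building the session, e.g.:
# decoder_with_past_sess = InferenceSession(
#     "/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx",
#     providers=providers,
# )
```

Note that capping the arena only helps if the model's weights and activations actually fit under the cap; it will not make a genuinely too-large graph load.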

@echarlaix any tips or suggestions would be greatly appreciated.

In the end I found out that I was not encountering a bug; my GPU was genuinely out of memory.

This was a novice misunderstanding on my part, as I am new to ONNX. For others just starting out: the memory requirements can be roughly 4-5x those of running the same model in PyTorch format.

@eusip Hi, part of the issue is that until a few months ago we used two separate .onnx files for the decoder without/with past, effectively doubling its memory requirement. If you re-export your model on the latest Optimum version with optimum-cli export onnx, you should find a decoder_model_merged.onnx that is used automatically by ORTModelForCausalLM/ORTModelForSeq2SeqLM. Hopefully that helps a bit with memory usage.
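For reference, a re-export along those lines might look like the command below (the model path and task name are illustrative; check optimum-cli export onnx --help on your Optimum version for the exact task list):

```shell
# Sketch: re-export with a recent Optimum so the decoder is emitted as a
# single merged graph (decoder_model_merged.onnx) instead of two files.
# Model path and task name are placeholders.
optimum-cli export onnx \
  --model /path/to/finetuned-flan-t5-xxl \
  --task text2text-generation-with-past \
  onnx_merged/
```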

But it is true that, overall, ONNX Runtime is a pain when it comes to memory (especially on GPU), see for example: [Performance] Find out why the GPU memory allocated with `CUDAExecutionProvider` is much larger than the ONNX size · Issue #14526 · microsoft/onnxruntime · GitHub