I have fine-tuned a private variant of Flan-T5 XXL and exported it locally to ONNX using Optimum (via the `transformers.onnx` entry point):

```
python -m transformers.onnx --model=model_id --feature=seq2seq-lm-with-past --framework=pt .
```
Executing this command has generated the following files:
- config.json
- decoder_model.onnx_data (25GB)
- decoder_with_past_model.onnx_data (22GB)
- encoder_model.onnx_data (18GB)
- spiece.model
- tokenizer_config.json
- decoder_model.onnx
- decoder_with_past_model.onnx
- encoder_model.onnx
- special_tokens_map.json
- tokenizer.json
So far so good.
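(As an aside, I believe the standalone Optimum CLI exposes an equivalent export. A sketch, assuming a recent `optimum` install; note the task name is `seq2seq-lm-with-past` in older releases and `text2text-generation-with-past` in newer ones:)

```
optimum-cli export onnx --model model_id --task seq2seq-lm-with-past .
```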
I am now trying to load this model onto a GPU using `provider="CUDAExecutionProvider"` and I run out of memory on an A100 (80GB). I have tried creating inference sessions for encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx:
```python
from onnxruntime import InferenceSession

encoder_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/encoder_model.onnx", providers=['CUDAExecutionProvider'])
decoder_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_model.onnx", providers=['CUDAExecutionProvider'])
decoder_with_past_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx", providers=['CUDAExecutionProvider'])
```
encoder_sess and decoder_sess are created without issue; creating decoder_with_past_sess fails with the following error:
```
Traceback (most recent call last):
  File "generate-mirage-onnx-alt.py", line 58, in <module>
    decoder_with_past_sess = InferenceSession("/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx", providers=['CUDAExecutionProvider'])
  File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/opt/conda/envs/accelerate/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 167772160
```
So far I have not run any optimizations during the export to ONNX. Even so, the runaway memory usage of decoder_with_past_model.onnx is rather strange: the request that actually fails is only 167772160 bytes (160 MiB), which suggests ONNX Runtime's BFC arena has already grown to consume nearly all of the 80GB before this allocation is attempted.
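One thing I plan to try is capping the CUDA execution provider's memory arena through provider options. A minimal sketch; the `gpu_mem_limit` value below is just a placeholder I would tune, and I don't yet know whether any cap can work with the encoder and decoder sessions already resident:

```python
import onnxruntime as ort

# Placeholder cap; each InferenceSession gets its own BFC arena on the device.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 24 * 1024 * 1024 * 1024,     # bytes; value to be tuned
    "arena_extend_strategy": "kSameAsRequested",  # grow the arena only by what is requested
}

decoder_with_past_sess = ort.InferenceSession(
    "/mnt/training/mirage-onnx/no_opt/decoder_with_past_model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```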
@echarlaix any tips or suggestions would be greatly appreciated.
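For completeness, the higher-level fallback I'm aware of is letting Optimum build the sessions itself via `ORTModelForSeq2SeqLM`, although I'd expect it to go through the same allocator. A minimal sketch, assuming `optimum` with the `onnxruntime-gpu` backend is installed (the prompt is purely illustrative):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_dir = "/mnt/training/mirage-onnx/no_opt"
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Loads encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx
# from model_dir and places the sessions on the GPU.
model = ORTModelForSeq2SeqLM.from_pretrained(model_dir, provider="CUDAExecutionProvider")

inputs = tokenizer("summarize: a short test input", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```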