Potential Memory Leak for ORTModelForCausalLM with TensorRT Provider

Description

Hi there,

First, thank you for the Optimum library, it really works well. However, I may have found a potential memory leak while working with ORTModelForCausalLM and the TensorRT provider.

We closely follow the Optimum TensorRT guide from the user documentation and run the following code:

import onnxruntime
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL

provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_path,  # engine cache directory, defined elsewhere
    "trt_int8_enable": True,
}

tok = AutoTokenizer.from_pretrained(onnx_path, use_fast=True)  # onnx_path: exported model directory
ort_model = ORTModelForCausalLM.from_pretrained(
    onnx_path,
    export=False,
    provider="TensorrtExecutionProvider",
    use_cache=False,
    session_options=session_options,
    provider_options=provider_options,
    log_severity_level=0,
)

Using this snippet, the ONNX model is converted to a TensorRT engine (probably via trtexec in the background); however, this requires a massive amount of memory (more than 125 GB of RAM for a 350-million-parameter model). Is there any reason for that, or is that expected?
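For completeness, inference with the loaded model afterwards is just a regular generate call; a minimal sketch (the prompt and generation arguments below are only placeholders):

inputs = tok("def hello_world():", return_tensors="pt").to(ort_model.device)
# with use_cache=False every decoding step recomputes the full context
generated = ort_model.generate(**inputs, max_new_tokens=32)
print(tok.batch_decode(generated, skip_special_tokens=True)[0])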

Environment

Docker Image: nvcr.io/nvidia/pytorch:22.12-py3
TensorRT version: 8.501
Python: 3.8.10
Pip packages:
transformers 4.26.1
optimum 1.7.1
onnx 1.12.0
onnxruntime-gpu 1.14.1
pytorch-quantization 2.1.2
pytorch-triton 2.0.0+b8b470bc59
torch 1.13.1
torch-tensorrt 1.3.0
torchtext 0.13.0a0+fae8e8c
torchvision 0.15.0a0

Steps To Reproduce

  1. Convert the transformer model to ONNX via optimum-cli.
  2. Do the quantization (we did it exactly like in the quantization guide).
  3. Run Infer_shape.py as suggested by trtexec (see the sketch after this list).
  4. Run the code from above
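For step 3, a rough sketch of the shape inference call (placeholder paths; this assumes the script in question is ONNX Runtime's symbolic shape inference helper, invoked here through its Python API, and the exact module path may differ between onnxruntime versions):

import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("decoder_model.onnx")
# propagate symbolic shapes so all intermediate tensor shapes are known before TensorRT sees the graph
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "decoder_model_shape_inferred.onnx")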

Thank you very much for any help!

Hi @cdawg, that does sound like a lot of memory for a 350M-parameter model!
Have you had the chance to compare with a native TRT way of building the engine (i.e. without relying on ORT’s TRT backend)?
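Something along these lines would be a minimal native build to compare against (a sketch with placeholder paths; a model with dynamic input shapes would also need an optimization profile, and int8 would need calibration or Q/DQ nodes, both omitted here):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("decoder_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
# cap the builder workspace (1 GiB here) to see how it affects peak memory
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# build and serialize the engine so it can be reused later
engine_bytes = builder.build_serialized_network(network, config)
with open("decoder_model.plan", "wb") as f:
    f.write(engine_bytes)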

Hi Regisss,

Yes, I tried it, and it required much less memory. It gets a little better if I set the workspace size, but it seems the memory requirements are then bounded by the workspace size. Am I doing something wrong?
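For reference, the workspace knob I mean is the TensorRT execution provider option trt_max_workspace_size, set roughly like this (the 2 GiB value is just an example):

provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_path,
    "trt_int8_enable": True,
    # builder workspace limit in bytes (2 GiB here as an example)
    "trt_max_workspace_size": 2 * 1024 * 1024 * 1024,
}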

Are you able to reproduce this issue with a model that is publicly accessible on the Hugging Face Hub? That way I can reproduce it and start investigating in the next few days.

Sorry for the late response. However, for me the problem is solved, because I switched to Accelerate and bitsandbytes.

The model of interest is codegen-350M-multi.