Potential Memory Leak for ORTModelForCausalLM with TensorRT Provider

Description

Hi there,

First, thank you for the Optimum library, it really works well. However, I may have found a potential memory leak while working with ORTModelForCausalLM and the TensorRT provider.

We closely follow the Optimum TensorRT guide from the user documentation and run the following code:

import onnxruntime
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL

provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_path,  # engine cache directory, defined elsewhere
    "trt_int8_enable": True,
}

tok = AutoTokenizer.from_pretrained(onnx_path, use_fast=True)  # onnx_path: exported model directory
ort_model = ORTModelForCausalLM.from_pretrained(
    onnx_path,
    export=False,
    provider="TensorrtExecutionProvider",
    use_cache=False,
    session_options=session_options,
    provider_options=provider_options,
    log_severity_level=0,
)

Using this snippet, the ONNX model is converted to a TensorRT engine (probably via trtexec in the background); however, this requires a massive amount of memory (more than 125 GB of RAM for a 350-million-parameter model). Is there any reason for that, or is that expected?
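For completeness, inference with the loaded model afterwards is just a regular generate call; a minimal sketch (the prompt and generation arguments below are only placeholders):

inputs = tok("def hello_world():", return_tensors="pt").to(ort_model.device)
# with use_cache=False every decoding step recomputes the full context
generated = ort_model.generate(**inputs, max_new_tokens=32)
print(tok.batch_decode(generated, skip_special_tokens=True)[0])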

Environment

Docker Image: nvcr.io/nvidia/pytorch:22.12-py3
TensorRT version: 8.501
Python: 3.8.10
Pip packages:
transformers 4.26.1
optimum 1.7.1
onnx 1.12.0
onnxruntime-gpu 1.14.1
pytorch-quantization 2.1.2
pytorch-triton 2.0.0+b8b470bc59
torch 1.13.1
torch-tensorrt 1.3.0
torchtext 0.13.0a0+fae8e8c
torchvision 0.15.0a0

Steps To Reproduce

  1. Convert the transformer model to ONNX via optimum-cli.
  2. Do the quantization (we did it exactly like in the quantization guide).
  3. Run Infer_shape.py as suggested by trtexec (see the sketch after this list).
  4. Run the code from above
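For step 3, a rough sketch of the shape inference call (placeholder paths; this assumes the script in question is ONNX Runtime's symbolic shape inference helper, invoked here through its Python API, and the exact module path may differ between onnxruntime versions):

import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("decoder_model.onnx")
# propagate symbolic shapes so all intermediate tensor shapes are known before TensorRT sees the graph
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "decoder_model_shape_inferred.onnx")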

Thank you very much for any help!

Hi @cdawg, that does sound like a lot of memory for a 350M-parameter model!
Have you had the chance to compare with a native TRT way of building the engine (i.e. without relying on ORT’s TRT backend)?
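Something along these lines would be a minimal native build to compare against (a sketch with placeholder paths; a model with dynamic input shapes would also need an optimization profile, and int8 would need calibration or Q/DQ nodes, both omitted here):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("decoder_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
# cap the builder workspace (1 GiB here) to see how it affects peak memory
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# build and serialize the engine so it can be reused later
engine_bytes = builder.build_serialized_network(network, config)
with open("decoder_model.plan", "wb") as f:
    f.write(engine_bytes)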

Hi Regisss,

Yes, I tried it, and it required much less memory. It gets a little better if I set the workspace size, but it seems the memory requirements are then bounded by the workspace size. Am I doing something wrong?
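For reference, the workspace knob I mean is the TensorRT execution provider option trt_max_workspace_size, set roughly like this (the 2 GiB value is just an example):

provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_path,
    "trt_int8_enable": True,
    # builder workspace limit in bytes (2 GiB here as an example)
    "trt_max_workspace_size": 2 * 1024 * 1024 * 1024,
}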

Are you able to reproduce this issue with a model that is publicly accessible on the Hugging Face Hub? That way I can reproduce it and start investigating in the next few days.

Sorry for the late response. However, for me the problem is solved, because I switched to Accelerate and bitsandbytes.

The model of interest is codegen-350M-multi.