Description
Hi there,
First of all, thank you for the optimum library, it really works well. However, I may have found a potential memory leak while working with ORTModelForCausalLM and the TensorRT execution provider.
We closely follow the Optimum TensorRT guide from the user guide docs and run the following code:
import onnxruntime
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Disable ORT graph optimizations so that TensorRT receives the original graph
session_options = onnxruntime.SessionOptions()
session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL

# TensorRT EP options: cache the built engine under cache_path and enable INT8
provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_path,
    "trt_int8_enable": True,
}

# onnx_path points to the directory with the exported ONNX model
tok = AutoTokenizer.from_pretrained(onnx_path, use_fast=True)
ort_model = ORTModelForCausalLM.from_pretrained(
    onnx_path,
    export=False,
    provider="TensorrtExecutionProvider",
    use_cache=False,
    session_options=session_options,
    provider_options=provider_options,
    log_severity_level=0,
)
Using this snippet, the ONNX model is converted to a TensorRT engine (presumably the engine is built in the background), however this requires a massive amount of memory: more than 125 GB of RAM for a ~350-million-parameter model. Is there any reason for that, or is it expected?
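In case it helps narrow things down, here is a minimal sketch of a diagnostic variation we could run (the option values below are arbitrary examples, not recommendations): build the engine once without INT8 and with an explicitly capped TensorRT builder workspace, and compare the peak memory usage.

# Diagnostic variation (illustrative values only): build without INT8 and with a
# capped TensorRT builder workspace to see whether peak memory during the build changes.
provider_options_diag = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": cache_path,
    "trt_int8_enable": False,                          # toggle INT8 off for comparison
    "trt_max_workspace_size": 2 * 1024 * 1024 * 1024,  # 2 GiB builder workspace (GPU memory)
}

ort_model_diag = ORTModelForCausalLM.from_pretrained(
    onnx_path,
    export=False,
    provider="TensorrtExecutionProvider",
    use_cache=False,
    session_options=session_options,
    provider_options=provider_options_diag,
)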
Environment
Docker Image: nvcr.io/nvidia/pytorch:22.12-py3
TensorRT Version 8.501
Python 3.8.10
pip packages:
transformers 4.26.1
optimum 1.7.1
onnx 1.12.0
onnxruntime-gpu 1.14.1
pytorch-quantization 2.1.2
pytorch-triton 2.0.0+b8b470bc59
torch 1.13.1
torch-tensorrt 1.3.0
torchtext 0.13.0a0+fae8e8c
torchvision 0.15.0a0
Steps To Reproduce
- Convert the transformer model to ONNX via optimum-cli.
- Do the quantization exactly as described in the quantization guide.
- Run Infer_shape.py as suggested by trtexec (a sketch of this step follows the list).
- Run the code from above.
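For reference, this is a minimal sketch of what we assume the Infer_shape.py step amounts to, using ONNX Runtime's symbolic shape inference helper (the file paths below are placeholders):

import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

# Run symbolic shape inference on the exported model and save the result.
# "model.onnx" / "model_inferred.onnx" are placeholder paths.
model = onnx.load("model.onnx")
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "model_inferred.onnx")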
Thank you very much for any help!