How to convert a Hugging Face model to an optimized model with KV caching

Hello,

I am a junior developer who is not yet very familiar with models and optimization. I have made many mistakes along the way, but I am working hard to correct them and learn. Currently, I am trying to optimize the inference speed of the Hugging Face VideoLLaVA model by converting it to TensorRT.

Here is what I have attempted so far:

  1. ONNX Conversion Attempt: Initially, I tried to export the entire forward pass to ONNX by tracing it with torch.jit.trace, but I was unsuccessful. As a result, I converted each of the four component models that make up VideoLLaVA to ONNX individually and wrote a custom function that mimics the forward behavior of the full VideoLLaVA model.

  2. LLM Conversion Process: While converting the Vicuna-7B language model to ONNX, I hit the limitation that the ONNX export cannot accept the nested tuple used for past_key_values as an input. To work around this, I exported the model with caching disabled, i.e. without any KV-cache inputs or outputs (a simplified sketch of this export is included after this list).

  3. TensorRT Conversion and Inference: I used TensorRT version 10 to convert the ONNX models to TensorRT engines and then ran inference (a sketch of the engine build is also included below). Even though I minimized the I/O time between the TensorRT engines, the performance is far behind the Hugging Face model with caching enabled: inference takes approximately 4 seconds with the Hugging Face model, while my custom TRT pipeline takes around 15 seconds.
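For reference, here is a simplified sketch of how the cache-free LLM export looks. This is not my exact script; the checkpoint name, shapes, and opset version are placeholders, and the wrapper class only exists so that torch.onnx.export sees plain tensors instead of the model's usual keyword arguments:

```python
import torch
from transformers import AutoModelForCausalLM


class LlmNoCache(torch.nn.Module):
    """Thin wrapper so the ONNX export only sees plain tensors (no tuples, no cache)."""

    def __init__(self, llm):
        super().__init__()
        self.llm = llm

    def forward(self, inputs_embeds, attention_mask):
        out = self.llm(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            use_cache=False,  # this is exactly the caching I dropped
        )
        return out.logits


# Placeholder checkpoint; in my real script the LLM is pulled out of VideoLLaVA.
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5").eval()
wrapper = LlmNoCache(llm)

batch, seq_len, hidden = 1, 32, llm.config.hidden_size
dummy_embeds = torch.randn(batch, seq_len, hidden)
dummy_mask = torch.ones(batch, seq_len, dtype=torch.int64)

torch.onnx.export(
    wrapper,
    (dummy_embeds, dummy_mask),
    "vicuna_no_cache.onnx",
    input_names=["inputs_embeds", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "inputs_embeds": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
```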
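And this is roughly how each ONNX file was turned into an engine with the TensorRT 10 Python API (again simplified; the input names and shape ranges here just match the export sketch above, and my real script builds the vision-side engines the same way):

```python
import tensorrt as trt


def build_engine(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX file and serialize a TensorRT engine (simplified sketch)."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # explicit batch is the default in TRT 10
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f"Failed to parse {onnx_path}")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # build in half precision for speed

    # Dynamic shapes need an optimization profile matching the ONNX dynamic axes.
    profile = builder.create_optimization_profile()
    profile.set_shape("inputs_embeds", min=(1, 1, 4096), opt=(1, 512, 4096), max=(1, 2048, 4096))
    profile.set_shape("attention_mask", min=(1, 1), opt=(1, 512), max=(1, 2048))
    config.add_optimization_profile(profile)

    engine_bytes = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine_bytes)


build_engine("vicuna_no_cache.onnx", "vicuna_no_cache.engine")
```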

I have a couple of questions regarding my approach:

  1. When optimizing a multimodal model like VideoLLaVA, is it correct to split it into its component models and convert each one separately to ONNX and then to TRT, as I did? Is there a better approach?
  2. Is there a way to keep KV caching when converting Hugging Face models to ONNX or TRT? (See the sketch below for what I have in mind.)
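To make question 2 more concrete: since the ONNX export cannot take the past_key_values tuple directly, I assume the cache would have to be flattened into individual per-layer key/value tensors, roughly like the hypothetical wrapper below. I have not managed to export anything like this successfully, so the class name and its interface are only my guess, and it likely depends on the transformers version (I am using the DynamicCache helpers here):

```python
import torch
from transformers import DynamicCache


class LlmWithFlatCache(torch.nn.Module):
    """Hypothetical wrapper: per-layer key/value tensors are passed in and returned
    as flat tensors, so the ONNX graph never has to deal with a tuple-of-tuples."""

    def __init__(self, llm):
        super().__init__()
        self.llm = llm
        self.num_layers = llm.config.num_hidden_layers

    def forward(self, inputs_embeds, attention_mask, *flat_past):
        # Rebuild the legacy ((key, value), ...) layout from the flat tensor list.
        legacy = tuple(
            (flat_past[2 * i], flat_past[2 * i + 1]) for i in range(self.num_layers)
        )
        out = self.llm(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            past_key_values=DynamicCache.from_legacy_cache(legacy),
            use_cache=True,
        )
        # Flatten the updated cache back into plain tensors for the ONNX outputs.
        flat_present = [t for layer in out.past_key_values.to_legacy_cache() for t in layer]
        return (out.logits, *flat_present)
```

If this is completely the wrong direction, I would appreciate being pointed to whatever the standard way of exporting a decoder with its cache is.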

I apologize if my questions are somewhat disorganized; I am reaching out for help to make up for the mistakes I have made. Any advice or guidance would be greatly appreciated.

Thank you.