🔧 Optimizing Phi-4 MM Instruct Vision Model (ONNX Inference)

Hi all,
I’ve optimized a fine-tuned Phi-4 MM Instruct vision model by converting it to ONNX and applying quantization; inference time dropped from 26s ➝ 7s. :tada:

I have a few quick questions:

  1. Audio Removal: Can I safely remove the audio layer if it’s unused? Any tools/docs for stripping unused subgraphs in ONNX?
  2. TensorRT: Can Phi-4 MM or Phi-3.5-V models be accelerated using TensorRT after ONNX export?
  3. Further Optimizations: What else can I try to speed up inference? Any pointers would be appreciated.

1. I’m not sure how useful it is, but there are tools for deleting unused layers from an ONNX graph.
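
For instance, the stock `onnx` package can extract just the subgraph between a chosen set of inputs and outputs, dropping branches (such as an unused audio path) that feed neither. A minimal sketch, assuming hypothetical tensor names (`input_ids`, `pixel_values`, `attention_mask`, `logits`) that you would first confirm with Netron or by printing the graph I/O:

```python
import onnx

# Inspect the real input/output names first -- the names used below
# are assumptions and will differ for your export.
model = onnx.load("phi4_mm_vision.onnx")
print([i.name for i in model.graph.input])
print([o.name for o in model.graph.output])

# extract_model keeps only the nodes between the named inputs and
# outputs; subgraphs that feed neither (e.g. an unused audio branch)
# are dropped from the saved model.
onnx.utils.extract_model(
    input_path="phi4_mm_vision.onnx",
    output_path="phi4_mm_vision_noaudio.onnx",
    input_names=["input_ids", "pixel_values", "attention_mask"],
    output_names=["logits"],
)
```

onnx-graphsurgeon offers similar pruning (disconnect the audio inputs, then call `graph.cleanup()`) if you need finer control.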

2. TensorRT seems to have an ONNX backend, and it also comes with a conversion tool (trtexec) for building engines from ONNX exports.
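
The lower-friction route is usually ONNX Runtime’s TensorRT execution provider, which hands supported subgraphs to TensorRT and falls back to CUDA/CPU for the rest. A hedged sketch (the input name and shape below are placeholders, not the model’s real I/O):

```python
import numpy as np
import onnxruntime as ort

# Requires the onnxruntime-gpu build with TensorRT support installed.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",  # fallback for ops TensorRT can't take
    "CPUExecutionProvider",
]
session = ort.InferenceSession("phi4_mm_vision.onnx", providers=providers)

# Placeholder feed -- replace with your model's actual inputs.
outputs = session.run(None, {"input_ids": np.ones((1, 16), dtype=np.int64)})
```

Whether the full Phi-4 MM / Phi-3.5-V graphs convert cleanly depends on which ops the export produced; the provider fallback at least keeps unsupported nodes running.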

3. The graph optimizations ONNX Runtime offers are summarized in its documentation.
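
In practice the quick win is enabling all graph-level fusions once and caching the optimized model, so the rewriting cost isn’t paid at every startup. A minimal sketch using ONNX Runtime’s session options:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Turn on all graph optimizations: constant folding, node fusions, etc.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Save the optimized graph so later sessions can load it directly.
so.optimized_model_filepath = "phi4_mm_vision_opt.onnx"

session = ort.InferenceSession(
    "phi4_mm_vision.onnx",
    sess_options=so,
    providers=["CPUExecutionProvider"],  # or your GPU provider
)
```

Beyond that, IOBinding can keep tensors on-device between steps, and float16 conversion is worth trying where quantization hurt accuracy.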
