Hi all,
I’ve optimized a fine-tuned Phi-4 MM Instruct vision model by converting it to ONNX and applying quantization; inference time dropped from 26s ➝ 7s.
I have a few quick questions:
- Audio removal: Can I safely remove the audio branch if it’s unused? Are there tools/docs for stripping unused subgraphs from an ONNX model? (There’s a sketch after this list.)
- TensorRT: Can Phi-4 MM or Phi-3.5-V models be accelerated with TensorRT after ONNX export? (Second sketch below.)
- Further optimizations: What else can I try to speed up inference? (Third sketch below.) Any pointers would be appreciated.
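
For the audio question: `onnx.utils.extract_model` can cut a model down to just the subgraph between a chosen set of input and output tensors, dropping any branch (and its initializers) that isn’t on that path. A minimal sketch with hypothetical file paths and tensor names (inspect your exported graph, e.g. with Netron, to find the real ones):

```python
import onnx
from onnx.utils import extract_model

# Hypothetical names: replace with the actual input/output tensor names
# of the vision+text path in your exported graph.
extract_model(
    "phi4mm.onnx",            # full exported model (assumed path)
    "phi4mm_no_audio.onnx",   # pruned copy without the audio branch
    input_names=["input_ids", "pixel_values", "attention_mask"],
    output_names=["logits"],
)

# Sanity-check the pruned model before swapping it in.
onnx.checker.check_model("phi4mm_no_audio.onnx")
```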
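
For TensorRT: a low-friction route is ONNX Runtime’s TensorRT execution provider, which compiles the parts of the graph TensorRT supports and falls back to the CUDA (then CPU) provider for the rest, so an op TRT can’t handle won’t block you. I can’t say whether the full Phi-4 MM graph builds cleanly; this sketch assumes a GPU build of onnxruntime with TensorRT enabled and the pruned model from above:

```python
import onnxruntime as ort

providers = [
    # Provider options are optional; fp16 usually gives a solid speedup.
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",   # fallback for nodes TensorRT can't take
    "CPUExecutionProvider",
]
session = ort.InferenceSession("phi4mm_no_audio.onnx", providers=providers)
print(session.get_providers())  # confirm TRT actually got picked up
```

Note that the first session creation can be slow while TRT builds its engines; the `trt_engine_cache_enable` provider option caches them across runs.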
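
On further optimizations: with ONNX Runtime, two cheap things to check are that graph optimizations are fully enabled and that you’re using IO binding, so tensors stay on the GPU between calls instead of round-tripping through host memory on every run. A sketch under the same assumed model path and tensor names:

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "phi4mm_no_audio.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# IO binding: allocate the input on the GPU once and bind the output there
# too, avoiding host<->device copies on each call. Tensor names and shapes
# here are placeholders; use the ones from your graph.
binding = session.io_binding()
ids = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 16), dtype=np.int64), "cuda", 0)
binding.bind_ortvalue_input("input_ids", ids)
binding.bind_output("logits", "cuda")
session.run_with_iobinding(binding)
```

Beyond that, Microsoft publishes ONNX builds of several Phi models for the onnxruntime-genai runtime (which also handles KV-cache management); that could be a useful baseline to compare your converted model against.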