Hi all,
I’ve optimized a fine-tuned Phi-4 MM Instruct vision model by converting it to ONNX and applying quantization; inference time dropped from 26s ➝ 7s.
I have a few quick questions:
- Audio removal: Can I safely remove the audio branch if it’s unused? Are there tools/docs for stripping unused subgraphs from an ONNX model? (There’s a sketch after this list.)
- TensorRT: Can Phi-4 MM or Phi-3.5-V models be accelerated with TensorRT after ONNX export? (Second sketch below.)
- Further optimizations: What else can I try to speed up inference? (Third sketch below.) Any pointers would be appreciated.
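
For the audio question: `onnx.utils.extract_model` can cut a model down to just the subgraph between a chosen set of input and output tensors, dropping any branch (and its initializers) that isn’t on that path. A minimal sketch with hypothetical file paths and tensor names (inspect your exported graph, e.g. with Netron, to find the real ones):

```python
import onnx
from onnx.utils import extract_model

# Hypothetical names: replace with the actual input/output tensor names
# of the vision+text path in your exported graph.
extract_model(
    "phi4mm.onnx",            # full exported model (assumed path)
    "phi4mm_no_audio.onnx",   # pruned copy without the audio branch
    input_names=["input_ids", "pixel_values", "attention_mask"],
    output_names=["logits"],
)

# Sanity-check the pruned model before swapping it in.
onnx.checker.check_model("phi4mm_no_audio.onnx")
```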
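
For TensorRT: a low-friction route is ONNX Runtime’s TensorRT execution provider, which compiles the parts of the graph TensorRT supports and falls back to the CUDA (then CPU) provider for the rest, so an op TRT can’t handle won’t block you. I can’t say whether the full Phi-4 MM graph builds cleanly; this sketch assumes a GPU build of onnxruntime with TensorRT enabled and the pruned model from above:

```python
import onnxruntime as ort

providers = [
    # Provider options are optional; fp16 usually gives a solid speedup.
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",   # fallback for nodes TensorRT can't take
    "CPUExecutionProvider",
]
session = ort.InferenceSession("phi4mm_no_audio.onnx", providers=providers)
print(session.get_providers())  # confirm TRT actually got picked up
```

Note that the first session creation can be slow while TRT builds its engines; the `trt_engine_cache_enable` provider option caches them across runs.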
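
On further optimizations: with ONNX Runtime, two cheap things to check are that graph optimizations are fully enabled and that you’re using IO binding, so tensors stay on the GPU between calls instead of round-tripping through host memory on every run. A sketch under the same assumed model path and tensor names:

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "phi4mm_no_audio.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# IO binding: allocate the input on the GPU once and bind the output there
# too, avoiding host<->device copies on each call. Tensor names and shapes
# here are placeholders; use the ones from your graph.
binding = session.io_binding()
ids = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 16), dtype=np.int64), "cuda", 0)
binding.bind_ortvalue_input("input_ids", ids)
binding.bind_output("logits", "cuda")
session.run_with_iobinding(binding)
```

Beyond that, Microsoft publishes ONNX builds of several Phi models for the onnxruntime-genai runtime (which also handles KV-cache management); that could be a useful baseline to compare your converted model against.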