How to decrease inference time of model

unigatchi · February 2, 2023, 3:17am

I have an ML model that I want to deploy in the browser. I built the model in PyTorch and converted it to ONNX to run it in the browser using WASM and WebGL, but the inference time was high. So I rewrote the model in TensorFlow and deployed it with TFJS. With TFJS on WebGL, the inference time was better than with ONNX on WebGL, but the requirements call for a much lower inference time. I have already tried optimizing the model architecture. I also tried quantization, pruning, and weight clustering, but they only reduce the model size and have no effect on inference time. Could you please suggest something that can reduce the inference time?

The model looks like this:
I take the x, y coordinates and apply positional encoding to this 2D data to map it to a higher-dimensional space of size 42. Next there are 5 Dense layers: an input layer with 42 neurons and sine activation, then 3 hidden layers with 256 neurons each and sine activation, and a final layer with 3 neurons.
I’m getting an inference time of around 30 sec.
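
To make the description above concrete, here is a rough sketch of the encoding and a per-coordinate cost estimate in plain Python. The layer widths (42, 256, 256, 256, 3 on a 42-dimensional input) come from the description. The encoding itself is an assumption: one common way to get 42 features from 2 coordinates is to keep the raw x, y and add sine and cosine terms at 10 frequencies (2 + 2 * 2 * 10 = 42), and the power-of-two frequency scale is also assumed. The function names are just for illustration.

```python
import math

# Assumption: 10 frequency bands plus the raw coordinates gives
# 2 + 2 * 2 * 10 = 42 features. The actual encoding may differ.
NUM_BANDS = 10


def positional_encoding(x, y, num_bands=NUM_BANDS):
    """Map (x, y) to raw coords plus sin/cos terms at frequencies 2^k."""
    features = [x, y]
    for k in range(num_bands):
        freq = 2.0 ** k  # assumed frequency scale
        features.append(math.sin(freq * x))
        features.append(math.sin(freq * y))
        features.append(math.cos(freq * x))
        features.append(math.cos(freq * y))
    return features


# Encoded input size followed by the output width of each of the 5 Dense layers.
LAYER_WIDTHS = [42, 42, 256, 256, 256, 3]


def dense_cost(widths):
    """Return (parameter count, multiply-accumulates per input) for a Dense stack."""
    params = 0
    macs = 0
    for fan_in, fan_out in zip(widths, widths[1:]):
        macs += fan_in * fan_out               # one weight matrix multiply
        params += fan_in * fan_out + fan_out   # weights plus biases
    return params, macs


if __name__ == "__main__":
    encoded = positional_encoding(0.5, -0.25)
    params, macs = dense_cost(LAYER_WIDTHS)
    print("encoded features:", len(encoded))
    print("parameters:", params)
    print("MACs per coordinate:", macs)
```

The total work for one run is roughly the MACs per coordinate multiplied by the number of coordinates evaluated, plus the sine activations. The number of coordinates per run is not stated above, so the sketch stops at the per-coordinate figure.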

Thanks
