I have an ML model that I want to deploy in the browser. I created the model in PyTorch and converted it to ONNX to run it in the browser using WebGL, but the inference time was high. So I rewrote the model in TensorFlow and deployed it with TFJS. With TFJS on the WebGL backend, the inference time was better than with the ONNX version, but the requirements call for a much lower inference time. I have already tried optimizing the model architecture. I also tried quantization, pruning, and weight clustering, but they only reduce the model size and have no effect on inference time. Could you please suggest something that can reduce the inference time?
The model looks like this: I take the x, y coordinates and apply positional encoding to this 2D data to map it to a higher-dimensional space of size 42. Next there are 5 Dense layers: an input layer with 42 neurons and sine activation, then 3 hidden layers with 256 neurons each and sine activation, and a final layer with 3 neurons.
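To make the layer sizes concrete, here is a rough Python sketch of the per-point cost of the architecture described above. The encoding scheme (raw x, y plus sine and cosine at 10 frequencies per coordinate) is an assumption chosen only because it produces 42 features; the actual encoding may differ. The names `positional_encoding` and `dense_cost` are illustrative, not from any library.

```python
import math

# Assumption: raw (x, y) plus sin/cos at 10 frequencies per coordinate,
# giving 2 + 2 * 2 * 10 = 42 features. The real encoding may differ.
NUM_FREQS = 10


def positional_encoding(x, y, num_freqs=NUM_FREQS):
    """Map a 2D point to a 42-dimensional feature vector (assumed scheme)."""
    features = [x, y]
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for v in (x, y):
            features.append(math.sin(freq * v))
            features.append(math.cos(freq * v))
    return features


# Dense layer widths as described: 42 -> 42 -> 256 -> 256 -> 256 -> 3
LAYER_SIZES = [42, 42, 256, 256, 256, 3]


def dense_cost(sizes):
    """Return (multiply-adds per input point, trainable parameters)."""
    macs = sum(a * b for a, b in zip(sizes, sizes[1:]))
    params = macs + sum(sizes[1:])  # one bias per output neuron
    return macs, params


macs_per_point, num_params = dense_cost(LAYER_SIZES)
print(len(positional_encoding(0.5, 0.25)))  # 42
print(macs_per_point)                       # 144356
print(num_params)                           # 145169
```

Under these assumptions the network is small (about 145k parameters), and the work per inference is roughly the number of coordinates evaluated multiplied by about 144k multiply-adds, so the total cost grows linearly with the number of points.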
I’m getting an inference time of around 30 seconds.