Oh, I didn't realise you had a Pro account. Have you tried 'Spaces Hardware'?
That will likely be much faster, and the payment plan for the CPUs/GPUs has a sleep mode, so there's no GPU cost when it's not being used.
Other things to keep in mind:
- caching: to avoid repeated inference when requests are similar
- load balancing: to split the inference workload across resources
- Edge: In terms of inference, you shouldn't need anything too high-end unless you expect a large number of concurrent requests. I switch hosting between my 4090 and my 4070 Super portable PC, and yes there is a difference, but it's not as bad as you'd expect. You could get away with a lot less if it was a small model like a Llama/Gemma in the ~3B range. If you go that route, run it from the NVIDIA AI Workbench WSL distro if you have an NVIDIA card (just in the WSL, not via Workbench itself), or use any Ubuntu WSL, a Docker container, or something like a Pop!_OS partition.
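To give a quick picture of the caching point above: if identical prompts can recur, memoising the inference call means repeats never hit the GPU. A minimal sketch (`run_model` here is a made-up stand-in for your actual inference call):

```python
from functools import lru_cache

# Hypothetical stand-in for your real model call; swap in your
# actual inference function (e.g. an HF pipeline or llama.cpp call).
def run_model(prompt: str) -> str:
    run_model.calls += 1  # count real inferences, just for the demo
    return f"response to: {prompt}"
run_model.calls = 0

@lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    # Identical prompts skip the model entirely on repeat calls.
    return run_model(prompt)

cached_generate("hello")  # real inference
cached_generate("hello")  # served from cache
print(run_model.calls)    # -> 1
```

This only works cleanly if your generation is deterministic per prompt (e.g. temperature 0); for sampled outputs you'd cache deliberately, knowing repeats return the same text.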
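And for the load-balancing point, the simplest version is just round-robin over your hosts. A sketch with made-up backend URLs (e.g. the 4090 box and the 4070 laptop):

```python
from itertools import cycle

# Hypothetical backend URLs -- replace with your actual inference hosts.
backends = cycle(["http://4090-box:8000", "http://4070-laptop:8000"])

def pick_backend() -> str:
    # Round-robin: each call hands out the next host in turn,
    # so requests alternate across the available machines.
    return next(backends)

print(pick_backend())  # http://4090-box:8000
print(pick_backend())  # http://4070-laptop:8000
print(pick_backend())  # http://4090-box:8000
```

A real setup would usually put nginx or a queue in front instead, but the idea is the same: spread requests so no single GPU becomes the bottleneck.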