Slow response time for Gradio API

I have created a Gradio UI (with routes for each click event listener) and it is running on a ZeroGPU Space. I am using it via the Gradio API in the mobile application we have made.

It works fine in terms of input and output, but the issue is the response time. It takes more than 2 minutes to fetch the output. From the console log, the model only takes 8-9 seconds to complete inference; the rest of the time is spent sending the output file to the device where the API was called.
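(For reference, this is roughly how I measure the round trip from the client side; the Space id and endpoint name below are placeholders, not my actual ones.)

```python
import time
from gradio_client import Client

# Placeholder Space id and endpoint name, not the real ones.
client = Client("username/space-name")

start = time.perf_counter()
result = client.predict("example input", api_name="/predict")
elapsed = time.perf_counter() - start

# The console log on the Space shows ~8-9 s of inference; everything
# beyond that in `elapsed` is queueing plus transferring the output back.
print(f"end-to-end: {elapsed:.1f}s")
print(result)
```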

When it comes to inferencing, the largest bottleneck is almost always the network connection. I've actually never made a mobile app, but for my chatbot I run it locally over LAN with high-speed internet.
I was originally going to use Spaces for testing purposes, but for the live website I needed a faster solution. So whilst I don't use Spaces, I suspect that if you wanted to use the app commercially, Enterprise Spaces would be as instant as any other app.

I spent a lot of time investigating options for network speeds for inferencing (on a very tight budget), so if you need any info on that I may have a solution. Going mobile adds another potential bottleneck in the mobile's own network connection, but I would expect that to be minimal.

I'm no expert though, so I'm just offering what knowledge I thought would help :slight_smile:

Thanks for replying,

I was trying out certain POCs that required high-end GPUs, which I don't have. I started with a Colab notebook (it has a T4 GPU) and everything worked fine, but there is a time limit on the GPU runtime, I think around 4-5 hours now. Then I heard about the HF Pro account. I thought I would deploy the model on a ZeroGPU Space and use it continuously via the Gradio API, but then I ran into this issue, which is still a big problem.

Is there any free or low-budget deployment option for a model that gives a fast response time, keeps the model running, and has a good GPU?

Oh, I didn't realise you had a Pro account. Have you tried 'Spaces Hardware'?
That will likely give much faster speeds, and the payment plan for the CPUs/GPUs has a sleep mode, so when the Space isn't being used, there's no GPU cost.
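If it helps, here's a rough sketch of setting that up programmatically with huggingface_hub; the Space id and hardware tier are just examples, and as far as I know the configurable sleep time only applies to upgraded hardware.

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes a write token is available, e.g. via HF_TOKEN

# Example values only: swap in your Space id and the hardware tier you pay for.
api.request_space_hardware(repo_id="username/space-name", hardware="t4-small")

# Sleep after 30 minutes of inactivity so the GPU isn't billed while idle.
api.set_space_sleep_time(repo_id="username/space-name", sleep_time=30 * 60)
```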
Other things to keep in mind:
- Caching: to avoid repeated inference when similar requests come in (see the sketch after this list).
- Load balancing: to split the inferencing across resources.
- Edge: in terms of inferencing, you shouldn't need anything too high-end if there isn't going to be a large volume of inference at once. I switch hosting between my 4090 and my 4070 Super portable PC, and yes, there is a difference, but it's not as bad as you'd expect. You could get away with a lot less if it were a small model like a 3-ish-B Llama/Gemma. If you go that route, run it from the NVIDIA Workbench WSL if you have an NVIDIA card (just in the WSL, not via Workbench), or use any Ubuntu WSL, a Docker container, or something like a Pop!_OS partition.
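
On the caching point, here's a minimal sketch of what I mean; the function body is just a stand-in for whatever your real inference call is:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_infer(prompt: str) -> str:
    # Stand-in for the real model call: replace this body with your inference code.
    # An identical prompt never triggers a second inference; it comes from the cache.
    return prompt.upper()

print(cached_infer("same prompt"))  # runs "inference"
print(cached_infer("same prompt"))  # served from cache, costs nothing
```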