Slow response time for Gradio API

I have created a Gradio UI (with routes for each click event listener) and it is running on a ZeroGPU Space. I am using it via the Gradio API in the mobile application we have made.

It works fine in terms of input and output, but the issue is the response time. It takes more than 2 minutes to fetch the output. From the console log, the model only takes 8-9 seconds to complete inference; the rest of the time is spent sending the output file to the device where the API was called.
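(For reference, this is roughly how I measure the round trip from the client side; the Space id and endpoint name below are placeholders, not my actual ones.)

```python
import time
from gradio_client import Client

# Placeholder Space id and endpoint name, not the real ones.
client = Client("username/space-name")

start = time.perf_counter()
result = client.predict("example input", api_name="/predict")
elapsed = time.perf_counter() - start

# The console log on the Space shows ~8-9 s of inference; everything
# beyond that in `elapsed` is queueing plus transferring the output back.
print(f"end-to-end: {elapsed:.1f}s")
print(result)
```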

When it comes to inferencing, the largest bottleneck is almost always the network connection. I've actually never made a mobile app, but for my chatbot I run it locally over LAN with high-speed internet.
I was originally going to use Spaces for testing purposes, but for the live website I needed a faster solution. So whilst I don't use Spaces, I suspect that if you wanted to use the app commercially, Enterprise Spaces would be as instant as any other app.

I spent a lot of time investigating options for network speeds for inferencing (on a very tight budget), so if you need any info on that I may have a solution. Going mobile adds another potential bottleneck in the mobile's own network connection, but I would expect that to be minimal.

I'm no expert though, so I'm just offering what knowledge I thought would help :slight_smile:

Thanks for replying,

I was trying out certain POCs that required high-end GPUs, which I don't have. I started with a Colab notebook (it has a T4 GPU) and everything worked fine, but there is a time limit on the GPU runtime, I think around 4-5 hours now. Then I heard about the HF Pro account. I thought I would deploy the model on a ZeroGPU Space and use it continuously via the Gradio API, but then I ran into this issue, which is still a big problem.

Is there any free or low-budget deployment option for a model that gives a fast response time, keeps the model running, and has a good GPU?

Oh, I didn't realise you had a Pro account. Have you tried 'Spaces Hardware'?
That will likely give much faster speeds, and the payment plan for the CPUs/GPUs has a sleep mode, so when the Space isn't being used, there's no GPU cost.
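If it helps, here's a rough sketch of setting that up programmatically with huggingface_hub; the Space id and hardware tier are just examples, and as far as I know the configurable sleep time only applies to upgraded hardware.

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes a write token is available, e.g. via HF_TOKEN

# Example values only: swap in your Space id and the hardware tier you pay for.
api.request_space_hardware(repo_id="username/space-name", hardware="t4-small")

# Sleep after 30 minutes of inactivity so the GPU isn't billed while idle.
api.set_space_sleep_time(repo_id="username/space-name", sleep_time=30 * 60)
```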
Other things to keep in mind:
- Caching: to avoid repeated inference when similar requests come in (see the sketch after this list).
- Load balancing: to split the inferencing across resources.
- Edge: in terms of inferencing, you shouldn't need anything too high-end if there isn't going to be a large volume of inference at once. I switch hosting between my 4090 and my 4070 Super portable PC, and yes, there is a difference, but it's not as bad as you'd expect. You could get away with a lot less if it were a small model like a 3-ish-B Llama/Gemma. If you go that route, run it from the NVIDIA Workbench WSL if you have an NVIDIA card (just in the WSL, not via Workbench), or use any Ubuntu WSL, a Docker container, or something like a Pop!_OS partition.
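
On the caching point, here's a minimal sketch of what I mean; the function body is just a stand-in for whatever your real inference call is:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_infer(prompt: str) -> str:
    # Stand-in for the real model call: replace this body with your inference code.
    # An identical prompt never triggers a second inference; it comes from the cache.
    return prompt.upper()

print(cached_infer("same prompt"))  # runs "inference"
print(cached_infer("same prompt"))  # served from cache, costs nothing
```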