Exceeded GPU quota

Hi, I am trying to access a Space using the Gradio client (JavaScript). I pass my HF token as an argument to the client.connect function properly, and I am logged in and everything, but I get this error:

You have exceeded your GPU quota (59s left vs. 60s requested). <a href="https://huggingface.co/join">Sign-up on Hugging Face</a> to get more quotas or retry in 0:04:12

Why does it say I am not signed in even when I passed the token? How do I properly use an API from a Space? (Not sure if I explained that right.)
I have something like this in Node.js:

    import { Client } from "@gradio/client";

    const client = await Client.connect("jkorstad/InstantMesh-img-to-3D", {
      hf_token: "xyz", // my real hf_... access token in practice
    });
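For context, after connecting I call the Space roughly like this (the endpoint and parameter names below are placeholders; the real ones are listed on the Space’s “Use via API” page):

    import { readFile } from "node:fs/promises";

    // "/generate" and "input_image" are placeholder names following the
    // "Use via API" page pattern, not verified exact names for this Space.
    const image = new Blob([await readFile("input.png")]); // Blob is global in Node 18+
    const result = await client.predict("/generate", { input_image: image });
    console.log(result.data);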

I am a beginner at all this; I don’t know if getting Pro would solve my problem, since Hugging Face can’t even seem to detect my activity.

I was wrong about this for a long time, but it seems that in this case “sign-in” refers to the following functions, not to the HF token or logging in to HF.
I use Gradio in Python, so the implementation steps are surely different for you and me, but maybe this page will give you a clue.
However, setting this up properly does not eliminate the quota; it only mitigates it.

Thank you @John6666, but can you please tell me more? I am using a model (Img-to-3D Mesh - a Hugging Face Space by jkorstad). I scrolled to the bottom of the page and followed the “Use via API” instructions, then just added my HF token. I don’t want my users to sign in with an HF account, as this is an internal application. So how do I effectively use this model over an API by properly signing in, etc. (not extending the quota)?
Also, once that is done, I assume getting a Pro plan should increase the number of API requests I can make?

Apart from sign-in, there is virtually no straightforward way to mitigate the quota. Although it is not a recommended approach, it is possible to reset the quota count by quickly switching VPNs or proxies to change your IP.

Also, once that is done, I assume getting a Pro plan should increase the number of API requests I can make?

I don’t think there is a normal way to increase that either, other than signing in.
The number of Serverless Inference API requests you can make will definitely increase a bit with a Pro token, but this Space looks like too complicated a process to do with Serverless. Additionally, the exact amount of the increase has been unclear since 2023, as far as the forum is concerned…

HF is currently seeking user opinions, and some simple requests have already been implemented, so if you have any ideas, it might be a good idea to submit them.

@John6666 I didn’t quite understand the sign-in part. Are you saying that if I use a model via Gradio in my application, I need to ask my users to sign in with “their” HF account in the application?
Also, how would the Pro plan be of any use if HF can’t even detect my activity? (Although I did see “last used” being refreshed on the access tokens page.)
Should I go ahead and get a Pro plan to increase my quota?

I need to ask my users to sign in with “their” HF account in the application?

Yes. That method is probably what HF assumes. I don’t know whether you could deliberately sign in with your own Pro account and thereby increase the quota for all users, but that might be one way to do it. Anyway, everything about the Pro account is vague. This is not only my lack of understanding; the explanation simply doesn’t exist.

Also, how would the Pro plan be of any use if HF can’t even detect my activity?

From the standpoint of simply using Spaces, it is of little use. I heard somewhere that the quota is 5 times higher when you are signed in with Pro, but the overwhelming majority of Spaces do not have a sign-in button at all…

If you are in the position of creating Spaces, the biggest advantage is that you can run up to 10 ZeroGPU Spaces at the same time. If this can be cleverly combined with other cloud services, it can work out much cheaper financially.

In the Serverless Inference API, Pro tokens increase the number of models that can be used, in addition to the number of requests you can make (Llama 3 70B, etc.).
However, the specific request limits have been ambiguous for a long time now.

The same is true for the $20 Enterprise plan; the benefits of the flat-rate plans are just plain vague. It is unclear whether the vagueness is deliberate, or whether they simply don’t realize they are not explaining things well enough. :sweat:

@John6666 Thank you for your response. I want to let you know that I am building software, and I am using certain models over the HF API (I think they are called the Inference API). All of that is done on the backend so my users can enjoy the actual application; I don’t want them to sign in to another platform. I mean, isn’t that why we are using an API in the first place?
To be clear, I want to use the API in “production”, i.e. to be able to handle multiple requests (even if that means buying a plan, etc.). Unless I am missing what HF is all about, isn’t it about providing a platform to access high-end models through APIs, and also to host your own?

I can’t vouch for the correctness of this, as I’ve only been here for about six months, but your perception of HF and the API is correct.
Some of my forum mates were making such apps.
The reason that’s in the past tense is that the Serverless Inference API, which we can use with Free and Pro subscriptions, has been significantly degraded in recent months. It has simply been turned off for most models.

If you are willing to pay for a paid plan, as you say, then using HF Inference Endpoints is the way to provide a stable service in the long term.
If you want to avoid the paid plans, then you will have to scrape Gradio to do it.
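If you go the Endpoint route, the call from Node.js is plain HTTPS. A minimal sketch (the URL is a placeholder; you get the real one when you create the endpoint, and the payload shape depends on the model or custom handler):

    // Minimal sketch of calling a dedicated Inference Endpoint.
    // ENDPOINT_URL is a placeholder; the real URL is shown in the
    // endpoint's dashboard after you create it.
    const ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud";

    const response = await fetch(ENDPOINT_URL, {
      method: "POST",
      headers: {
        Authorization: "Bearer hf_xxx", // your HF access token
        "Content-Type": "application/json",
      },
      // The payload shape depends on the model's task / handler.
      body: JSON.stringify({ inputs: "your input here" }),
    });
    if (!response.ok) throw new Error(`Endpoint error: ${response.status}`);
    console.log(await response.json());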

@John6666 Hey, thanks again for your help. Hosting the model on a dedicated machine obviously isn’t cost-friendly, and I thought there might be a way to access these models in a shared way (like HF does for free, but with more quota). What did you say I could do to avoid paid plans? Fast-switching VPNs and proxies? That does sound like a workable idea, but how would you execute it? Please attach any links/content that would be useful for me to achieve that. My software is new, and initially I want to test with some users first and not use Inference Endpoints.

What did you say I could do to avoid paid plans? Fast-switching VPNs and proxies?

Basically, that would be the main method. However, bear in mind that countermeasures may be taken in the future.
I am not familiar with cloud services myself, but I have heard that when you restart a process on a cloud service, the IP usually changes at the same time, so no particular tool seems to be needed.
So I don’t have any suggestions for special tools in this case.
I do know about old-fashioned underground tools up to botnets, as general knowledge, but… you know about those as well, and I don’t think I can recommend that.

My software is new, and initially I want to test with some users first and not use Inference Endpoints.

If it is a well-known model, Serverless should still be available. I’ll go look for a link now.

There is a list of volunteers.

The HF Hub was updated yesterday, and it appears that it is now possible to check whether Inference is available for a given model.

If the status is “warm”, the model can be used as before.
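In case it helps from Node.js, the status can apparently be read from the Hub API. A sketch, assuming the expand[] query parameter and the inference field behave as in the Hub API docs (worth verifying before relying on it):

    // Sketch: ask the Hub API whether serverless inference is available
    // for a model. The expand[] parameter and the "inference" field are
    // assumptions based on the Hub API; verify against current docs.
    const res = await fetch(
      "https://huggingface.co/api/models/meta-llama/Meta-Llama-3-70B-Instruct?expand[]=inference"
    );
    const info = await res.json();
    console.log(info.inference); // e.g. "warm" means usable as before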

Hi @geez9999,
For more information about API quotas you can check out:

You basically get 300 API calls per hour for free,
and with Pro you get 1,000 API calls per hour.
If you are expecting a higher number of API calls per hour in your application, I advise you to deploy your model on a dedicated Inference Endpoint.
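If you call the Serverless API from Node.js, pass your token so the requests count against your account’s hourly quota instead of the anonymous one. A minimal sketch using the @huggingface/inference client (the model name is just the Pro-gated example mentioned earlier in the thread):

    import { HfInference } from "@huggingface/inference";

    // Authenticated client: requests count against your free/Pro quota.
    const hf = new HfInference("hf_xxx");

    const out = await hf.textGeneration({
      model: "meta-llama/Meta-Llama-3-70B-Instruct", // example; pick any warm model
      inputs: "Hello",
      parameters: { max_new_tokens: 32 },
    });
    console.log(out.generated_text);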

Unregistered Users: 1 request per hour

:sob:

That’s for people without an account, I think.

Well, even if they only ever use the app, each user should at least create an HF account…
All they need is a Gmail address.

@not-lain I think I am using a package called @gradio/client from npm to connect to the Space from my Node.js server. What you’re talking about is, I guess, the Inference API, which I don’t see an option for in my Space (link provided above). I barely make 1–2 requests and it times out.
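(One stopgap, if the failures are the quota error quoted at the top of this thread rather than plain timeouts: parse the “retry in H:MM:SS” hint out of the message and wait. A rough sketch, assuming client.predict rejects with that message text:)

    // Rough sketch: retry a Space call after the ZeroGPU quota error.
    // Assumes the error message contains "retry in H:MM:SS", as in the
    // error quoted at the top of this thread.
    async function predictWithRetry(client, endpoint, payload, maxRetries = 2) {
      for (let attempt = 0; ; attempt++) {
        try {
          return await client.predict(endpoint, payload);
        } catch (err) {
          const msg = String(err && err.message ? err.message : err);
          const m = msg.match(/retry in (\d+):(\d+):(\d+)/);
          if (!m || attempt >= maxRetries) throw err;
          const waitMs = ((+m[1] * 60 + +m[2]) * 60 + +m[3]) * 1000;
          await new Promise((resolve) => setTimeout(resolve, waitMs));
        }
      }
    }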

Did anyone test this? I’m not sure whether this applies to ZeroGPU Spaces. I signed in with my Pro account, only to be able to make about 10–20 requests via the Gradio UI to my ZeroGPU Space, each of which generally consumes 30s of GPU time. I would guess the Inference API has a stricter limit than the Gradio UI?