Accelerated Inference API can't load a model on GPU

Hi, I am trying to run models via the Accelerated Inference API on GPU.
I have already subscribed to the Community Pro plan in order to use GPU.
The Inference API works fine when I use CPU.

However, the API cannot load models on GPU.
I used the following code (where MY_API_TOKEN is replaced by my actual API token string):

import requests
import json

API_TOKEN = "MY_API_TOKEN"
API_URL = "https://api-inference.huggingface.co/models/Salesforce/codegen-350M-multi"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Ask the API to run this model on GPU (requires a Community Pro plan).
payload = {"inputs": "def hello_world():", "options": {"use_gpu": True}}
data = json.dumps(payload)

response = requests.request("POST", API_URL, headers=headers, data=data)
post_result = json.loads(response.content.decode("utf-8"))
print(post_result)

Then I get the message below no matter how long I wait:

{'error': 'Model Salesforce/codegen-350M-multi is currently loading', 'estimated_time': 31.894704818725586}

What’s wrong with the accelerated inference API?
How can I load models on GPU?
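For reference, the usual way to handle the loading delay would be something like the sketch below, using the documented wait_for_model option and the estimated_time field from the error (in my case, though, the GPU model never finishes loading):

import time
import requests

API_URL = "https://api-inference.huggingface.co/models/Salesforce/codegen-350M-multi"
headers = {"Authorization": "Bearer MY_API_TOKEN"}
payload = {
    "inputs": "def hello_world():",
    # wait_for_model asks the API to block until the model is loaded
    # instead of returning the "currently loading" error immediately.
    "options": {"use_gpu": True, "wait_for_model": True},
}

for attempt in range(10):
    response = requests.post(API_URL, headers=headers, json=payload)
    result = response.json()
    if not (isinstance(result, dict) and "currently loading" in result.get("error", "")):
        break  # success, or an unrelated error worth inspecting
    # Fall back to polling: sleep for the server's estimated loading time.
    time.sleep(result.get("estimated_time", 30))

print(result)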


cc @Narsil

@nielsr Thanks for your attention to this topic.
I still can’t load a model on GPU.

Meanwhile, I tried pinning models on CPU (since CPU seems fast enough once the model is loaded),
but I don't think my pinned models are preloaded, because I still get the error message ('error': '…model is currently loading') on the first inference.

Could you look into these issues? (GPU and pinned models)

Hi @kernelpanic ! Thanks for reaching out and letting us know about the issue and error message. We’ll take a look to see what’s going on and follow-up soon.
Thanks again!
Michelle

Hi @michellehbn. Today I received a bill including $120 for Pinned Models on CPU and $5 for Pinned Models on GPU, which I was never able to use, as mentioned above :frowning:

Can I get a refund on these pinned models?

Also, I was charged $0.02 for the Accelerated Inference API on GPU.
I don't understand why this is included in the bill, because
(1) the description of the Accelerated Inference API suggests there is no extra charge for GPU if I subscribe to the Community Pro plan, and

  • see "Accelerated inference for a number of supported models on CPU and GPU (GPU requires a Community Pro or Organization Lab plan)" in Overview

(2) I could not use the inference API on GPU as I mentioned in the first post.

Could you explain why $0.02 for Accelerated Inference API on GPU is included in the bill?

Hi @kernelpanic, pricing for the Inference API is based on the number of characters going through the endpoint: you get up to 30k input characters/month for free (on CPU). Pro plan users get up to 1M input characters/month for text tasks and up to 2h of audio for audio tasks, then pay as you go at $10/M characters on CPU or $50/M characters on GPU. Pinning a model costs $1/day on CPU or $5/day on GPU, with any started day billed as a full day.
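To make those rates concrete, here is a worked example (the usage figures are hypothetical, and I'm assuming the included 1M characters are consumed by CPU usage first):

# Hypothetical monthly usage, priced with the rates quoted above.
INCLUDED_CHARS = 1_000_000   # included with the Pro plan (text tasks)
CPU_RATE = 10 / 1_000_000    # $10 per million characters on CPU
GPU_RATE = 50 / 1_000_000    # $50 per million characters on GPU

cpu_chars = 3_000_000        # hypothetical CPU characters this month
gpu_chars = 200_000          # hypothetical GPU characters this month
pinning = 30 * 1 + 1 * 5     # 30 days pinned on CPU + 1 day pinned on GPU

# Assumption: the included quota is consumed by CPU usage first.
usage = max(0, cpu_chars - INCLUDED_CHARS) * CPU_RATE + gpu_chars * GPU_RATE
print(f"usage: ${usage:.2f}, pinning: ${pinning}, total: ${usage + pinning:.2f}")
# usage: $30.00, pinning: $35, total: $65.00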

Would you still like us to continue working on loading 'Salesforce/codegen-350M-multi' on GPU?

Sorry to hear about the pinned models you were unable to use, would you mind sending an email to api-enterprise@huggingface.co with the invoice or invoice number please? Thanks so much!

Thank you so much for the kind reply @michellehbn.
The current pricing page has none of the information you just explained.
I think it should be clearly described on the pricing page.
(Please let me know if there is an official pricing page covering this.)

I would still like you to continue working on loading the model on GPU.
The reason I asked for a refund is that I couldn't use it (not because I don't need it…).

I sent an email to api-enterprise@huggingface.co.
I will post an update as soon as I receive a response.

Thanks again.

Sure thing @kernelpanic! I haven't seen the email come through just yet; do you mind sending it to me at michelle@huggingface.co as well, please? Sorry about that, I'll help follow this through.

Hi @michellehbn, I sent another email to michelle@huggingface.co with the invoice number.
Thanks for your support and help.

Hi @kernelpanic! Thanks for bearing with us and sorry for the delay in follow-up! :hugs: We've made some changes to the Inference API: it is now free to use, as a solution for easily exploring and evaluating models, and Inference Endpoints is our new paid inference solution for production use cases! The free Inference API is subject to rate limiting for heavy use cases.

For larger volumes of requests, or if you need guaranteed latency/performance, you can use our new solution Inference Endpoints to easily deploy your models on dedicated, fully-managed infrastructure. Inference Endpoints gives you the flexibility to quickly create endpoints on CPU or GPU resources, and is billed by compute uptime rather than character usage. Further pricing information can be found here.
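For example, once an endpoint is deployed, calling it looks much like calling the Inference API; a minimal sketch (the endpoint URL below is a placeholder for the one shown on your endpoint's page):

import requests

# Placeholder: copy the real URL from your endpoint's overview page.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer MY_API_TOKEN"}

# Same request/response format as the Inference API for supported tasks.
response = requests.post(
    ENDPOINT_URL,
    headers=headers,
    json={"inputs": "def hello_world():"},
)
print(response.json())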

We’re all ears for feedback so please let us know what you think! Thanks so much!

Hi @michellehbn. Thank you very much for letting me know the update.
It is amazing that you made the Inference API free to use.
I will take a look at the new inference solution as well.

Thank you and have a nice day!

Hello @michellehbn,

Is this still valid for the Inference API: with a Pro account we get 1M characters per month, and if we hit that limit in the same month we pay $10 per additional million characters of usage?

Or do we have to migrate to Inference Endpoints?

Hi @metkor! Happy new year! Sorry for my delay in response. The PRO subscription gives you higher Inference API rate limits than the free Inference API plan. The pricing mentioned earlier ($10/M characters, etc.) is no longer valid. If you need guaranteed latency/performance, you should use our new solution, Inference Endpoints, to easily deploy your models on dedicated, fully-managed infrastructure; it is billed by compute uptime rather than character usage. Further pricing information can be found here. Please let us know if there are any other questions! Thanks again!