Accelerated Inference API can't load a model on GPU

Hi, I am trying to run models via the Accelerated Inference API on GPU.
I have already subscribed to the Community Pro plan in order to use GPU.
The Inference API works fine when I use CPU.

However, the API cannot load models on GPU.
I used the following code (where MY_API_TOKEN is replaced by my actual API token string):

import requests
import json

API_TOKEN = "MY_API_TOKEN"
API_URL = "https://api-inference.huggingface.co/models/Salesforce/codegen-350M-multi"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Ask the API to run this model on GPU (requires a Community Pro plan).
payload = {"inputs": "def hello_world():", "options": {"use_gpu": True}}
data = json.dumps(payload)

response = requests.request("POST", API_URL, headers=headers, data=data)
post_result = json.loads(response.content.decode("utf-8"))
print(post_result)

Then I get the message below no matter how long I wait:

{'error': 'Model Salesforce/codegen-350M-multi is currently loading', 'estimated_time': 31.894704818725586}

What’s wrong with the accelerated inference API?
How can I load models on GPU?
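For reference, the usual way to handle the loading delay would be something like the sketch below, using the documented wait_for_model option and the estimated_time field from the error (in my case, though, the GPU model never finishes loading):

import time
import requests

API_URL = "https://api-inference.huggingface.co/models/Salesforce/codegen-350M-multi"
headers = {"Authorization": "Bearer MY_API_TOKEN"}
payload = {
    "inputs": "def hello_world():",
    # wait_for_model asks the API to block until the model is loaded
    # instead of returning the "currently loading" error immediately.
    "options": {"use_gpu": True, "wait_for_model": True},
}

for attempt in range(10):
    response = requests.post(API_URL, headers=headers, json=payload)
    result = response.json()
    if not (isinstance(result, dict) and "currently loading" in result.get("error", "")):
        break  # success, or an unrelated error worth inspecting
    # Fall back to polling: sleep for the server's estimated loading time.
    time.sleep(result.get("estimated_time", 30))

print(result)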


cc @Narsil

@nielsr Thanks for your attention to this topic.
I still can’t load a model on GPU.

Meanwhile, I tried pinning models on CPU (since CPU seems fast enough once the model is loaded),
but I don't think my pinned models are preloaded, because I still get the error message ('error': '…model is currently loading') on the first inference.

Could you look into these issues? (GPU and pinned models)

Hi @kernelpanic ! Thanks for reaching out and letting us know about the issue and error message. We’ll take a look to see what’s going on and follow-up soon.
Thanks again!
Michelle

Hi @michellehbn. Today I received a bill including $120 for Pinned Models on CPU and $5 for Pinned Models on GPU, which I was never able to use, as mentioned above :frowning:

Can I get a refund on these pinned models?

Also, I was charged $0.02 for the Accelerated Inference API on GPU.
I don't understand why this is included in the bill, because
(1) the description of the Accelerated Inference API suggests there is no extra charge for GPU if I subscribe to the Community Pro plan, and

  • see "Accelerated inference for a number of supported models on CPU and GPU (GPU requires a Community Pro or Organization Lab plan)" in Overview

(2) I could not use the inference API on GPU as I mentioned in the first post.

Could you explain why $0.02 for Accelerated Inference API on GPU is included in the bill?

Hi @kernelpanic, pricing for the Inference API is based on the number of characters going through the endpoint: you get up to 30k input characters/month for free (on CPU). Pro plan users get up to 1M input characters/month for text tasks and up to 2h of audio for audio tasks, then pay as you go at $10/M characters on CPU or $50/M characters on GPU. Pinning a model costs $1/day on CPU or $5/day on GPU, with any started day billed as a full day.
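To make those rates concrete, here is a worked example (the usage figures are hypothetical, and I'm assuming the included 1M characters are consumed by CPU usage first):

# Hypothetical monthly usage, priced with the rates quoted above.
INCLUDED_CHARS = 1_000_000   # included with the Pro plan (text tasks)
CPU_RATE = 10 / 1_000_000    # $10 per million characters on CPU
GPU_RATE = 50 / 1_000_000    # $50 per million characters on GPU

cpu_chars = 3_000_000        # hypothetical CPU characters this month
gpu_chars = 200_000          # hypothetical GPU characters this month
pinning = 30 * 1 + 1 * 5     # 30 days pinned on CPU + 1 day pinned on GPU

# Assumption: the included quota is consumed by CPU usage first.
usage = max(0, cpu_chars - INCLUDED_CHARS) * CPU_RATE + gpu_chars * GPU_RATE
print(f"usage: ${usage:.2f}, pinning: ${pinning}, total: ${usage + pinning:.2f}")
# usage: $30.00, pinning: $35, total: $65.00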

Would you still like us to continue working on loading 'Salesforce/codegen-350M-multi' on GPU?

Sorry to hear about the pinned models you were unable to use, would you mind sending an email to api-enterprise@huggingface.co with the invoice or invoice number please? Thanks so much!

Thank you so much for the kind reply @michellehbn.
The current pricing page has none of the information you just explained.
I think it should be clearly described on the pricing page.
(Please let me know if there is an official pricing page covering this.)

I would still like you to continue working on loading the model on GPU.
The reason I asked for a refund is that I couldn't use it (not because I don't need it…).

I sent an email to api-enterprise@huggingface.co.
I will post an update as soon as I receive a response.

Thanks again.

Sure thing @kernelpanic! I haven't seen the email come through just yet; do you mind sending it to me at michelle@huggingface.co as well, please? Sorry about that, I'll help follow this through.

Hi @michellehbn, I sent another email to michelle@huggingface.co with the invoice number.
Thanks for your support and help.

Hi @kernelpanic! Thanks for bearing with us and sorry for the delay in follow-up! :hugs: We've made some changes to the Inference API: it is now free to use, as a solution for easily exploring and evaluating models, and Inference Endpoints is our new paid inference solution for production use cases! The free Inference API is subject to rate limiting for heavy use cases.

For larger volumes of requests, or if you need guaranteed latency/performance, you can use our new solution Inference Endpoints to easily deploy your models on dedicated, fully-managed infrastructure. Inference Endpoints gives you the flexibility to quickly create endpoints on CPU or GPU resources, and is billed by compute uptime rather than character usage. Further pricing information can be found here.
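For example, once an endpoint is deployed, calling it looks much like calling the Inference API; a minimal sketch (the endpoint URL below is a placeholder for the one shown on your endpoint's page):

import requests

# Placeholder: copy the real URL from your endpoint's overview page.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer MY_API_TOKEN"}

# Same request/response format as the Inference API for supported tasks.
response = requests.post(
    ENDPOINT_URL,
    headers=headers,
    json={"inputs": "def hello_world():"},
)
print(response.json())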

We’re all ears for feedback so please let us know what you think! Thanks so much!

Hi @michellehbn. Thank you very much for letting me know the update.
It is amazing that you made the Inference API free to use.
I will take a look at the new inference solution as well.

Thank you and have a nice day!

Hello @michellehbn,

Is this still valid for the Inference API: with a Pro account we get 1M characters per month, and if we hit that limit in the same month we pay $10 per additional million characters of usage?

Or do we have to migrate to Inference Endpoints?

Hi @metkor! Happy new year! Sorry for my delay in response. The PRO subscription gives you higher Inference API rate limits than the free Inference API plan. The pricing mentioned earlier ($10/M characters, etc.) is no longer valid. If you need guaranteed latency/performance, you should use our new solution, Inference Endpoints, to easily deploy your models on dedicated, fully-managed infrastructure; it is billed by compute uptime rather than character usage. Further pricing information can be found here. Please let us know if there are any other questions! Thanks again!