Accelerated Inference API can't load a model on GPU

Hi, I am trying to run models on GPU via the Accelerated Inference API.
I have already subscribed to the Community Pro plan in order to use the GPU.
The Inference API works fine when I use the CPU.

However, the API cannot load models on GPU.
I used the following code (where MY_API_TOKEN is replaced by my actual API token string):

import requests
import json

API_TOKEN = "MY_API_TOKEN"
API_URL = "https://api-inference.huggingface.co/models/Salesforce/codegen-350M-multi"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
payload = {"inputs": "def hello_world():", "options": {"use_gpu": True}}
data = json.dumps(payload)
response = requests.request("POST", API_URL, headers=headers, data=data)
post_result = json.loads(response.content.decode("utf-8"))
print(post_result)

Then I get the message below no matter how long I wait:

{"error": "Model Salesforce/codegen-350M-multi is currently loading", "estimated_time": 31.894704818725586}

What’s wrong with the accelerated inference API?
How can I load models on GPU?
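For anyone hitting the same error: the Inference API documentation also describes a `wait_for_model` option, which keeps the request open while the model loads instead of failing fast with the "currently loading" error. A minimal sketch (the token value is a placeholder, and whether `use_gpu` actually takes effect on a given plan is exactly the open question in this thread):

```python
import requests

API_TOKEN = "MY_API_TOKEN"  # placeholder; substitute your real token
API_URL = "https://api-inference.huggingface.co/models/Salesforce/codegen-350M-multi"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

def build_payload(prompt: str) -> dict:
    # wait_for_model asks the API to hold the request open until the
    # model has finished loading, instead of immediately returning the
    # "currently loading" error.
    return {
        "inputs": prompt,
        "options": {"use_gpu": True, "wait_for_model": True},
    }

def query(prompt: str) -> dict:
    # requests serializes the body itself via the json= keyword,
    # so no manual json.dumps is needed.
    return requests.post(API_URL, headers=HEADERS, json=build_payload(prompt)).json()
```

Calling `query("def hello_world():")` should then block until the model is up rather than returning the loading error.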


cc @Narsil

@nielsr Thanks for your attention to this topic.
I still can’t load a model on GPU.

Meanwhile, I tried pinning models on CPU (since CPU seems fast enough once the model is loaded),
but my pinned models do not appear to be preloaded, because I still get the error message ("error": "…model is currently loading") on the first inference.

Could you look into these issues? (GPU and pinned models)

Hi @kernelpanic! Thanks for reaching out and letting us know about the issue and error message. We’ll take a look to see what’s going on and follow up soon.
Thanks again!
Michelle

Hi @michellehbn . Today I received a bill including $120 for Pinned Models on CPU and $5 for Pinned Models on GPU, which I could never use as mentioned above :frowning:

Can I get a refund on these pinned models?

Also, I was charged $0.02 for the Accelerated Inference API on GPU.
I don’t understand why this is included in the bill because
(1) the description of the Accelerated Inference API suggests there is no extra charge for GPU if I subscribe to the Community Pro plan, and

  • see “Accelerated inference for a number of supported models on CPU and GPU (GPU requires a Community Pro or Organization Lab plan)” in Overview

(2) I could not use the inference API on GPU as I mentioned in the first post.

Could you explain why $0.02 for Accelerated Inference API on GPU is included in the bill?

Hi @kernelpanic, pricing for the Inference API is based on the number of input characters going through the endpoint: you get up to 30k input characters/month for free (on CPU). Pro plan users get up to 1M input characters/month for text tasks and up to 2h of audio for audio tasks, then pay as you go at $10/M characters on CPU and $50/M characters on GPU. Pinning a model costs $1/day on CPU or $5/day on GPU, with any started day billed as a full day.
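To make those rates concrete, here is a small sketch of the billing arithmetic (the numbers are only the figures quoted above, not an official price sheet):

```python
# Rates as quoted in this thread (assumptions, not an official price sheet).
FREE_CHARS_PRO = 1_000_000   # Pro plan: 1M free input characters/month for text
CPU_PER_MILLION = 10         # $10 per million characters on CPU
GPU_PER_MILLION = 50         # $50 per million characters on GPU
PIN_CPU_PER_DAY = 1          # $1/day to pin a model on CPU
PIN_GPU_PER_DAY = 5          # $5/day to pin a model on GPU

def monthly_cost(chars: int, gpu: bool = False,
                 pinned_cpu_days: int = 0, pinned_gpu_days: int = 0) -> float:
    """Estimated monthly bill for a Pro plan user under the quoted rates."""
    billable = max(0, chars - FREE_CHARS_PRO)
    per_million = GPU_PER_MILLION if gpu else CPU_PER_MILLION
    usage = billable * per_million / 1_000_000
    pinning = pinned_cpu_days * PIN_CPU_PER_DAY + pinned_gpu_days * PIN_GPU_PER_DAY
    return usage + pinning

# 1.5M GPU characters in a month: 0.5M billable at $50/M -> $25
print(monthly_cost(1_500_000, gpu=True))                         # 25.0
# 120 pinned-CPU days plus 1 pinned-GPU day, no usage overage -> $125
print(monthly_cost(0, pinned_cpu_days=120, pinned_gpu_days=1))   # 125.0
```

The second example mirrors the $120 + $5 pinning charges discussed in this thread, assuming they correspond to day counts at the quoted per-day rates.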

Would you still like us to continue working on loading Salesforce/codegen-350M-multi on GPU?

Sorry to hear about the pinned models you were unable to use, would you mind sending an email to api-enterprise@huggingface.co with the invoice or invoice number please? Thanks so much!

Thank you so much for the kind reply @michellehbn.
The current pricing page has no information about what you explained; I think it should be clearly described there.
(Please let me know if there is an official pricing page that covers this.)

I would still like you to continue working on loading the model on GPU.
The reason I asked for a refund is that I couldn’t use it (not because I don’t need it…).

I sent an email to api-enterprise@huggingface.co.
I will post an update as soon as I receive a response.

Thanks again.

Sure thing @kernelpanic! I haven’t seen the email come through just yet; would you mind sending it to me at michelle@huggingface.co as well, please? Sorry about that, I’ll help follow this through.

Hi @michellehbn, I sent another email to michelle@huggingface.co with the invoice number.
Thanks for your support and help.