Error executing pinned inference model

Hello @julien-c!

Last week, I uploaded a private Text Generation model to my Hugging Face account…

And then I enabled pinning on that model in our account here:

But when I try to execute an API call on this model, I always get an error message.

The API call looks like this…

curl -X POST \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195 \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "Once upon a time, there was a grumpy old toad who",
           "parameters": {"max_length": 500}
         }'

And the error is:

{"error":"We waited for too long for model shaxpir/prosecraft_linear_43195 to load. Please retry later or contact us. For very large models, this might be expected."}

I’ve been trying repeatedly, and waiting long intervals, but I still get this error every time.

It is quite a large model, but there are other, larger models on public model cards that don’t seem to suffer from this problem. And I don’t see any documentation about model-size limitations for pinned private models (on CPU or GPU). Is there any guidance on that topic? Or is there anything the support team can do to help me get unstuck?
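(For anyone else hitting this: the "waited for too long" error can sometimes be avoided by passing the Inference API's `wait_for_model` option, which asks the server to block until the model has loaded instead of erroring. A minimal standard-library sketch of the same request as the curl above; the token and prompt are placeholders:)

```python
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195"

def build_request(prompt: str, token: str, max_length: int = 500) -> urllib.request.Request:
    """Build the same POST the curl command sends, with wait_for_model set
    so the API blocks until the model has loaded rather than timing out."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_length": max_length},
        "options": {"wait_for_model": True},  # wait instead of returning the load error
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

if __name__ == "__main__":
    req = build_request("Once upon a time, there was a grumpy old toad who", "hf_xxx")
    # Network call -- run manually with a real token:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```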

(Also, the “Pricing” page says that paid “Lab” plans come with email support, but the email address doesn’t seem to be published anywhere… I tried emailing but got no response for 9 days, and the obvious address bounced back to me… Can you let me know where to send support emails?)

Thank you so much!!

@Narsil for guidance



Thanks for this report. These are large models and are not deployed automatically.
There was indeed a bug in the pinning system that prevented you from seeing a nice error message.

I tried to load it manually for you, but it seems to be missing its tokenizer, so the API cannot work out of the box. Do you think you could add the missing files so it can work?
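(For anyone debugging the same symptom: a quick sanity check is to compare the repo's file listing against the files a GPT-2/GPT-J-style fast tokenizer normally ships with. A minimal sketch; the exact file set varies by tokenizer class, so treat this list as an assumption, not an official requirement:)

```python
# Files a GPT-2/GPT-J-style tokenizer typically ships with on the Hub
# (assumption -- other tokenizer classes use different files).
EXPECTED_TOKENIZER_FILES = {
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
}

def missing_tokenizer_files(repo_files) -> list:
    """Return which expected tokenizer files are absent from a repo file listing."""
    return sorted(EXPECTED_TOKENIZER_FILES - set(repo_files))
```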

Also, for such a large model, and given the max_length you’re requesting, it’s unlikely that CPU will reply in a timely fashion; GPU will be required (and most likely FP16 too). Can you confirm that your model is FP16-enabled? By default we don’t assume FP16, to avoid returning incorrect results.
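(For context on why FP16 matters here, a back-of-the-envelope calculation. Assuming a GPT-J-class model with roughly 6B parameters, which is an assumption about this particular model, the raw weights alone come out to:)

```python
def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw model weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 6e9  # assumed GPT-J-class parameter count

fp32_gb = weights_gb(N_PARAMS, 4)  # 24.0 GB in float32
fp16_gb = weights_gb(N_PARAMS, 2)  # 12.0 GB in float16
```

At float32, the weights alone would overflow a typical 16 GB inference GPU; halving to float16 leaves headroom for activations and the generation cache.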

Thank you very much!


@Narsil I’ve added the tokenizer files; could you try manually loading it again, please? Re: FP16, yep, the model can be used in FP16.

(I’m helping @benjismith on this :slight_smile: )

Okay @Narsil, thank you for your help! We updated the models with the tokenizer files, but I still can’t get the inference API to return a result. CPU-pinning still results in the same error message about waiting too long. And GPU pinning seems to work in the web UI (e.g., clicking the “PIN” button and choosing “GPU”) but when I try to invoke the inference API like this:

curl -X POST \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195 \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "Once upon a time, there was a grumpy old toad who",
           "options": {
             "wait_for_model": true,
             "use_gpu": true
           },
           "parameters": {"max_length": 10}
         }'

…I get this error:

{"error":"You are not allowed to run GPU requests, please check your subscription plan on or contact"}

But I have the $200 “Lab” plan, and your pricing page says:

Pin models on CPU or GPU

Instant availability for inference - $50/mo on CPU, $200/mo on GPU

Can you help me resolve this?

(Also, I tried to cancel-and-restart my plan, thinking that might help, but I can’t restart it until the billing period ends… So if you can un-cancel my plan while you’re fixing the other problem, that’d be great!)

Thank you!!

@Narsil Any chance you can help with this today?


@Narsil Checking in again, to see if we can get this solved today :smile:

Hello again, @Narsil! Hoping we can take a look at this today :slight_smile:

Hi @benjismith @morgan ,

Sorry about not replying; I have been quite sick the last few days. Not covid, fortunately.

The model is up and running on GPU now.
This should now work:

curl -X POST https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195 -d '{"inputs": "This is a test", "options": {"use_gpu": true}}' -H "Authorization: Bearer ${HF_API_TOKEN}"

Can you confirm?

The error you were seeing is indeed pretty odd, since you are indeed allowed GPU model inference. Did you check your token settings on huggingface.co?


Working like a charm now, thank you so much!!

BTW, in the future, if I want to pin another model on my account (such as the shaxpir/prosecraft_resumed_ft2 model, which is the same size and base-model as the shaxpir/prosecraft_linear_43195 model) will I need to ask for help, or will I be able to self-service those changes?

Currently it’s still not self-serve (since the pinning flow does not check for float16, and the model would exceed the default GPU memory without it).

I did however prepare the necessary config for this other model, so it should work out of the box once you pin it.
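(For readers finding this thread later: the usual way a checkpoint advertises half precision to `transformers` is the `torch_dtype` field in its config.json. This is a guess at the relevant part of the prepared config, not a copy of it:)

```json
{
  "model_type": "gptj",
  "torch_dtype": "float16"
}
```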

Hope this helps.

Perfect, thank you!!

Thanks again for your help this week. I’m very happy with the API, and the model is looking great!

I tried unpinning the model you pinned for me, and then pinning the other one, shaxpir/prosecraft_resumed_ft2, which seemed to work at first (in the pinning UI), but when I called the inference API endpoint (same args as before, just with the new model), I got this response:

{"error":"Could not load model shaxpir/prosecraft_resumed_ft2 with any of the following classes: (<class 'transformers.models.gptj.modeling_gptj.GPTJForCausalLM'>,)."}

Is there something on the model that we still need to fix?

@Narsil Can you take a look at this for me? Thanks!

Hi @benjismith, are you using GPU? The model is loaded and works on GPU (not CPU).

Yes @Narsil, I’m using GPU.

Note that this is my OTHER model. You helped me load the shaxpir/prosecraft_linear_43195 with GPU pinning last week, and it’s been working great!

This other model (shaxpir/prosecraft_resumed_ft2) was built using a similar (but not identical) training process, and when I try running it (after pinning to a GPU device), I always get this error:

{"error":"Could not load model shaxpir/prosecraft_resumed_ft2 with any of the following classes: (<class 'transformers.models.gptj.modeling_gptj.GPTJForCausalLM'>,)."}

My API call looks like this…

curl -i -X POST \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_resumed_ft2 \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "The witch stared through the tiny peephole in her front door and",
           "options": {
             "use_gpu": true,
             "use_cache": false
           },
           "parameters": {
             "return_full_text": false,
             "num_return_sequences": 1,
             "temperature": 1.0,
             "top_p": 0.9,
             "max_new_tokens": 250
           }
         }'

Even though I’m using the use_gpu argument, and I’ve enabled GPU pinning on the model, the response headers look like this:

HTTP/1.1 400 Bad Request
date: Thu, 09 Dec 2021 21:03:22 GMT,Thu, 09 Dec 2021 21:04:29 GMT
server: istio-envoy
content-length: 167
content-type: application/json
x-envoy-upstream-service-time: 22266
x-compute-type: +optimized

On my other model, which ran successfully, I saw the x-compute-type response header had the value gpu+optimized, whereas this one is just +optimized. Maybe that’s a clue that’ll help track down the problem?
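(That header turned out to be a useful signal, so here is a small sketch for checking it programmatically; it assumes the header names and values exactly as shown in this thread:)

```python
def ran_on_gpu(headers: dict) -> bool:
    """True if the Inference API response reports GPU compute.

    In this thread, 'gpu+optimized' accompanied the working GPU-pinned model,
    while a bare '+optimized' accompanied the one that failed to load on GPU.
    """
    return headers.get("x-compute-type", "").startswith("gpu")
```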

You are entirely correct.
I was sure I had tested your other model, but I must have done something wrong, since it had an issue loading.

It is now fixed. Sorry about the confusion; I really thought I had tested both models and that everything was working fine.

Sweeeeeet, it’s working great now! Thank you so much for your help :smile: