Error executing pinned inference model

Hello @julien-c!

Last week, I uploaded a private Text Generation model to my Hugging Face account…

And then I enabled pinning on that model in our account here:

But when I try to execute an API call on this model, I always get an error message.

The API call looks like this…

curl -X POST \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "Once upon a time, there was a grumpy old toad who",
           "parameters": {"max_length": 500}
         }' \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195

And the error is:

{"error":"We waited for too long for model shaxpir/prosecraft_linear_43195 to load. Please retry later or contact us. For very large models, this might be expected."}

I’ve been trying repeatedly, and waiting long intervals, but I still get this error every time.

It is quite a large model, but there are other larger models on public model cards that don’t seem to suffer from this problem. And I don’t see any documentation about model-size limitations for pinned private models (on CPU or GPU). Is there any guidance on that topic? Or is there anything that the support team can do to help me get un-stuck?

(Also, the “Pricing” page says that paid “Lab” plans come with email support, but the email address doesn’t seem to be published anywhere… I tried emailing but got no response for 9 days, and the obvious address bounced back to me… Can you let me know where to send support emails?)

Thank you so much!!

@Narsil for guidance



Thanks for this report. These are large models and are not deployed automatically.
There was indeed a bug in the pinning system that prevented you from seeing a nice error message.

I tried to load it manually for you, but it seems to be missing its tokenizer, so the API cannot work out of the box. Do you think you could add the missing files so it can work?

Also, for such a large model, and given the max_length you’re expecting, a CPU is unlikely to reply in a timely fashion; a GPU will be required (and most likely FP16 too). Can you confirm whether your model is FP16-enabled? By default we don’t make that assumption, to avoid returning incorrect results.

Thank you very much!


@Narsil I’ve added the tokenizer files, could you try manually loading it again please? Re: FP16, yep, the model can be used in FP16.

(I’m helping @benjismith on this 🙂)

Okay @Narsil, thank you for your help! We updated the model with the tokenizer files, but I still can’t get the inference API to return a result. CPU pinning still results in the same error message about waiting too long. And GPU pinning seems to work in the web UI (e.g., clicking the “PIN” button and choosing “GPU”), but when I try to invoke the inference API like this:

curl -X POST \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "Once upon a time, there was a grumpy old toad who",
           "options": {
             "wait_for_model": true,
             "use_gpu": true
           },
           "parameters": {"max_length": 10}
         }' \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195

…I get this error:

{"error":"You are not allowed to run GPU requests, please check your subscription plan on or contact"}
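For reference, here is a minimal sketch of the same request in Python using only the standard library. The endpoint URL is taken from the model name in the error message above; the token is a placeholder, and the `query` helper is just an illustration, not an official client:

```python
import json
import urllib.request

# Placeholders -- substitute your own token before running.
API_URL = "https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195"
TOKEN = "<<REDACTED>>"

# The JSON body: "inputs", "options", and "parameters" are sibling keys.
payload = {
    "inputs": "Once upon a time, there was a grumpy old toad who",
    "options": {"wait_for_model": True, "use_gpu": True},
    "parameters": {"max_length": 10},
}

def query(url: str, token: str, body: dict) -> dict:
    """POST the JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# result = query(API_URL, TOKEN, payload)  # requires a valid token
```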

But I have the $200 “Lab” plan, and your pricing page says:

Pin models on CPU or GPU

Instant availability for inference - $50/mo on CPU, $200/mo on GPU

Can you help me resolve this?

(Also, I tried to cancel-and-restart my plan, thinking that might help, but I can’t restart it until the billing period ends… So if you can un-cancel my plan while you’re fixing the other problem, that’d be great!)

Thank you!!

@Narsil Any chance you can help with this today?