Error executing pinned inference model

Hello @julien-c!

Last week, I uploaded a private Text Generation model to my Hugging Face account…

And then I enabled pinning on that model in our account here:

But when I try to execute an API call on this model, I always get an error message.

The API call looks like this…

curl -X POST \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195 \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "Once upon a time, there was a grumpy old toad who",
           "parameters": {"max_length": 500}
         }'

And the error is:

{"error":"We waited for too long for model shaxpir/prosecraft_linear_43195 to load. Please retry later or contact us. For very large models, this might be expected."}

I’ve been trying repeatedly, and waiting long intervals, but I still get this error every time.

It is quite a large model, but there are other, larger models on public model cards that don’t seem to suffer from this problem. And I don’t see any documentation about model-size limitations for pinned private models (on CPU or GPU). Is there any guidance on that topic? Or is there anything the support team can do to help me get unstuck?
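(For anyone else hitting this: the "waited for too long" error can sometimes be avoided by passing the Inference API's `wait_for_model` option, which asks the server to block until the model has loaded instead of erroring. A minimal standard-library sketch of the same request as the curl above; the token and prompt are placeholders:)

```python
import json
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195"

def build_request(prompt: str, token: str, max_length: int = 500) -> urllib.request.Request:
    """Build the same POST the curl command sends, with wait_for_model set
    so the API blocks until the model has loaded rather than timing out."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_length": max_length},
        "options": {"wait_for_model": True},  # wait instead of returning the load error
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

if __name__ == "__main__":
    req = build_request("Once upon a time, there was a grumpy old toad who", "hf_xxx")
    # Network call -- run manually with a real token:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```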

(Also, the “Pricing” page says that paid “Lab” plans come with email support, but the email address doesn’t seem to be published anywhere… I tried emailing but got no response for 9 days, and the obvious address bounced back to me… Can you let me know where to send support emails?)

Thank you so much!!

@Narsil for guidance



Thanks for this report. These are large models and are not deployed automatically.
There was indeed a bug in the pinning system that prevented you from seeing a nice error message.

I tried to load it manually for you, but it seems to be missing its tokenizer, so the API cannot work out of the box. Do you think you could add the missing files so it can work?
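(For anyone debugging the same symptom: a quick sanity check is to compare the repo's file listing against the files a GPT-2/GPT-J-style fast tokenizer normally ships with. A minimal sketch; the exact file set varies by tokenizer class, so treat this list as an assumption, not an official requirement:)

```python
# Files a GPT-2/GPT-J-style tokenizer typically ships with on the Hub
# (assumption -- other tokenizer classes use different files).
EXPECTED_TOKENIZER_FILES = {
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
}

def missing_tokenizer_files(repo_files) -> list:
    """Return which expected tokenizer files are absent from a repo file listing."""
    return sorted(EXPECTED_TOKENIZER_FILES - set(repo_files))
```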

Also, for such a large model, and given the max_length you’re requesting, it’s unlikely that CPU will reply in a timely fashion; GPU will be required (and most likely FP16 too). Can you confirm that your model is FP16-enabled? By default we don’t assume FP16, to avoid returning incorrect results.
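(For context on why FP16 matters here, a back-of-the-envelope calculation. Assuming a GPT-J-class model with roughly 6B parameters, which is an assumption about this particular model, the raw weights alone come out to:)

```python
def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw model weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 6e9  # assumed GPT-J-class parameter count

fp32_gb = weights_gb(N_PARAMS, 4)  # 24.0 GB in float32
fp16_gb = weights_gb(N_PARAMS, 2)  # 12.0 GB in float16
```

At float32, the weights alone would overflow a typical 16 GB inference GPU; halving to float16 leaves headroom for activations and the generation cache.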

Thank you very much!


@Narsil I’ve added the tokenizer files; could you try manually loading it again, please? Re: FP16, yep, the model can be used in FP16.

(I’m helping @benjismith on this :slight_smile: )

Okay @Narsil, thank you for your help! We updated the models with the tokenizer files, but I still can’t get the inference API to return a result. CPU-pinning still results in the same error message about waiting too long. And GPU pinning seems to work in the web UI (e.g., clicking the “PIN” button and choosing “GPU”) but when I try to invoke the inference API like this:

curl -X POST \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195 \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "Once upon a time, there was a grumpy old toad who",
           "options": {
             "wait_for_model": true,
             "use_gpu": true
           },
           "parameters": {"max_length": 10}
         }'

…I get this error:

{"error":"You are not allowed to run GPU requests, please check your subscription plan on or contact"}

But I have the $200 “Lab” plan, and your pricing page says:

Pin models on CPU or GPU

Instant availability for inference - $50/mo on CPU, $200/mo on GPU

Can you help me resolve this?

(Also, I tried to cancel-and-restart my plan, thinking that might help, but I can’t restart it until the billing period ends… So if you can un-cancel my plan while you’re fixing the other problem, that’d be great!)

Thank you!!

@Narsil Any chance you can help with this today?


@Narsil Checking in again, to see if we can get this solved today :smile:

Hello again, @Narsil! Hoping we can take a look at this today :slight_smile:

Hi @benjismith @morgan ,

Sorry about not replying; I have been quite sick the last few days. Not covid, fortunately.

The model is up and running on GPU now.
This should now work:

curl -X POST https://api-inference.huggingface.co/models/shaxpir/prosecraft_linear_43195 -d '{"inputs": "This is a test", "options": {"use_gpu": true}}' -H "Authorization: Bearer ${HF_API_TOKEN}"

Can you confirm?

The error you were seeing is indeed pretty odd, since you are indeed allowed GPU model inference. Did you check your token settings on huggingface.co?


Working like a charm now, thank you so much!!

BTW, in the future, if I want to pin another model on my account (such as the shaxpir/prosecraft_resumed_ft2 model, which is the same size and base-model as the shaxpir/prosecraft_linear_43195 model) will I need to ask for help, or will I be able to self-service those changes?

Currently it’s still not self-serve (since the pinning flow does not check for float16, and the model would exceed the default GPU memory without it).

I did however prepare the necessary config for this other model, so it should work out of the box once you pin it.
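(For readers finding this thread later: the usual way a checkpoint advertises half precision to `transformers` is the `torch_dtype` field in its config.json. This is a guess at the relevant part of the prepared config, not a copy of it:)

```json
{
  "model_type": "gptj",
  "torch_dtype": "float16"
}
```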

Hope this helps.

Perfect, thank you!!

Thanks again for your help this week. I’m very happy with the API, and the model is looking great!

I tried unpinning the model you pinned for me, and then pinning the other one, shaxpir/prosecraft_resumed_ft2, which seemed to work at first (in the pinning UI), but when I called the inference API endpoint (same args as before, just with the new model), I got this response:

{"error":"Could not load model shaxpir/prosecraft_resumed_ft2 with any of the following classes: (<class 'transformers.models.gptj.modeling_gptj.GPTJForCausalLM'>,)."}

Is there something on the model that we still need to fix?

@Narsil Can you take a look at this for me? Thanks!

Hi @benjismith, are you using GPU? The model is loaded and works on GPU (not CPU).

Yes @Narsil, I’m using GPU.

Note that this is my OTHER model. You helped me load the shaxpir/prosecraft_linear_43195 with GPU pinning last week, and it’s been working great!

This other model (shaxpir/prosecraft_resumed_ft2) was built using a similar (but not identical) training process, and when I try running it (after pinning to a GPU device), I always get this error:

{"error":"Could not load model shaxpir/prosecraft_resumed_ft2 with any of the following classes: (<class 'transformers.models.gptj.modeling_gptj.GPTJForCausalLM'>,)."}

My API call looks like this…

curl -i -X POST \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_resumed_ft2 \
     -H "Authorization: Bearer <<REDACTED>>" \
     -H "Content-Type: application/json" \
     -d '{
           "inputs": "The witch stared through the tiny peephole in her front door and",
           "options": {
             "use_gpu": true,
             "use_cache": false
           },
           "parameters": {
             "return_full_text": false,
             "num_return_sequences": 1,
             "temperature": 1.0,
             "top_p": 0.9,
             "max_new_tokens": 250
           }
         }'

Even though I’m using the use_gpu argument, and I’ve enabled GPU pinning on the model, the response headers look like this:

HTTP/1.1 400 Bad Request
date: Thu, 09 Dec 2021 21:03:22 GMT,Thu, 09 Dec 2021 21:04:29 GMT
server: istio-envoy
content-length: 167
content-type: application/json
x-envoy-upstream-service-time: 22266
x-compute-type: +optimized

On my other model, which ran successfully, I saw the x-compute-type response header had the value gpu+optimized, whereas this one is just +optimized. Maybe that’s a clue that’ll help track down the problem?
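(That header turned out to be a useful signal, so here is a small sketch for checking it programmatically; it assumes the header names and values exactly as shown in this thread:)

```python
def ran_on_gpu(headers: dict) -> bool:
    """True if the Inference API response reports GPU compute.

    In this thread, 'gpu+optimized' accompanied the working GPU-pinned model,
    while a bare '+optimized' accompanied the one that failed to load on GPU.
    """
    return headers.get("x-compute-type", "").startswith("gpu")
```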

You are entirely correct.
I was sure I had tested your other model, but I must have done something wrong, since it had an issue loading.

It is now fixed. Sorry about the confusion; I really thought I had tested both models and that everything was working fine.

Sweeeeeet, it’s working great now! Thank you so much for your help :smile: