Inference API stopped working for my model

I have a pinned model (shaxpir/prosecraft_resumed_ft2) on the inference API that has been working well for over a year, but it recently stopped working… When I make a request like this:

curl -i -X POST \
     -H "Authorization: Bearer <REDACTED>" \
     -H "Content-Type: application/json" \
     -d '{
          "inputs": "Once upon a time,",
          "use_gpu": true,
          "use_cache": false,
          "parameters": {
            "return_full_text": false,
            "num_return_sequences": 1,
            "temperature": 1.0,
            "top_p": 0.9,
            "max_new_tokens": 250
          }
        }' \
     https://api-inference.huggingface.co/models/shaxpir/prosecraft_resumed_ft2

I get a 503 error telling me that the model is currently loading…

HTTP/2 503
date: Mon, 24 Apr 2023 19:17:42 GMT
content-type: application/json
content-length: 91
x-request-id: JcowSHvjgHSgCiowln3Zm
access-control-allow-credentials: true
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers

{
  "error" : "Model shaxpir/prosecraft_resumed_ft2 is currently loading",
  "estimated_time" : 20.0
}

But the model never seems to fully load, and the “estimated_time” never changes.

Can you help?
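For anyone hitting the same 503, here is a minimal Python sketch of how a client could react to the "currently loading" response: wait for the reported estimated_time and retry a bounded number of times. It assumes the hosted API URL pattern https://api-inference.huggingface.co/models/<model id> and the documented "wait_for_model" option; the function names, the retry count, and the 60-second cap are made up for illustration. It will not fix a model that never finishes loading, but it helps tell a slow cold start from a stuck one.

```python
import json
import time
import urllib.error
import urllib.request

# Assumed hosted Inference API URL for the model discussed in this thread.
API_URL = "https://api-inference.huggingface.co/models/shaxpir/prosecraft_resumed_ft2"


def loading_wait_seconds(status, body_text, cap=60.0):
    """Return seconds to wait if this is a 'model loading' 503, else None."""
    if status != 503:
        return None
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(body, dict) or "loading" not in str(body.get("error", "")):
        return None
    # Fall back to the cap when the server gives no estimate (illustrative choice).
    return min(float(body.get("estimated_time", cap)), cap)


def build_payload(prompt):
    """Request body; 'wait_for_model' asks the API to hold the request while loading."""
    return {
        "inputs": prompt,
        "options": {"wait_for_model": True, "use_cache": False},
        "parameters": {"return_full_text": False, "max_new_tokens": 250},
    }


def query(prompt, token, max_attempts=5):
    """POST the prompt, sleeping and retrying while the model reports it is loading."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    for _ in range(max_attempts):
        request = urllib.request.Request(API_URL, data=data, headers=headers, method="POST")
        try:
            with urllib.request.urlopen(request) as response:
                return json.loads(response.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            wait = loading_wait_seconds(err.code, err.read().decode("utf-8", "replace"))
            if wait is None:
                raise  # Not a loading error; surface it to the caller.
            time.sleep(wait)
    raise TimeoutError(f"Model still loading after {max_attempts} attempts")
```

If query() still raises TimeoutError after several attempts, the model is probably failing to load on the server side rather than just starting cold.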

@radames Can you help?

hi @benjismith , could you make the model public?

Another way to test it is to follow the steps in our public api-inference docs on GitHub (huggingface/api-inference-community).
This is how I’d test your model if the main backend is not responding.

Okay, I made the model public!

I don’t understand the testing steps you referred to… It looks like those steps are only relevant to someone using docker/python to host their own inference api. But I’m using the HF-hosted API. Unless I’m misunderstanding which docs you’re talking about…

Yes, that repository is part of what drives the hosted Inference API, so it’s useful for testing whether there’s something wrong with the model. However, in your case, it seems that the model loading is timing out because of its size (12.1 GB). I’m not sure what the constraints are; perhaps @Narsil could provide more information. One last thing to try would be to convert your model to safetensors, as this can improve loading time, and then attempt to use the hosted API again: Convert to Safetensors (a Hugging Face Space by safetensors).

I’ve been trying to run the safetensor conversion, but I keep getting the error message “This application is too busy. Keep trying!”

Is that expected?

BTW, I’m also interested in potentially setting up Inference Endpoints to host this model when it goes into production… But I don’t know what size GPU instance would be necessary. This is a fine-tuned GPT-J-6B model, whose weights have been converted to FP16.

I’m also working on another model, which will be based on GPT-NeoX-20B, and I have the same question. What size instances will I need when I set up production Inference Endpoints?

I tried the conversion as well, but it also failed. As for Inference Endpoints, cc @philschmid to respond to your questions. P.S. He’s in the CET timezone.

GPT-J-6B runs on 1x T4 using sharding and a custom handler for low memory consumption. Here is an example: philschmid/gpt-j-6B-fp16-sharded

GPT-NeoX works on 4x T4 or a single A100
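As a rough sanity check on those instance sizes, the weights-only memory footprint can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not an official sizing guide: the parameter counts are the nominal 6B and 20B from the model names, 16 GB is the memory of a single T4, and activations, the KV cache, and framework overhead come on top of the figures it prints.

```python
T4_MEMORY_GB = 16  # memory of a single NVIDIA T4


def weight_memory_gb(num_params, bytes_per_param=2):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); FP16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names (assumption, not exact counts).
for name, params in [("GPT-J-6B", 6e9), ("GPT-NeoX-20B", 20e9)]:
    gb = weight_memory_gb(params)
    print(f"{name}: ~{gb:.0f} GB of FP16 weights "
          f"({gb / T4_MEMORY_GB:.2f}x the memory of one T4)")
```

This lines up with the 12.1 GB checkpoint size mentioned earlier for the FP16 GPT-J model, and shows why the 20B model has to be split across several T4s or placed on a larger card.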


Thanks for the info! That’s very helpful!

I’ve clicked that “Convert to Safetensors” button about 50 times since yesterday… Most of the time I get the same “too busy” error. Occasionally, my job goes into the queue but then it times out before running.

Is this the only way to restore the functionality of my model? It was running just fine for over a year and only recently disappeared…

Yes, I guess the hardware behind the conversion Space is running out of memory (OOM). I just tried on a GPU with 16 GB of VRAM and it also ran OOM. If you want to do it yourself and have the hardware, you can clone the repo of the Convert to Safetensors Space and try the conversion locally.
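For reference, a local conversion could also be sketched with transformers directly instead of cloning the Space. This is an untested outline, not the Space's own script: it assumes torch, transformers, and accelerate are installed, that the checkpoint is already downloaded to a local directory, and that the machine has enough system RAM to hold the FP16 weights (roughly 12 GB for this model). The function names and the 2GB shard size are illustrative.

```python
from pathlib import Path


def needs_conversion(model_dir):
    """True if the directory has PyTorch .bin weights but no .safetensors files yet."""
    model_dir = Path(model_dir)
    has_bin = any(model_dir.glob("*.bin"))
    has_safetensors = any(model_dir.glob("*.safetensors"))
    return has_bin and not has_safetensors


def convert_to_safetensors(model_dir, output_dir):
    """Reload a local checkpoint and re-save it in sharded safetensors format."""
    # Imported lazily so needs_conversion() works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,  # keep the weights in FP16, as in the original checkpoint
        low_cpu_mem_usage=True,     # requires accelerate; avoids a second full copy in RAM
    )
    # safe_serialization=True writes .safetensors instead of pickle-based .bin files.
    model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="2GB")
```

The tokenizer and config files would still need to be copied or re-saved alongside the new weights before uploading the result to the Hub.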

Okay. I don’t have access to the hardware I’d need to do a DIY conversion.

Is it no longer possible to just restore the model to whatever state it was in previously, which was performant and stable?