Inference API stopped working for my model

benjismith · April 24, 2023, 7:26pm

I have a pinned model (shaxpir/prosecraft_resumed_ft2) on the inference API that has been working well for over a year, but it recently stopped working… When I make a request like this:

curl -i -X POST https://api-inference.huggingface.co/models/shaxpir/prosecraft_resumed_ft2 \
     -H "Authorization: Bearer <REDACTED>" \
     -H "Content-Type: application/json" \
     -d \
     '{
          "inputs":"Once upon a time,",
          "options":{
            "use_gpu": true,
            "use_cache": false
          },
          "parameters": {
            "return_full_text": false,
            "num_return_sequences": 1,
            "temperature": 1.0,
            "top_p" : 0.9,
            "max_new_tokens": 250
          }
     }'

I get a 503 error telling me that the model is currently loading…

HTTP/2 503
date: Mon, 24 Apr 2023 19:17:42 GMT
content-type: application/json
content-length: 91
x-request-id: JcowSHvjgHSgCiowln3Zm
access-control-allow-credentials: true
vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers

{
  "error" : "Model shaxpir/prosecraft_resumed_ft2 is currently loading",
  "estimated_time" : 20.0
}

But the model never seems to fully load, and the “estimated_time” never changes.

Can you help?

benjismith · April 25, 2023, 6:29pm

@radames Can you help?

radames · April 25, 2023, 6:38pm

hi @benjismith , could you make the model public?

Another way to test it, is to run the steps on our public api-inference docs GitHub - huggingface/api-inference-community
This is how I’d test your model if the main backend is not responding.

benjismith · April 25, 2023, 7:50pm

Okay, I made the model public!

I don’t understand the testing steps you referred to… It looks like those steps are only relevant to someone using docker/python to host their own inference api. But I’m using the HF-hosted API. Unless I’m misunderstanding which docs you’re talking about…

radames · April 25, 2023, 8:33pm

Yes, that repository is part of what drives the hosted API inference, so it’s useful for testing if there’s something wrong with the model. However, in your case, it seems that the model loading is timing out due to its size of 12.1 GB. I’m not sure what the constraints are, perhaps @Narsil could provide more information. One last thing to try would be to convert your model to safetensors, as this can improve loading time, and then attempt use the hosted API. Convert to Safetensors - a Hugging Face Space by safetensors

benjismith · April 25, 2023, 9:24pm

I’ve been trying to run the safetensor conversion, but I keep getting the error message “This application is too busy. Keep trying!”

Is that expected?

benjismith · April 25, 2023, 9:29pm

BTW, I’m also interested in potentially setting up Inference Endpoints to host this model when it goes into production… But I don’t know what size GPU instance would be necessary. This is a fine-tuned GPT-J-6B model, whose weights have been converted to FP16.

I’m also working on another model which will be based on GPT-NeoX-20B, and I have the same question. What size instances will I need when I setup production Inference Endpoints?

radames · April 25, 2023, 9:44pm

I was trying the conversion as well but it also failed. In terms on Inference Endpoints cc @philschmid to respond to your questions, ps he’s on CET timezone.

philschmid · April 26, 2023, 6:15am

GPT-J6B runs on a 1x T4 using sharding and a custom handler for low memory consumptions, here is an example: handler.py · philschmid/gpt-j-6B-fp16-sharded at main

GPT-NeoX works on 4x T4 or a single A100

benjismith · April 26, 2023, 6:09pm

Thanks for the info! That’s very helpful!

benjismith · April 26, 2023, 6:11pm

I’ve clicked that “Convert to Safetensors” button about 50 times since yesterday… Most of the time I get the same “too busy” error. Occasionally, my job goes into the queue but then it times out before running.

Is this the only way to restore the functionality of my model? It was running just fine for over a year and only recently disappeared…

radames · April 26, 2023, 6:57pm

Yes, I guess the convert hardware it’s running OOM, I just tried on a GPU with 16GB VRAM and it also ran OOM, if you want to diy and have a hardware you can clone this repo Convert to Safetensors - a Hugging Face Space by safetensors and tried the conversion locally

Topic		Replies	Views
Cannot run large models using API token Inference Endpoints on the Hub	5	7304	February 22, 2024
Error executing pinned inference model 🤗Hub	18	3780	December 10, 2021
Inference API stopped working Inference Endpoints on the Hub	50	4637	June 8, 2025
My model is doesnt seem to load in Inference API Beginners	0	487	July 24, 2022
Inference API Widget wont stop loading for my private model Community Calls	0	267	December 6, 2023

Inference API stopped working for my model

Related topics