Getting "No worker is available to serve request: model" with HuggingFaceModel endpoint

Hello @MaximusDecimusMeridi,

By default, the Hugging Face Inference DLC starts as many workers as the instance has CPU cores, meaning on an m5n.xlarge instance you get 4 workers.
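
If you want to control that yourself, you can usually set it at model creation time. Here is a minimal sketch, assuming the DLC honors the SageMaker Inference Toolkit's `SAGEMAKER_MODEL_SERVER_WORKERS` variable; the model artifact location, role ARN, and framework versions below are placeholders, not your actual setup:

```python
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",          # assumed artifact location
    role="arn:aws:iam::123456789012:role/my-sm-role",  # assumed role ARN
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={
        # Assumption: the model server reads this Inference Toolkit variable;
        # without it, it defaults to one worker per vCPU (4 on m5n.xlarge).
        "SAGEMAKER_MODEL_SERVER_WORKERS": "2",
    },
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5n.xlarge",
)
```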

Regarding the error you see:

  • Are you using Multi-Model Endpoints?
  • What was the memory utilization?
  • How long does the request take? → It's possible that all workers were blocked, either by long-running inference or by a deadlock inside your code, and never finished, so the endpoint couldn't accept new requests.
  • Could you try updating to the latest image? Reference
  • “During feature extraction another endpoint is being called for generating text embeddings” → does this mean the endpoint which returned the 503 calls another endpoint? (I couldn't find anything like that in the script.) If that's true, then point 3 might be the reason, since you would block the worker until the inner request is resolved, and generation can take quite long (see the sketch below).
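
To make that concrete, here is a sketch of the blocking pattern, not your actual script; the endpoint name and the feature-extraction helper are assumptions:

```python
# inference.py — sketch of the nested-endpoint-call pattern (names are assumptions)
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def predict_fn(data, model):
    # Hypothetical local feature-extraction step
    features = model.extract_features(data)

    # Synchronous call to a second endpoint: this worker is held for the
    # full duration of the inner request, including generation latency.
    # With 4 workers, 4 slow inner calls in flight are enough to trigger
    # "No worker is available to serve request: model" for new traffic.
    response = runtime.invoke_endpoint(
        EndpointName="text-embedding-endpoint",  # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": features}),
    )
    embeddings = json.loads(response["Body"].read())
    return {"features": features, "embeddings": embeddings}
```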

P.S. Feel free to share a proper overview of your architecture. I'd be happy to help improve it and solve those bottlenecks with a more async approach, for example something like the sketch below.
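
One option would be SageMaker Asynchronous Inference, which queues requests instead of failing when all workers are busy. A minimal sketch, with the bucket, role, and versions again as placeholders:

```python
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.huggingface import HuggingFaceModel

# Placeholder model definition — reuse your existing HuggingFaceModel here
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/model.tar.gz",          # assumed artifact location
    role="arn:aws:iam::123456789012:role/my-sm-role",  # assumed role ARN
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Requests are queued and results written to S3, so a burst of slow
# generations no longer surfaces as 503s on the caller side.
async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-results/",  # assumed output bucket
    max_concurrent_invocations_per_instance=4,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5n.xlarge",
    async_inference_config=async_config,
)
```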