Hello @MaximusDecimusMeridi,
by default, the Hugging Face Inference DLC starts as many workers as the instance has CPU cores, meaning on an m5n.xlarge
instance you have 4 workers.
Regarding the error you see:
- Are you using Multi-Model Endpoints?
- What was the memory utilization?
- How long does the request take? → It's possible that all workers were blocked, either by long-running inference or by a deadlock inside your code, and never finished, so no new requests could be accepted.
- Could you try updating to the latest image? Reference
- “During feature extraction another endpoint is being called for generating text embeddings” → does this mean the endpoint which returned the 503 calls another endpoint? (I couldn’t find anything like that in the script.) If that’s true, then point 3 might be the reason, since you would block the worker until the inner request is resolved, and generation can take quite long.
P.S. Feel free to share the full architecture of what you’re doing. Happy to help improve it and solve those bottlenecks with a more async approach.
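To illustrate the blocking issue from point 3: if each worker synchronously waits on an inner endpoint call, it can't serve anything else in the meantime, and with all workers waiting you get 503s. A minimal sketch of the async alternative, using `asyncio.sleep` as a hypothetical stand-in for the inner embedding/generation call (names and timings are assumptions, not your actual code):

```python
import asyncio
import time

# Hypothetical stand-in for the slow inner endpoint call
# (e.g. the text-embedding endpoint); sleeps instead of doing HTTP.
async def call_inner_endpoint(payload: str) -> str:
    await asyncio.sleep(0.2)  # simulated network + generation latency
    return f"embedding({payload})"

async def handle_batch(payloads):
    # Fire all inner calls concurrently instead of blocking
    # one worker per call until each response comes back.
    return await asyncio.gather(*(call_inner_endpoint(p) for p in payloads))

start = time.perf_counter()
results = asyncio.run(handle_batch(["a", "b", "c", "d"]))
elapsed = time.perf_counter() - start
print(results)
print(f"{elapsed:.2f}s")  # ~0.2s total instead of ~0.8s sequential
```

With real HTTP you'd use an async client (e.g. `aiohttp` or `httpx`) the same way; the point is that the worker's event loop stays free while the inner request is in flight.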