503 No worker is available when calling single huggingface endpoint

I raised a previous issue here, where it was suggested that the issue was caused by how SageMaker pipelines work. However, I’m now calling a single endpoint directly and getting the same error (I created a separate post for it since the other deals mostly with pipelines, and this one is focused on the endpoint). I am making 200 requests at about 50-60 requests per second and the error is triggered. Using multiple instances reduces it, but it still appears occasionally. Average latency is about 1-2 seconds, with a maximum of 2.1 seconds. I’ve also tried using a larger instance, up to ml.m5.12xlarge, with the same result.

Below is my sagemaker deployment code, and the entrypoint script.

@philschmid you mentioned you haven’t seen this error before. But I am seeing this everywhere, and at not very high request loads directly to the endpoint. This makes me think there must be some underlying issue somewhere, but I am doing a pretty basic model deployment. Could there be some issue with the model.tar.gz archive itself? Unless I’m doing something I shouldn’t in the inference.py script below, I really don’t know what’s causing this. Any help is appreciated. Thanks!

from sagemaker.huggingface import HuggingFaceModel

# emb_name, role, and model_name are defined elsewhere in the deployment script
model = HuggingFaceModel(
    transformers_version="4.6",  # transformers version used
    pytorch_version="1.7",  # pytorch version used
    py_version='py36',  # python version used
    entry_point='embed_source/inference.py',
    model_server_workers=4,
    name=emb_name,
    role=role,
)

model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge',
             endpoint_name=model_name, wait=True)

I have also tested using the latest container version

transformers_version="4.12.3", # transformers version used
pytorch_version="1.9.1", # pytorch version used

with the same results.


import subprocess
import sys
import json
import os
import numpy as np
import torch
import boto3
from transformers import AutoModel, AutoTokenizer, AutoModelForMaskedLM
from importlib import reload    
print('\nboto3 loaded')    

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def forward_pass(batch, model):
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)

    with torch.no_grad():
        last_hidden_state = model(input_ids, attention_mask).last_hidden_state
        last_hidden_state = last_hidden_state.cpu().numpy()

    # Use average of unmasked hidden states for classification
    lhs_shape = last_hidden_state.shape
    boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
    boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
    boolean_mask = boolean_mask.reshape(lhs_shape)
    masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
    batch["hidden_state"] = masked_mean.data
    return batch

def preprocess_function(examples):
    print('attempting to tokenize:', examples)
    if examples['type'] == 'incomeType':
        my_type = 'income'
    else:
        my_type = 'expense'
    print('my_type', my_type)    
    t = tokenizer([examples[my_type]["description"]], truncation=True)
    t['description'] = examples[my_type]["description"]
    t['date'] = examples[my_type]["date"]
    t['amount'] = examples[my_type]["amount"]
    t['type'] = examples['type']
    print('t', t)
    return t

print('\nos.getcwd()', os.getcwd())
print('\nModel Walk:')
for path, subdirs, files in os.walk('/opt/ml'): 
    for name in files: print(os.path.join(path, name))
tokenizer = AutoTokenizer.from_pretrained('/opt/ml/model')

def model_fn(model_dir):
    print('\nIn model_fn')
    print('\nmodel_dir:', model_dir)
    model = AutoModel.from_pretrained('/opt/ml/model',  output_hidden_states=True).to(device)
    return model

print('\ninference directory', os.listdir(os.curdir))


for path, subdirs, files in os.walk('/opt/ml'): 
    for name in files: print(os.path.join(path, name))

def input_fn(data, content_type):
    print('\nin data', data, content_type, type(data))
    request = json.loads(data)
    # preprocess dataset
    print('attempting preprocess')
    print('request', request)
    response = preprocess_function(request)
    print('response', response)
    print('\nfwd pass')
    return response

def predict_fn(data, model):
    print('\nin predict:', data)
    res2 = forward_pass(data, model)
    return res2

def output_fn(prediction, accept):
    print('\nin output',  type(prediction))
    # ... (remaining fields omitted)
    j = [{"description": prediction.description,
          "type": prediction.type}]
    return json.dumps(j)

Hey @MaximusDecimusMeridi,

Thanks for testing and opening another thread. I’ll try to reproduce this on my side. Could you share the script you use to send 50-60 requests per second?

I was able to reproduce the error using Locust and a large model deployment: the model bert-large-uncased-whole-word-masking-finetuned-squad, load-tested with Locust [Gist with code snippet for locust].

For me, the error started to appear at 195 concurrent requests per second. The reason could be that the buffer holding requests runs full and MMS (the inference service) stops accepting new ones. But if this is the case, it should be documented. It is also interesting that nobody has raised this before: either it is less likely to happen on GPU, with lower latency and higher throughput, or people use autoscaling to avoid it.

Locust looks great - thanks for sharing. Will give it a try.

I was just using multiprocessing. The plan was to use larger instances to be able to send requests at a higher frequency. But I was able to get close to 200 on an xlarge notebook instance with 8 CPUs, so I just let it keep creating new processes as the others close out.

from datetime import datetime
from multiprocessing import Process, current_process

procs = []
start = datetime.now()

r = 1000
for index, number in enumerate(range(r)):
    # stress_test makes the invoke lambda or endpoint request
    proc = Process(target=stress_test, args=(payload, number, start, r))
    procs.append(proc)
    proc.start()

# wait for all processes to finish before reporting the elapsed time
for proc in procs:
    proc.join()
print('end', datetime.now() - start)
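
The stress_test function itself is not shown in the thread. A hypothetical sketch of what it could look like is below, assuming the endpoint is invoked through the boto3 sagemaker-runtime client. The endpoint name "my-endpoint", the helper make_endpoint_invoker, and the invoke parameter are all made up for illustration; the extra parameter only exists so the function can be dry-run without AWS access.

```python
import json
import time
from datetime import datetime


def make_endpoint_invoker(endpoint_name):
    """Return a callable that invokes the endpoint and returns the HTTP status code."""
    import boto3  # imported lazily so the sketch can be dry-run without AWS

    client = boto3.client("sagemaker-runtime")

    def invoke(body):
        response = client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=body,
        )
        return response["ResponseMetadata"]["HTTPStatusCode"]

    return invoke


def stress_test(payload, number, start, total, invoke=None):
    """Send one request and report its status and latency.

    `invoke` is any callable that takes the JSON body and returns a status
    code. If omitted, a real endpoint is called (placeholder endpoint name).
    """
    if invoke is None:
        invoke = make_endpoint_invoker("my-endpoint")  # hypothetical name
    body = json.dumps(payload)
    t0 = time.perf_counter()
    try:
        status = invoke(body)
    except Exception as exc:  # e.g. a 503 surfaced as a client error
        status = f"error: {exc}"
    latency = time.perf_counter() - t0
    print(f"{number + 1}/{total} status={status} latency={latency:.3f}s "
          f"elapsed={datetime.now() - start}")
    return status, latency


# Dry run with a fake invoker, no AWS access needed
status, latency = stress_test({"type": "incomeType"}, 0, datetime.now(), 1,
                              invoke=lambda body: 200)
```

Catching the exception instead of letting the process die keeps the failed requests visible in the output, which is what matters when counting how many 503s a burst produces.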

Hey @MaximusDecimusMeridi,

I reached out to AWS to get more information and got the following response:

Confirmed it is what you suspected. We’ll update the documentation. There’s a “job_queue_size” which defaults to 100 (see configuration.md on the master branch of awslabs/multi-model-server on GitHub).
Once the queue is full, requests are dropped. This can be solved by using more instances (or a bigger instance, but the effect may be more limited if the model latency doesn’t benefit from scaling up). The property above can also be updated so there’s a bigger buffer, but that would only be helpful for unsustained bursts of traffic.
It might be possible to update the “default_workers_per_model” property to help with throughput as well, but I haven’t heard back from the MMS team on this part yet.
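
The behaviour described in the response can be illustrated with a toy model. This is a minimal sketch and not MMS code: it assumes a single bounded queue with the default capacity of 100 and a burst that arrives before any worker has drained a request. The names simulate_burst and queue_capacity are made up for illustration.

```python
import queue


def simulate_burst(num_requests, queue_capacity=100):
    """Toy model of a burst hitting a bounded job queue (not actual MMS code).

    Assumes every request arrives before a worker has drained anything,
    which is the worst case for a short burst.
    Returns (accepted, rejected) counts.
    """
    jobs = queue.Queue(maxsize=queue_capacity)
    accepted = rejected = 0
    for request_id in range(num_requests):
        try:
            jobs.put_nowait(request_id)  # enqueue without blocking
            accepted += 1
        except queue.Full:
            rejected += 1  # a real server would answer with an error such as 503
    return accepted, rejected


accepted, rejected = simulate_burst(num_requests=200, queue_capacity=100)
print(f"capacity=100: accepted={accepted} rejected={rejected}")

accepted, rejected = simulate_burst(num_requests=200, queue_capacity=400)
print(f"capacity=400: accepted={accepted} rejected={rejected}")
```

In this toy model a larger queue absorbs the whole burst. Under sustained overload it would only delay the point at which requests are dropped, which matches the caveat in the response.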

In your case is the workload a burst or would you have a consistent load?

Wow, this is perfect. It’s exactly bursts of requests that are my problem. Our overall load is fairly low, and sustained increases in request load we can handle with autoscaling. The service API that makes the prediction requests is rate limited, but it seems it can occasionally violate that rate limit in short request bursts, which knocks out the endpoint. I believe the issue may be that the rate limit is set as an average over time, but it’s not something we have control over (we are working to get that looked into). As the AWS response says, the only way to address these occasional bursts right now is to increase the base instance count to a number that can handle these peaks, so directly addressing request bursts would mean significant cost savings for us.

Can also confirm that setting “default_workers_per_model” helps. I am currently setting it to 4 as it would default to 1, but I have not experimented with it further. I’d be interested to hear what the MMS team says about it.

“The property above can also be updated so there’s a bigger buffer but that would only be helpful for unsustained bursts of traffic.”

Is this something I need to set or can set on my end, or would this be an update to the sagemaker transformers image?

Thanks again for following through with this issue!

@philschmid did you hear anything back on this from MMS team? How would I update job_queue_size property when deploying a HuggingFaceModel? I tried passing it as a parameter but it didn’t work. Thanks!

@MaximusDecimusMeridi I reached out again to see what we can do.

@MaximusDecimusMeridi I got a response and hopefully a solution.

The HF DLC container uses the sagemaker-huggingface-inference-toolkit to start MMS. When it starts MMS, it uses the default config.properties at: link (HF toolkit consumes sagemaker-inference-toolkit).
This config sets enable_envvars_config=true, which means that at MMS start-time (endpoint creation), a user can supply environment variables to SageMaker when defining the PrimaryContainer in the sm_client.create_model() call. The supported config options are: link. Perhaps you can try setting this and check if it helps.

This means you should be able to provide the env key in the HuggingFaceModel with


to adjust the queue to 400 instead of 100.
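
A rough sketch of what passing the queue size through env could look like. The environment variable name MMS_JOB_QUEUE_SIZE is an assumption based on how MMS maps its configuration properties to environment variables; it is not confirmed in this thread. The helper build_mms_env is hypothetical, and the HuggingFaceModel call is left as a comment because it needs AWS credentials; emb_name and role come from the deployment code earlier in the thread.

```python
def build_mms_env(job_queue_size=400):
    """Build the env dict to hand to HuggingFaceModel(env=...).

    ASSUMPTION: MMS reads the queue size from MMS_JOB_QUEUE_SIZE.
    Environment variable values are passed as strings.
    """
    if job_queue_size < 1:
        raise ValueError("job_queue_size must be positive")
    return {"MMS_JOB_QUEUE_SIZE": str(job_queue_size)}


env = build_mms_env(400)
print(env)

# Hypothetical usage, mirroring the deployment code earlier in the thread:
# from sagemaker.huggingface import HuggingFaceModel
# model = HuggingFaceModel(
#     transformers_version="4.6",
#     pytorch_version="1.7",
#     py_version="py36",
#     entry_point="embed_source/inference.py",
#     model_server_workers=4,
#     name=emb_name,
#     role=role,
#     env=env,
# )
```

Whether the container actually honours the variable depends on its MMS configuration, so it is worth checking the endpoint logs after deployment.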

@philschmid awesome, I will give this a try and report back. Thanks for the followup :pray:

Hey @MaximusDecimusMeridi,

I tested it without success; we need to modify the configuration. I already opened a PR: [huggingface] Enable overwritable MMS parameter by philschmid · Pull Request #1806 · aws/deep-learning-containers · GitHub
This will then be supported in the next release.

@philschmid wonderful, thanks.