Payload too large for Async Inference on SageMaker

To transcribe audio files with the Whisper model, I assumed that the async inference option on AWS SageMaker would be the right choice for long audio files (1 hour, around 5-50 MB).

According to the docs, payload sizes of up to 1 GB should be possible.

I followed Philipp Schmid's article here

but I get the following error, which surprises me since my payload is only around 11 MB:

Received client error (413) from primary and could not load the entire response body

from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.s3 import s3_path_join

# Hub model configuration (role and sagemaker_session_bucket are defined elsewhere)
hub = {
    'HF_MODEL_ID': 'openai/whisper-base',
    'HF_TASK': 'automatic-speech-recognition'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,  # configuration for loading model from Hub
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.26",  # transformers version used
    pytorch_version="1.13",  # pytorch version used
    py_version='py39',  # python version used
)

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join("s3://", sagemaker_session_bucket, "async_inference/output"),
    # Where our results will be stored
    # notification_config={
    #   "SuccessTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
    #   "ErrorTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
    # }, #  Notification configuration
)

# deploy the endpoint
async_endpoint = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # ml.g4dn.xlarge,
    async_inference_config=async_config,
)

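import boto3
import sagemaker
from sagemaker.async_inference.waiter_config import WaiterConfig
from sagemaker.huggingface import HuggingFacePredictor
from sagemaker.predictor_async import AsyncPredictor


def predict():
    session = boto3.session.Session()
    sagemaker_session = sagemaker.Session(session)

    # endpoint_name and audio_path are defined elsewhere
    predictor = HuggingFacePredictor(endpoint_name=endpoint_name,
                                     sagemaker_session=sagemaker_session)
    async_predictor = AsyncPredictor(predictor)

    ASYNC_S3_PATH = "s3://async-inf/async-distilbert"

    with open(audio_path, "rb") as data_file:
        audio_data = data_file.read()

        data = {
            "s3_file": "s3://async-inf/async-distilbert"
            # "language": "pl"
        }
        res = async_predictor.predict_async(input_path="s3://async-inf/async-distilbert")
        # res = async_predictor.predict_async(data=audio_data, input_path=ASYNC_S3_PATH)
        config = WaiterConfig(
            max_attempts=5,  # number of attempts
            delay=10  # time in seconds to wait between attempts
        )
        return res.get_result(config)  # poll S3 until the result is available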

@philschmid Any idea how to post large payloads to the async endpoint?
Anyhow, thanks a lot for your tireless support. Very much appreciated.

Hey @philhd,

There might be a minor error in the inference code: it seems that "s3://async-inf/async-distilbert" points not to a “file” but only to a “directory”.

Hey @philschmid, thanks for the swift reply. s3://async-inf/async-distilbert is actually the file; I had left off the file extension. I renamed it in the bucket and updated the code accordingly to input_path="s3://async-inf/async-distilbert.mp3", but nothing changed.

Have you tried creating a custom script to log some information, e.g. whether the data file gets passed into the handler correctly?

I got it up and running by doing it slightly differently:

import boto3


def infer_async():
    sagemaker_runtime = boto3.client("sagemaker-runtime")

    # Specify the location of the input. Should be JSON with the input audio file (example in 02_deploy_whisper-Async.ipynb notebook)
    input_location = "s3://async-inf/input.json"

    # The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
    # (endpoint_name is defined elsewhere)

    # After you deploy a model using SageMaker hosting
    # services, your client applications use this API to get inferences
    # from the model hosted at the specified endpoint.
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=input_location,
        # ContentType='audio/mpeg',
    )
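
Since invoke_endpoint_async returns immediately, the transcription still has to be fetched from the OutputLocation in the response once the endpoint has written it. A minimal sketch of polling for that object, assuming a boto3 S3 client is passed in; the function name wait_for_output and its defaults are hypothetical and only mirror the attempts/delay idea of WaiterConfig:

```python
import time
from urllib.parse import urlparse


def wait_for_output(s3_client, output_location, attempts=5, delay=10.0):
    """Poll S3 until the async result object exists, then return its bytes."""
    parsed = urlparse(output_location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    for attempt in range(attempts):
        try:
            return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet: wait before the next attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise TimeoutError(f"no result at {output_location} after {attempts} attempts")
```

It could be called as wait_for_output(boto3.client("s3"), response["OutputLocation"]). Note that without s3:ListBucket permission a missing object may surface as an access error rather than NoSuchKey.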

What's the structure of input.json? I get an error saying “No such file or directory: 's3://”

It depends on your inference script. You can try:

{"s3_location": "path_to_s3"}

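The “No such file or directory” error may indicate that the handler received the S3 URI and tried to open it like a local path. A minimal sketch of how input.json could be built and how a custom inference script might resolve the "s3_location" key, assuming the Hugging Face inference toolkit's predict_fn(data, model) hook and that the model is a transformers pipeline accepting a local file path; the helper names and the object key used in the usage note are hypothetical:

```python
import json
from urllib.parse import urlparse


def build_input_json(audio_s3_uri: str) -> str:
    """Serialize the request body that gets uploaded as input.json."""
    return json.dumps({"s3_location": audio_s3_uri})


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key); reject incomplete URIs."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a full S3 object URI: {uri!r}")
    return parsed.netloc, key


def predict_fn(data, model):
    """Download the referenced audio file, then run the pipeline on it."""
    import boto3  # imported lazily so the helpers above work without AWS access

    bucket, key = split_s3_uri(data["s3_location"])
    local_path = "/tmp/" + key.rsplit("/", 1)[-1]
    boto3.client("s3").download_file(bucket, key, local_path)
    return model(local_path)
```

For example, build_input_json("s3://async-inf/audio/sample.mp3") produces the JSON body to upload as input.json, and split_s3_uri raises a ValueError for a bare "s3://" instead of passing it on as a file name.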
For AsyncInference there is another very important configuration required to prevent the 413 error.

env = {
    'MMS_MAX_REQUEST_SIZE': '2000000000',
    'MMS_MAX_RESPONSE_SIZE': '2000000000',
}

HuggingFaceModel(env=env …)

It would be nice to have this mentioned in the documentation.
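
A minimal sketch of how the two MMS settings above could be merged with the Hub configuration from the first post. The helper names are hypothetical, and the default limit of 6,553,500 bytes is an assumption taken from the Multi Model Server configuration docs; if it holds, it would explain why an 11 MB payload is rejected with a 413 even though async inference itself allows up to 1 GB:

```python
# Assumed default request limit of Multi Model Server (MMS), in bytes.
MMS_DEFAULT_MAX_REQUEST_SIZE = 6_553_500


def build_env(max_payload_bytes: int) -> dict:
    """Merge the Hub settings with raised MMS request/response limits."""
    hub = {
        "HF_MODEL_ID": "openai/whisper-base",
        "HF_TASK": "automatic-speech-recognition",
    }
    limit = str(max_payload_bytes)  # container env values must be strings
    return {**hub, "MMS_MAX_REQUEST_SIZE": limit, "MMS_MAX_RESPONSE_SIZE": limit}


def exceeds_default_limit(payload_bytes: int) -> bool:
    """True if a payload is larger than the assumed MMS default limit."""
    return payload_bytes > MMS_DEFAULT_MAX_REQUEST_SIZE


env = build_env(2_000_000_000)  # value used in the post above
# HuggingFaceModel(env=env, ...) would then receive these settings.
```

With this, exceeds_default_limit(11 * 1024 * 1024) is true, so the limits need to be raised before deploying the endpoint.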