Deploying OpenAI's Whisper on SageMaker

Hi there!

I’m trying to deploy OpenAI’s whisper-large model, using the suggested code snippet from the Hub. The actual deployment of the model succeeds on an ml.m5.xlarge instance, but when I try to invoke the endpoint as follows, I get the error below. What am I doing wrong?

import boto3

aws = boto3.Session(
    region_name="region-goes-here",
    aws_access_key_id="access-key-goes-here",
    aws_secret_access_key="secret-access-key-goes-here",
)

runtime = aws.client('runtime.sagemaker')

with open("some_audio_file.mp3", "rb") as file:
    audio_file = file.read()

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT,
    Body=audio_file,
)
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "\u0027whisper\u0027"
}

Looking at the logs in CloudWatch, I see the following traceback:

2022-10-21T08:12:59,629 [INFO ] W-9000-openai__whisper-large com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 0
2022-10-21T08:12:59,629 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2022-10-21T08:12:59,629 [INFO ] W-9000-openai__whisper-large ACCESS_LOG - /169.254.178.2:33784 "POST /invocations HTTP/1.1" 400 5
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.initialize(context)
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 77, in initialize
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.model = self.load(self.model_dir)
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 104, in load
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     hf_pipeline = get_pipeline(task=os.environ["HF_TASK"], model_dir=model_dir, device=self.device)
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/transformers_utils.py", line 272, in get_pipeline
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 541, in pipeline
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     config = AutoConfig.from_pretrained(model, revision=revision, _from_pipeline=task, **model_kwargs)
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 657, in from_pretrained
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     config_class = CONFIG_MAPPING[config_dict["model_type"]]
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 372, in __getitem__
2022-10-21T08:12:59,630 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise KeyError(key)
2022-10-21T08:12:59,631 [INFO ] W-openai__whisper-large-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - KeyError: 'whisper'

Could it have something to do with the version of the transformers package? I deployed it with version 4.17.0.

Hey @thusken,

Whisper was added in transformers==4.23.1; you need to update to that version and then it should work.

Hi @philschmid ,

Thanks for the reply! However, when I try to deploy to SageMaker with a newer transformers version, I get this error:

Unsupported huggingface version: 4.23.1. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.

Doing a pip install -U sagemaker doesn’t help. Any tips?

Hi @thusken - I assume you're using the SageMaker Hugging Face DLC, is that right? If so, the latest available version for those is currently 4.17, see here.

What you can do, however, is provide a requirements.txt file that lets you specify dependencies to be installed in the container. If you put transformers==4.23.1 into that requirements.txt file, then that transformers version will be installed. See here for more info.

Hope that helps.

Cheers
Heiko

Hey @thusken, I’m also trying to deploy a Whisper-powered app. Did you manage to get it working?

I haven’t had time since last week to look at this, so unfortunately I can’t help you here.

Also, thanks for the reply @marshmellow77! I guess that approach also involves creating a separate inference.py file, as shown in the example? So far I have just deployed the model straight from the Hub.

I see. Thanks for the reply.

To use the latest version of the transformers library in a SageMaker DLC you don’t have to provide a custom inference script, just a requirements.txt file with a line that says transformers==4.23.1. The DLC will then install/update to the specified transformers version.

This example & this documentation show how to do that, hope that helps.


Thanks for the input @marshmellow77 ! I managed to get a few steps further in deploying the model, using the examples you linked. For those interested, here’s my current deployment code snippet:



from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,           # path to your model and script
    role=role,                        # iam role with permissions to create an Endpoint
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',                # python version used
    env={
        'HF_MODEL_ID': 'openai/whisper-large',
        'HF_TASK': 'automatic-speech-recognition'
    }
)


# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p2.xlarge"
)

Here, the s3_location variable contains the location of the model archive:

repository = "openai/whisper-large"
model_id = repository.split("/")[-1]
s3_location = f"s3://{sess.default_bucket()}/custom_inference/{model_id}/model.tar.gz"  # sess is a sagemaker.Session()

This model archive is created by the following bash commands (executed from a SageMaker notebook):

!git lfs install
!git clone https://huggingface.co/$repository
!cp -r code/ $model_id/code/
%cd $model_id
!tar zcvf model.tar.gz *
!aws s3 cp model.tar.gz $s3_location

Note that you need to add a requirements.txt file with transformers==4.23.1 in the code folder after you execute the git clone command for this to work.
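
For reference, my understanding (pieced together from the commands above and the inference toolkit docs, not something spelled out explicitly in this thread) is that the uploaded model.tar.gz should then roughly contain:

config.json
pytorch_model.bin
...                      (remaining tokenizer/preprocessor files from the cloned whisper-large repo)
code/
    requirements.txt     (contains the line: transformers==4.23.1)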

The endpoint is now able to load the Whisper model, which is of course a big step forward, but as of now I'm not yet able to properly call the endpoint. For example, if I load a small audio file and try to run a prediction as follows,

from transformers.pipelines.automatic_speech_recognition import ffmpeg_read

SAMPLING_RATE = 1000
with open("some_audio_file.mp3", "rb") as file:
    audio_file = file.read()
audio_nparray = ffmpeg_read(audio_file, SAMPLING_RATE)

predictor.predict({
    "raw": audio_nparray,
    "sampling_rate": SAMPLING_RATE
})

the following error is raised:

{
  "code": 400,
  "type": "InternalServerException",
  "message": "expected np.ndarray (got list)"
}

Even though audio_nparray is actually a NumPy array:

type(audio_nparray)
numpy.ndarray

@thusken I wasn’t able to reproduce your error message, but I wanted to let you know that your approach of sending a NumPy array to the endpoint is probably not the best idea. The reason is that SageMaker endpoints have a payload size limit of 6 MB:


(from Amazon SageMaker endpoints and quotas - AWS General Reference)

What you could do instead is use a DataSerializer to stream the raw file to the endpoint. You would have to specify the serializer when deploying the model, see here.

@thusken
According to this notebook, you should specify a DataSerializer to serialize the data.

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serializers import DataSerializer

hub = {
    'HF_MODEL_ID': 'openai/whisper-base',
    'HF_TASK': 'automatic-speech-recognition'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
audio_serializer = DataSerializer(content_type='audio/x-audio')
predictor = huggingface_model.deploy(
    initial_instance_count=1,        # number of instances
    instance_type='ml.m5.xlarge',    # ec2 instance type
    serializer=audio_serializer
)

Based on that issue, you can use serializers.

The Hugging Face inference toolkit supports all the transformers pipelines with their default inputs. The toolkit implements several serializers to parse binary data (e.g., audio or images) into the matching format for the transformers pipeline (e.g., PIL images or NumPy arrays).

So, for inference, the code below worked in my environment.

audio_path = "sample1.flac"

res = predictor.predict(data=audio_path)
print(res)
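
For completeness: if you later need to call the endpoint without the predictor object (e.g. from another application), I believe a rough boto3 equivalent would look like the untested sketch below. The endpoint name and file name are placeholders, and the content type simply mirrors the DataSerializer(content_type='audio/x-audio') used at deployment.

import boto3

# Untested sketch: invoke the deployed endpoint via the SageMaker runtime client.
# "whisper-endpoint-name" and "sample1.flac" are placeholders.
runtime = boto3.client("sagemaker-runtime")

with open("sample1.flac", "rb") as f:
    audio_bytes = f.read()

response = runtime.invoke_endpoint(
    EndpointName="whisper-endpoint-name",
    ContentType="audio/x-audio",   # matches DataSerializer(content_type='audio/x-audio')
    Body=audio_bytes,
)
print(response["Body"].read().decode("utf-8"))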

Great example @sohoha , thanks for sharing :+1:


Sorry for the late reply, but this indeed gets it working! Thanks for the useful information!


@sohoha Thanks a lot for all your guidance. I tried the same approach, but unfortunately it is failing with the error "An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (413) from primary and could not load the entire response body.". I then looked into CloudWatch and saw the error below.
I did more analysis and don't see any code folder in the path $model_id/code/. Does this mean the repo structure has changed now?
@thusken did you face a similar error?

2022-11-25T13:30:27,599 [INFO ] W-openai__whisper-base-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise KeyError(key)
2022-11-25T13:30:27,599 [INFO ] W-openai__whisper-base-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - KeyError: 'whisper'

@amitkayal, this error usually appears if you use a transformers version that doesn’t support the Whisper model, see also the beginning of this thread.

Can you confirm you upgraded the transformers library to a version that supports Whisper (see also this post: Deploying OpenAI's Whisper on SageMaker - #9 by marshmellow77)?


@marshmellow77 @thusken
Yes, I am using transformers version 4.23.1 in my requirements.txt file. Do I also need an inference.py file? I believe we don't need that.

The log has this error:

ModelError                                Traceback (most recent call last)
/tmp/ipykernel_11082/3757015611.py in <module>
----> 1 res = predictor.predict(data=audio_path)
      2 print(res)

~/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
    159             data, initial_args, target_model, target_variant, inference_id
    160         )
--> 161         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    162         return self._handle_response(response)
    163

~/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    528             )
    529             # The "self" in this scope is referring to the BaseClient.
--> 530             return self._make_api_call(operation_name, kwargs)
    531
    532         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    958             error_code = parsed_response.get("Error", {}).get("Code")
    959             error_class = self.exceptions.from_code(error_code)
--> 960             raise error_class(parsed_response, operation_name)
    961         else:
    962             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (413) from primary and could not load the entire response body. See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/vanila-openai-whisper-tiny-2022-11-25-15-29-02 in account 015267333920 for more information.

However, I could not use transformers version 4.23.1 in the following HuggingFaceModel code, as it throws the error "Unsupported huggingface version: 4.23.1. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17". Do you think this is causing the issue?

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serializers import DataSerializer

hub = {
    'HF_MODEL_ID': 'openai/whisper-base',
    'HF_TASK': 'automatic-speech-recognition'
}

huggingface_model = HuggingFaceModel(
    env=hub,                          # configuration for loading model from Hub
    role=role,                        # iam role with permissions to create an Endpoint
    model_data=s3_location,
    transformers_version="4.23.1",    # transformers version used
    pytorch_version="1.10.2",         # pytorch version used
    py_version='py38',                # python version used
)

Thanks

I found the issue; the root cause is that the input size was more than 6 MB, which is why this happened. How can I work around this? Is batch transform the only way to avoid the issue? I also wanted to know how I can call this endpoint later, since at that point I will not have the predictor object available.

Thanks

Sorry for the late response. If you write a custom inference.py, you can still use a real-time endpoint: at inference time, you upload your audio file to S3 and pass its path to the endpoint. In inference.py you then implement the logic to download the specified file from S3 and run speech recognition on it.

Or, you can split your audio file into chunks of less than 6 MB and merge the results after inference.
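
In case it helps, here is a minimal, untested sketch of what such a code/inference.py could look like, assuming the inference toolkit's model_fn/transform_fn override hooks; the JSON payload shape ({"s3_uri": ...}) is my own assumption, not something from this thread:

import json
import os
import tempfile

import boto3
from transformers import pipeline


def model_fn(model_dir):
    # load the Whisper model that was packaged into model.tar.gz
    return pipeline("automatic-speech-recognition", model=model_dir)


def transform_fn(model, input_data, content_type, accept):
    # expect a small JSON payload such as {"s3_uri": "s3://my-bucket/audio/file.mp3"}
    payload = json.loads(input_data)
    bucket, _, key = payload["s3_uri"].replace("s3://", "").partition("/")

    # download the (possibly large) audio file from S3 instead of pushing it
    # through the 6 MB invocation payload limit
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(key))
        boto3.client("s3").download_file(bucket, key, local_path)
        result = model(local_path)

    return json.dumps(result)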


Hello @marshmellow77 @sohoha
I tried replicating this approach, but every time I upload a video or audio file of less than 6 MB and about one minute in length, it only outputs a transcription of the first 30 seconds. Can we vary this 30-second limitation, at least for inputs up to 6 MB?
Hoping to hear back soon.
Thanks