Errors while running a SageMaker batch transform (inference) job

Background:

I am working with a model I fine-tuned for a multi-class classification problem (distilbert-base-uncased was my base model). I am trying to use my model in a SageMaker batch transform (inference) job.

My data for inference was uploaded to S3 as a JSONL file and looks like this…

{"inputs":"...Some long text string, likely over 512 tokens after tokenization....","parameters":{"return_all_scores":true,"truncation":true,"max_length":512}}
{"inputs":"...Another long text string, likely over 512 tokens after tokenization....","parameters":{"return_all_scores":true,"truncation":true,"max_length":512}}

Note, the parameters were chosen because I need to truncate tokenized input strings that are longer than 512 tokens (per the requirements of my model) and because I want to return the prediction probability for all classes.
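
A file in this shape could be produced roughly as follows. This is only a sketch: the `build_jsonl` helper and the placeholder texts are hypothetical and not part of the original job; only the parameter values are taken from the records above.

```python
import json

# Parameters copied from the example records above.
PARAMETERS = {"return_all_scores": True, "truncation": True, "max_length": 512}


def build_jsonl(texts):
    """Return one JSON object per line, each pairing a text with PARAMETERS."""
    return "\n".join(
        json.dumps({"inputs": text, "parameters": PARAMETERS}) for text in texts
    )


# Hypothetical placeholder texts; the real inputs would be the long documents.
jsonl_body = build_jsonl(["first example text", "second example text"])
```

Each line then parses on its own as a JSON object, which is what split_type='Line' relies on.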

My Code for initiating the batch transform job looks like this…

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data=f"s3://{s3_training_job_model}/model.tar.gz",  # path to the trained model
    role=role,
    transformers_version='4.17',
    tensorflow_version='2.6',
    py_version="py38",
    env={'HF_TASK': 'text-classification'})

batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge", 
    strategy='SingleRecord',
    output_path = f's3://{s3_training_job_data}',
    accept='application/json')

batch_job.transform(
    data=s3_test_data_uri, # s3 path to my data for inference
    content_type='application/json',
    split_type='Line')

My Problem:

According to SageMaker, my batch transform job failed (the Status in SageMaker shows “Failed”).

I also get an error in my sagemaker notebook which halts my code.

ODDLY, running this code generates a .out file with the predictions I would expect in the S3 location specified in output_path. It seems like my code is working, but I just can't escape these pesky error messages.

My Error looks like this…

2023-05-04T13:47:48.177:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=SINGLE_RECORD
2023-05-04T13:47:53.326:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out: ClientError: 400
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out: 
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out: Message:
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out: {
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out:   "code": 400,
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out:   "type": "InternalServerException",
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out:   "message": "Extra data: line 1 column 46 (char 45)"
2023-05-04T13:47:53.327:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out: }
2023-05-04T13:47:53.342:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out: ClientError: 400
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out: 
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out: Message:
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out: {
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out:   "code": 400,
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out:   "type": "InternalServerException",
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out:   "message": "pop expected at most 1 argument, got 2"
2023-05-04T13:47:53.343:[sagemaker logs]: sagemaker-us-east-1-/jpIndClassTL-sample20-20230419214517/data/jp_test_data_for_preds.jsonl.out.out: }
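
Both messages look like plain Python errors, and they can be reproduced locally. The sketch below rests on assumptions: that the container parses each line with `json.loads`, and that it then calls something like `data.pop("inputs", data)` on the result. Neither is confirmed by the log, and the helper names are hypothetical.

```python
import json


def parse_error(payload):
    """Return the JSONDecodeError message for payload, or None if it parses."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as err:
        return str(err)
    return None


def pop_error(data):
    """Mimic an assumed `data.pop("inputs", data)` call; return any TypeError text."""
    try:
        data.pop("inputs", data)
    except TypeError as err:
        return str(err)
    return None


# Two JSON documents on one line: the parser stops after the first one.
extra_data_msg = parse_error('[{"label": "A"}][{"label": "B"}]')

# A dict payload pops cleanly; list.pop() does not accept a default value.
dict_msg = pop_error({"inputs": "some text"})
list_msg = pop_error([{"label": "A", "score": 0.9}])
```

Under these assumptions, the first message would come from a line holding more than one JSON document, and the second from a line that parses to a list (such as a prediction) rather than a request object.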

Any thoughts @miOmiO @philschmid ? Thank you in advance to anyone that responds.

@huggingface I think this is a bug in the code…

If I try to run this code and an output file (at f'{s3_test_data_uri}.out') already exists, I get the error described above (even though the output file is successfully replaced).

If I run the above code, but first delete the output file, the code works.
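
One possible explanation, offered as a guess rather than a confirmed diagnosis: the log lines above are labelled with `.jsonl.out` and `.jsonl.out.out`, which suggests the previous output file was itself read as input. That would happen if the input URI is matched as an S3 prefix and the output is written next to the input. The sketch below illustrates the prefix match with plain strings; the keys and both helper names are hypothetical.

```python
import posixpath


def keys_read_as_input(existing_keys, input_prefix):
    """Return the keys that a plain prefix match on input_prefix would select."""
    return [key for key in existing_keys if key.startswith(input_prefix)]


def separate_output_prefix(input_key, folder="predictions"):
    """Build a sibling prefix so outputs never share the input's prefix."""
    parent = posixpath.dirname(input_key)
    return f"{parent}/{folder}" if parent else folder


# Hypothetical keys after one successful run that wrote output beside the input.
keys = ["data/test.jsonl", "data/test.jsonl.out"]
matched = keys_read_as_input(keys, "data/test.jsonl")

# Writing to a sibling folder keeps the output out of the input prefix.
out_prefix = separate_output_prefix("data/test.jsonl")
```

If this is what is happening, pointing output_path at a different prefix than the input data should avoid the failure without having to delete the old .out file first.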

Hello @jenpeper,

A bug in SageMaker? Or where?