KeyError: 'Image' when using an entry_point script with the Hugging Face Estimator

I am trying to train and deploy a Vision Transformer model in SageMaker using Hugging Face. I have tried the Hugging Face example [notebook on the matter](https://github.com/huggingface/notebooks/tree/main/sagemaker/09_image_classification_vision_transformer) and still get the same error: `KeyError: 'Image'`.

The output says the error occurs in the `load_from_disk` function, but that doesn't make sense to me: with the same path, calling `load_from_disk` outside of the training script (i.e., in the notebook from which I invoke the training job), the dataset loads fine.

This is the full output:
```
2022-09-30 09:08:29 Training - Downloading the training image...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-09-30 09:11:42,795 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-09-30 09:11:42,830 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-09-30 09:11:42,837 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-09-30 09:11:43,425 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "model_name": "google/vit-base-patch16-224-in21k",
        "num_train_epochs": 1,
        "per_device_train_batch_size": 16
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2022-09-30-09-04-04-385",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-central-1-232977032562/huggingface-pytorch-training-2022-09-30-09-04-04-385/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.p3.2xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.p3.2xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"model_name":"google/vit-base-patch16-224-in21k","num_train_epochs":1,"per_device_train_batch_size":16}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.2xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["test","train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-central-1-232977032562/huggingface-pytorch-training-2022-09-30-09-04-04-385/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"model_name":"google/vit-base-patch16-224-in21k","num_train_epochs":1,"per_device_train_batch_size":16},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2022-09-30-09-04-04-385","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-central-1-232977032562/huggingface-pytorch-training-2022-09-30-09-04-04-385/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.2xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--model_name","google/vit-base-patch16-224-in21k","--num_train_epochs","1","--per_device_train_batch_size","16"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_MODEL_NAME=google/vit-base-patch16-224-in21k
SM_HP_NUM_TRAIN_EPOCHS=1
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=16
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages
Invoking script with the following command:
/opt/conda/bin/python train.py --model_name google/vit-base-patch16-224-in21k --num_train_epochs 1 --per_device_train_batch_size 16
Traceback (most recent call last):
  File "train.py", line 61, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1685, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs, keep_in_memory=keep_in_memory)
  File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1057, in load_from_disk
    dataset_info = DatasetInfo.from_dict(json.load(dataset_info_file))
  File "/opt/conda/lib/python3.8/site-packages/datasets/info.py", line 261, in from_dict
    return cls(**{k: v for k, v in dataset_info_dict.items() if k in field_names})
  File "<string>", line 20, in __init__
  File "/opt/conda/lib/python3.8/site-packages/datasets/info.py", line 144, in __post_init__
    self.features = Features.from_dict(self.features)
  File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 1019, in from_dict
    obj = generate_from_dict(dic)
  File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 876, in generate_from_dict
    return {key: generate_from_dict(value) for key, value in obj.items()}
  File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 876, in <dictcomp>
    return {key: generate_from_dict(value) for key, value in obj.items()}
  File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 877, in generate_from_dict
    class_type = globals()[obj.pop("_type")]
KeyError: 'Image'
2022-09-30 09:11:48,904 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2022-09-30 09:11:48,904 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2022-09-30 09:11:48,904 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-09-30 09:11:48,904 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "KeyError: 'Image'
"
Command "/opt/conda/bin/python train.py --model_name google/vit-base-patch16-224-in21k --num_train_epochs 1 --per_device_train_batch_size 16"
2022-09-30 09:11:48,904 sagemaker-training-toolkit ERROR Encountered exit_code 1
```
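For context on the failing line in the traceback: `datasets` reconstructs each feature from the `"_type"` name saved in `dataset_info.json`, looking that name up among the feature classes the installed version knows about. The lookup raises `KeyError` when the name (here `Image`) is not defined in that environment. A minimal sketch of the mechanism (the registry contents and function below are illustrative, not the real `datasets` internals):

```python
# Hypothetical stand-in for the module globals that datasets consults; a real
# installation would map feature-type names to feature classes.
KNOWN_FEATURE_TYPES = {"Value": object, "ClassLabel": object}

def generate_from_dict(obj):
    """Resolve a saved feature spec (or nested mapping of specs) to feature classes."""
    if isinstance(obj, dict) and "_type" not in obj:
        # Nested mapping of column name -> feature spec: recurse per column.
        return {key: generate_from_dict(value) for key, value in obj.items()}
    # Raises KeyError when the saved type name is unknown to this environment.
    return KNOWN_FEATURE_TYPES[obj.pop("_type")]

try:
    generate_from_dict({"image": {"_type": "Image"}})
except KeyError as err:
    print(f"KeyError: {err}")  # prints: KeyError: 'Image'
```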

Hi there

It’s worth noting that being able to load the dataset in the notebook instance does NOT mean the training script can successfully load it too.

The reason is that the training script runs on a separate EC2 instance that has no knowledge of your notebook instance. This is by design: you want a small, cheap notebook instance to orchestrate the data prep and training setup, but you (potentially) want a powerful, expensive instance to run the actual training on. To learn more about training HF models on SageMaker, have a look at this example: [notebooks/sagemaker/01_getting_started_pytorch](https://github.com/huggingface/notebooks/tree/main/sagemaker/01_getting_started_pytorch)
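Concretely, the training container only sees data through the channels passed to `.fit()`: SageMaker downloads each channel to `/opt/ml/input/data/<channel>` and exposes it as an `SM_CHANNEL_<NAME>` environment variable, which is what `args.training_dir` resolves to in the log above. A sketch of how a training script typically picks those up (the argument names mirror the Hugging Face example scripts but are assumptions here):

```python
import argparse
import os

# Inside the training container, SageMaker exposes each input channel as
# SM_CHANNEL_<NAME>, pointing at the data it downloaded from S3. Training
# scripts usually wire these in as argparse defaults:
parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str)
parser.add_argument("--num_train_epochs", type=int, default=1)
parser.add_argument("--training_dir", type=str,
                    default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
parser.add_argument("--test_dir", type=str,
                    default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))

args, _ = parser.parse_known_args(["--model_name", "google/vit-base-patch16-224-in21k"])
print(args.training_dir)  # /opt/ml/input/data/train (unless SM_CHANNEL_TRAIN is set)
```

The point is that these paths exist only inside the training instance, and they contain whatever was at the S3 URIs handed to `.fit()` — so `load_from_disk(args.training_dir)` only works if the dataset was actually saved to the S3 location the job was given.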

What does that mean for your particular case? Without seeing the notebook where you orchestrate the setup I can only guess, but it looks like you have either (a) not stored the dataset in the correct S3 bucket, or (b) not told the training job the correct S3 path to it.

Again, check out @philschmid’s example notebook; it should give you an idea of how to pass dataset paths to the training job.

Cheers
Heiko