I am trying to train and deploy a Vision Transformer model in SageMaker using Hugging Face. I have tried the Hugging Face example [notebook on the matter](https://github.com/huggingface/notebooks/tree/main/sagemaker/09_image_classification_vision_transformer)
and I still get the same error: KeyError: 'Image'.
The traceback points at the load_from_disk call, which doesn't make sense to me: with the same path, calling load_from_disk outside of the training script (that is, in the notebook I launch the training script from) loads the dataset fine.
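Since the same path loads in the notebook but not inside the training job, one thing worth comparing is which library versions each environment has installed. A minimal sketch, assuming it is run once in the notebook and once at the top of train.py; `installed_version` is a hypothetical helper name, not part of any library:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions so the notebook output and the training log can be compared.
for name in ("datasets", "transformers"):
    print(name, installed_version(name))
```

If the two environments print different `datasets` versions, that difference would be a candidate explanation to check first.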
This is the full output:
2022-09-30 09:08:29 Training - Downloading the training image…bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-09-30 09:11:42,795 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-09-30 09:11:42,830 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-09-30 09:11:42,837 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-09-30 09:11:43,425 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
“additional_framework_parameters”: {},
“channel_input_dirs”: {
“test”: “/opt/ml/input/data/test”,
“train”: “/opt/ml/input/data/train”
},
“current_host”: “algo-1”,
“framework_module”: “sagemaker_pytorch_container.training:main”,
“hosts”: [
“algo-1”
],
“hyperparameters”: {
“model_name”: “google/vit-base-patch16-224-in21k”,
“num_train_epochs”: 1,
“per_device_train_batch_size”: 16
},
“input_config_dir”: “/opt/ml/input/config”,
“input_data_config”: {
“test”: {
“TrainingInputMode”: “File”,
“S3DistributionType”: “FullyReplicated”,
“RecordWrapperType”: “None”
},
“train”: {
“TrainingInputMode”: “File”,
“S3DistributionType”: “FullyReplicated”,
“RecordWrapperType”: “None”
}
},
“input_dir”: “/opt/ml/input”,
“is_master”: true,
“job_name”: “huggingface-pytorch-training-2022-09-30-09-04-04-385”,
“log_level”: 20,
“master_hostname”: “algo-1”,
“model_dir”: “/opt/ml/model”,
“module_dir”: “s3://sagemaker-eu-central-1-232977032562/huggingface-pytorch-training-2022-09-30-09-04-04-385/source/sourcedir.tar.gz”,
“module_name”: “train”,
“network_interface_name”: “eth0”,
“num_cpus”: 8,
“num_gpus”: 1,
“output_data_dir”: “/opt/ml/output/data”,
“output_dir”: “/opt/ml/output”,
“output_intermediate_dir”: “/opt/ml/output/intermediate”,
“resource_config”: {
“current_host”: “algo-1”,
“current_instance_type”: “ml.p3.2xlarge”,
“current_group_name”: “homogeneousCluster”,
“hosts”: [
“algo-1”
],
“instance_groups”: [
{
“instance_group_name”: “homogeneousCluster”,
“instance_type”: “ml.p3.2xlarge”,
“hosts”: [
“algo-1”
]
}
],
“network_interface_name”: “eth0”
},
“user_entry_point”: “train.py”
}
Environment variables:
SM_HOSTS=[“algo-1”]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={“model_name”:“google/vit-base-patch16-224-in21k”,“num_train_epochs”:1,“per_device_train_batch_size”:16}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={“current_group_name”:“homogeneousCluster”,“current_host”:“algo-1”,“current_instance_type”:“ml.p3.2xlarge”,“hosts”:[“algo-1”],“instance_groups”:[{“hosts”:[“algo-1”],“instance_group_name”:“homogeneousCluster”,“instance_type”:“ml.p3.2xlarge”}],“network_interface_name”:“eth0”}
SM_INPUT_DATA_CONFIG={“test”:{“RecordWrapperType”:“None”,“S3DistributionType”:“FullyReplicated”,“TrainingInputMode”:“File”},“train”:{“RecordWrapperType”:“None”,“S3DistributionType”:“FullyReplicated”,“TrainingInputMode”:“File”}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[“test”,“train”]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-central-1-232977032562/huggingface-pytorch-training-2022-09-30-09-04-04-385/source/sourcedir.tar.gz
SM_TRAINING_ENV={“additional_framework_parameters”:{},“channel_input_dirs”:{“test”:“/opt/ml/input/data/test”,“train”:“/opt/ml/input/data/train”},“current_host”:“algo-1”,“framework_module”:“sagemaker_pytorch_container.training:main”,“hosts”:[“algo-1”],“hyperparameters”:{“model_name”:“google/vit-base-patch16-224-in21k”,“num_train_epochs”:1,“per_device_train_batch_size”:16},“input_config_dir”:“/opt/ml/input/config”,“input_data_config”:{“test”:{“RecordWrapperType”:“None”,“S3DistributionType”:“FullyReplicated”,“TrainingInputMode”:“File”},“train”:{“RecordWrapperType”:“None”,“S3DistributionType”:“FullyReplicated”,“TrainingInputMode”:“File”}},“input_dir”:“/opt/ml/input”,“is_master”:true,“job_name”:“huggingface-pytorch-training-2022-09-30-09-04-04-385”,“log_level”:20,“master_hostname”:“algo-1”,“model_dir”:“/opt/ml/model”,“module_dir”:“s3://sagemaker-eu-central-1-232977032562/huggingface-pytorch-training-2022-09-30-09-04-04-385/source/sourcedir.tar.gz”,“module_name”:“train”,“network_interface_name”:“eth0”,“num_cpus”:8,“num_gpus”:1,“output_data_dir”:“/opt/ml/output/data”,“output_dir”:“/opt/ml/output”,“output_intermediate_dir”:“/opt/ml/output/intermediate”,“resource_config”:{“current_group_name”:“homogeneousCluster”,“current_host”:“algo-1”,“current_instance_type”:“ml.p3.2xlarge”,“hosts”:[“algo-1”],“instance_groups”:[{“hosts”:[“algo-1”],“instance_group_name”:“homogeneousCluster”,“instance_type”:“ml.p3.2xlarge”}],“network_interface_name”:“eth0”},“user_entry_point”:“train.py”}
SM_USER_ARGS=[“--model_name”,“google/vit-base-patch16-224-in21k”,“--num_train_epochs”,“1”,“--per_device_train_batch_size”,“16”]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_MODEL_NAME=google/vit-base-patch16-224-in21k
SM_HP_NUM_TRAIN_EPOCHS=1
SM_HP_PER_DEVICE_TRAIN_BATCH_SIZE=16
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages
Invoking script with the following command:
/opt/conda/bin/python train.py --model_name google/vit-base-patch16-224-in21k --num_train_epochs 1 --per_device_train_batch_size 16
Traceback (most recent call last):
File "train.py", line 61, in <module>
train_dataset = load_from_disk(args.training_dir)
File "/opt/conda/lib/python3.8/site-packages/datasets/load.py", line 1685, in load_from_disk
return Dataset.load_from_disk(dataset_path, fs, keep_in_memory=keep_in_memory)
File "/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1057, in load_from_disk
dataset_info = DatasetInfo.from_dict(json.load(dataset_info_file))
File "/opt/conda/lib/python3.8/site-packages/datasets/info.py", line 261, in from_dict
return cls(**{k: v for k, v in dataset_info_dict.items() if k in field_names})
File "<string>", line 20, in __init__
File "/opt/conda/lib/python3.8/site-packages/datasets/info.py", line 144, in __post_init__
self.features = Features.from_dict(self.features)
File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 1019, in from_dict
obj = generate_from_dict(dic)
File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 876, in generate_from_dict
return {key: generate_from_dict(value) for key, value in obj.items()}
File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 876, in <dictcomp>
return {key: generate_from_dict(value) for key, value in obj.items()}
File "/opt/conda/lib/python3.8/site-packages/datasets/features/features.py", line 877, in generate_from_dict
class_type = globals()[obj.pop("_type")]
KeyError: 'Image'
2022-09-30 09:11:48,904 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2022-09-30 09:11:48,904 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2022-09-30 09:11:48,904 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-09-30 09:11:48,904 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "KeyError: ‘Image’
"
Command “/opt/conda/bin/python train.py --model_name google/vit-base-patch16-224-in21k --num_train_epochs 1 --per_device_train_batch_size 16”
2022-09-30 09:11:48,904 sagemaker-training-toolkit ERROR Encountered exit_code 1
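
The last frame of the traceback resolves each saved feature by looking up its `_type` name in the module's globals, so KeyError: 'Image' means the `datasets` installation inside the container has no feature class with that name. One possible reading (an assumption, the log does not confirm it) is that the dataset was saved with a `datasets` release that knows the `Image` feature and is being loaded by an older one that does not. A minimal sketch of that kind of lookup failure, using a hypothetical `KNOWN_FEATURE_TYPES` registry and stand-in classes instead of the library's real module globals:

```python
# Stand-in feature classes; these are illustrative, not the library's own.
class Value:
    def __init__(self, dtype):
        self.dtype = dtype


class ClassLabel:
    def __init__(self, names):
        self.names = names


# Hypothetical registry playing the role of globals() in an older library
# version: it has no entry named "Image".
KNOWN_FEATURE_TYPES = {"Value": Value, "ClassLabel": ClassLabel}


def generate_from_dict(obj):
    """Rebuild feature objects from their serialized dict form."""
    if "_type" not in obj:
        # A plain mapping of column name -> serialized feature.
        return {key: generate_from_dict(value) for key, value in obj.items()}
    obj = dict(obj)  # copy so the caller's dict is not mutated
    class_type = KNOWN_FEATURE_TYPES[obj.pop("_type")]  # KeyError if unknown
    return class_type(**obj)


def missing_feature_type(features):
    """Return the unknown type name that breaks loading, or None if all resolve."""
    try:
        generate_from_dict(features)
        return None
    except KeyError as err:
        return err.args[0]


# Serialized features as a newer writer might have stored them.
serialized = {
    "image": {"_type": "Image"},
    "label": {"_type": "ClassLabel", "names": ["cat", "dog"]},
}
print(missing_feature_type(serialized))
```

Under that assumption, the notebook and the container would disagree only on which type names exist, which would match the dataset loading fine in one environment and failing in the other.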