Error for Training job huggingface-sdk-extension-2022-01-24-16-31-30-883: Failed. Reason: AlgorithmError: ExecuteUserScriptError:

Hi all, I am getting the error below when trying to train BERT (distilbert-base-uncased) on SageMaker. Any help would be much appreciated; it's a bit urgent.

Error:
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
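From the message, it looks like the tokenizer needs truncation and padding enabled so that all examples in a batch end up the same length. A minimal sketch of what I understand that to mean (hypothetical; the "text" column name and the helper below are placeholders, not my exact train.py):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # truncation=True caps sequences at the model's maximum length,
    # padding="max_length" pads every example to that same length
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# assuming the datasets library is used for preprocessing:
# train_dataset = train_dataset.map(tokenize, batched=True)
```

Is that what the error is asking for here, or does this need to happen somewhere else (e.g. in the data collator)?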

Error for Training job huggingface-sdk-extension-2022-01-24-16-47-13-971: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 2 --model_name distilbert-base-uncased --train_batch_size 32"
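
For completeness, the estimator that launches this command is set up roughly like the sketch below (from memory; the instance type, role, and container versions here are placeholders rather than my exact notebook):

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # execution role of the notebook instance

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",   # placeholder, not necessarily what I used
    instance_count=1,
    role=role,
    transformers_version="4.6",      # placeholder version
    pytorch_version="1.7",           # placeholder version
    py_version="py36",
    hyperparameters={
        "epochs": 2,
        "model_name": "distilbert-base-uncased",
        "train_batch_size": 32,
    },
)

# huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```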

Log:

2022-01-24 16:53:12,880 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-01-24 16:53:12,904 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-01-24 16:53:15,927 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-01-24 16:53:16,395 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/test",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "train_batch_size": 32,
        "model_name": "distilbert-base-uncased",
        "epochs": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-sdk-extension-2022-01-24-16-47-13-971",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-west-2-352316401451/huggingface-sdk-extension-2022-01-24-16-47-13-971/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":2,"model_name":"distilbert-base-uncased","train_batch_size":32}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[“test”,“train”]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-2-352316401451/huggingface-sdk-extension-2022-01-24-16-47-13-971/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":2,"model_name":"distilbert-base-uncased","train_batch_size":32},"input_config_dir":"/opt/ml/input/config","input_data_config":{"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-sdk-extension-2022-01-24-16-47-13-971","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-2-352316401451/huggingface-sdk-extension-2022-01-24-16-47-13-971/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--epochs","2","--model_name","distilbert-base-uncased","--train_batch_size","32"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_TRAIN_BATCH_SIZE=32
SM_HP_MODEL_NAME=distilbert-base-uncased
SM_HP_EPOCHS=2
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 train.py --epochs 2 --model_name distilbert-base-uncased --train_batch_size 32
2022-01-24 16:53:21,122 - __main__ - INFO - loaded train_dataset length is: 572
2022-01-24 16:53:21,122 - __main__ - INFO - loaded test_dataset length is: 144
2022-01-24 16:53:21,457 - filelock - INFO - Lock 140311318218288 acquired on /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333.lock
2022-01-24 16:53:21,790 - filelock - INFO - Lock 140311318218288 released on /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333.lock
2022-01-24 16:53:22,163 - filelock - INFO - Lock 140311212156688 acquired on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a.lock
2022-01-24 16:53:27,475 - filelock - INFO - Lock 140311212156688 released on /root/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a.lock
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-01-24 16:53:28,847 - filelock - INFO - Lock 140311146924352 acquired on /root/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10c3c92122b827d92eb2d34ce94ee79ba486c.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
2022-01-24 16:53:29,484 - filelock - INFO - Lock 140311146924352 released on /root/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10c3c92122b827d92eb2d34ce94ee79ba486c.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
2022-01-24 16:53:29,811 - filelock - INFO - Lock 140311211720320 acquired on /root/.cache/huggingface/transformers/75abb59d7a06f4f640158a9bfcde005264e59e8d566781ab1415b139d2e4c603.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
2022-01-24 16:53:30,528 - filelock - INFO - Lock 140311211720320 released on /root/.cache/huggingface/transformers/75abb59d7a06f4f640158a9bfcde005264e59e8d566781ab1415b139d2e4c603.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
2022-01-24 16:53:31,518 - filelock - INFO - Lock 140311211719928 acquired on /root/.cache/huggingface/transformers/8c8624b8ac8aa99c60c912161f8332de003484428c47906d7ff7eb7f73eecdbb.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock

2022-01-24 16:53:38 Uploading - Uploading generated training model
2022-01-24 16:53:31,850 - filelock - INFO - Lock 140311211719928 released on /root/.cache/huggingface/transformers/8c8624b8ac8aa99c60c912161f8332de003484428c47906d7ff7eb7f73eecdbb.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
[2022-01-24 16:53:36.715 algo-1:26 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-01-24 16:53:36.867 algo-1:26 INFO profiler_config_parser.py:102] User has disabled profiler.
[2022-01-24 16:53:36.868 algo-1:26 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2022-01-24 16:53:36.869 algo-1:26 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2022-01-24 16:53:36.870 algo-1:26 INFO hook.py:255] Saving to /opt/ml/output/tensors
[2022-01-24 16:53:36.871 algo-1:26 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
Downloading: 100%|██████████| 483/483 [00:00<00:00, 460kB/s]
Downloading: 100%|██████████| 268M/268M [00:05<00:00, 50.9MB/s]
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 755kB/s]
Downloading: 100%|██████████| 466k/466k [00:00<00:00, 1.52MB/s]
Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 24.3kB/s]
0%| | 0/36 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 699, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: too many dimensions 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 83, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 1246, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 444, in __next__
    (data, worker_id) = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 526, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.6/site-packages/transformers/data/data_collator.py", line 123, in __call__
    return_tensors="pt",
  File "/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2680, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 204, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 716, in convert_to_tensors
    "Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
0%| | 0/36 [00:00<?, ?it/s]
2022-01-24 16:53:37,616 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 2 --model_name distilbert-base-uncased --train_batch_size 32"

2022-01-24 16:54:35 Failed - Training job failed
ProfilerReport-1643042834: Stopping

UnexpectedStatusException Traceback (most recent call last)
in
10
11 # starting the train job with our uploaded datasets as input
---> 12 huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1665 # If logs are requested, call logs_for_jobs.
1666 if logs != "None":
--> 1667 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1668 else:
1669 self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3783
3784 if wait:
--> 3785 self._check_job_status(job_name, description, "TrainingJobStatus")
3786 if dot:
3787 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3341 ),
3342 allowed_statuses=["Completed", "Stopped"],
--> 3343 actual_status=status,
3344 )
3345

UnexpectedStatusException: Error for Training job huggingface-sdk-extension-2022-01-24-16-47-13-971: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 2 --model_name distilbert-base-uncased --train_batch_size 32"
Downloading: 100%|██████████| 483/483 [00:00<00:00, 460kB/s]
Downloading: 18%|█▊ | 47.8M/268M [00:01<00:04, 49.1MB/s]
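
One thing I noticed while re-reading the traceback: the inner exception is ValueError: too many dimensions 'str', which makes me wonder whether my label (or raw text) column is still being passed to the data collator as strings. If that is the cause, a sketch of what I would try is below (hypothetical; the dataset and column names are placeholders and assume the datasets library):

```python
# Hypothetical sketch: map string labels to integer ids before training.
# Assumes train_dataset / test_dataset are datasets.Dataset objects with a string "label" column.
label_list = sorted(set(train_dataset["label"]))              # e.g. ["negative", "positive"]
label2id = {name: idx for idx, name in enumerate(label_list)}

def encode_labels(example):
    example["label"] = label2id[example["label"]]
    return example

train_dataset = train_dataset.map(encode_labels)
test_dataset = test_dataset.map(encode_labels)

# The raw "text" column would also need to be tokenized or removed before it reaches
# the collator, e.g. train_dataset = train_dataset.remove_columns(["text"]).
```

Does that sound like the right direction, or should the padding/truncation change alone fix this?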

Hi all, any updates on this?