I’ve been working through adapting the getting started notebook to my particular use case. I wrote my data out to S3 and kicked off .fit(), but I’m getting this error block:
2021-04-23 04:58:40,552 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
File "train.py", line 42, in <module>
train_dataset = load_from_disk(args.training_dir)
File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
return Dataset.load_from_disk(dataset_path, fs)
File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
state = {k: state[k] for k in dataset.__dict__.keys()} # in case we add new fields
File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
state = {k: state[k] for k in dataset.__dict__.keys()} # in case we add new fields
KeyError: '_data'
What’s leaving me scratching my head is that when I look at the arrow_dataset.py file myself, I can’t find lines like these, which makes me think there’s a version discrepancy between whatever is in AWS’s container and the file I’m referencing.
Regardless, does anyone have any advice or intuition on what may be going on here? I don’t know what the ‘_data’ key refers to in this case, and am looking for help. Thanks!
Thank you for creating this topic! There was an error in the 01_getting_started_pytorch notebook: it installed datasets 1.6.0, which includes some changes to the on-disk format, while the DLC currently has datasets 1.5.0 installed.
This was already fixed in a PR yesterday.
To solve this, install datasets==1.5.0 in your notebook and pre-process the data again.
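In practice that means pinning the version in the notebook and then re-running the preprocessing cells, roughly like this (a sketch; the exact preprocessing cells are the ones from the getting started notebook):

```shell
# Pin datasets to the version shipped in the current DLC, so the Arrow
# files the notebook writes match what the training container can read.
pip install "datasets==1.5.0"
# Then re-run the tokenization / save_to_disk / S3 upload cells so the
# data on S3 is rewritten in the 1.5.0 on-disk format.
```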
I’m facing the same issue with the following library versions:
datasets == 1.11.0
sagemaker == 2.48.1
UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-08-19-12-34-40-568: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 3 --model_name bert-base-uncased --train_batch_size 16"
Traceback (most recent call last):
File "train.py", line 41, in <module>
train_dataset = load_from_disk(args.training_dir)
File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
return Dataset.load_from_disk(dataset_path, fs)
File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
state = {k: state[k] for k in dataset.__dict__.keys()} # in case we add new fields
File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
state = {k: state[k] for k in dataset.__dict__.keys()} # in case we add new fields
KeyError: '_data'
Since the creation of this topic we have released new DLC containers; take a look here: Hugging Face on Amazon SageMaker
You can use transformers_version=4.6.1 in your Estimator to use the image with datasets 1.6.2.
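Selecting that image comes down to the version arguments on the estimator. A minimal sketch, where the entry point, role, and hyperparameters are placeholders, and pytorch_version="1.7.1" / py_version="py36" are my assumption of the versions paired with transformers 4.6.1 in that image:

```python
from sagemaker.huggingface import HuggingFace

# Placeholder values; substitute your own script, role, and hyperparameters.
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="<your-sagemaker-execution-role>",  # placeholder
    transformers_version="4.6.1",            # selects the DLC that ships datasets 1.6.2
    pytorch_version="1.7.1",                 # assumed pairing for this image
    py_version="py36",
    hyperparameters={
        "epochs": 3,
        "model_name": "bert-base-uncased",
        "train_batch_size": 16,
    },
)

# huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```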
Perfect, Thanks!
This works!
Although, I’m now facing a CUDA out-of-memory issue with a batch size even as low as 8 on a p3.2xlarge instance for bert-base-uncased.
Usually, when I train with torch directly outside SageMaker, a batch size of 14-16 works safely for me on both g4dn.xlarge and p3.2xlarge instances.
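Not an answer from this thread, but a common workaround when the per-device batch size has to shrink is gradient accumulation, which keeps the effective batch size unchanged. The numbers below just illustrate the arithmetic:

```python
# If a batch of 16 OOMs on the p3.2xlarge but 8 fits, accumulating
# gradients over 2 forward/backward passes before each optimizer step
# gives the same effective batch size as before.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 16
```

With the Trainer these map directly to the `per_device_train_batch_size` and `gradient_accumulation_steps` training arguments.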