KeyError: '_data' when training on AWS

Hi all,

I’ve been working through adapting the getting started notebook to my particular use case. I wrote my data out to S3 and kicked off .fit(), but I’m getting this error block:

2021-04-23 04:58:40,552 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 42, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
KeyError: '_data'

What’s leaving me scratching my head is that when I look at the arrow_dataset.py file in the repo, I can’t find lines like these, which makes me think there’s a discrepancy between whatever version ships in AWS’s container and that file.

Regardless, does anyone have any advice/intuition on what may be going on here? I don’t know what the ‘_data’ key would refer to in this case, and am looking for help. Thanks!
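
In case it helps, here’s the quick sanity check I added to the top of train.py to log the library versions inside the container, so I could compare them against my notebook environment (just a debugging sketch):

# top of train.py: log library versions inside the training container
import datasets
import transformers

print(f"datasets version in container: {datasets.__version__}")
print(f"transformers version in container: {transformers.__version__}")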

Hey @cccx3,

Thank you for creating this topic! There was an error in the 01_getting_started_pytorch notebook: it installed datasets 1.6.0, which changed the saved dataset format (the state.json it writes no longer contains the _data entry the older loader looks for), while the DLC currently ships datasets 1.5.0. That version mismatch is what raises the KeyError.
This was already fixed in a PR yesterday.

To solve this you need to install datasets==1.5.0 in your notebook and pre-process the data again.

!pip install "sagemaker>=2.31.0" "transformers==4.4.2" "datasets[s3]==1.5.0" --upgrade
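
After pinning, re-run the preprocessing and upload. A minimal sketch, assuming the tokenizer, dataset, and S3 path names from the getting-started notebook (adjust the bucket placeholder to yours):

# re-tokenize and save with datasets 1.5.0 so the DLC (also 1.5.0) can read it
from datasets import load_dataset
from datasets.filesystems import S3FileSystem
from transformers import AutoTokenizer

s3 = S3FileSystem()
training_input_path = "s3://<your-bucket>/datasets/imdb/train"  # placeholder

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_dataset = load_dataset("imdb", split="train")
train_dataset = train_dataset.map(
    lambda e: tokenizer(e["text"], truncation=True, padding="max_length"),
    batched=True,
)
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
train_dataset.save_to_disk(training_input_path, fs=s3)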

I had the same error and this fixed it for me, thanks!

Do you know if/when datasets will be updated to 1.6.0 in the HuggingFace DLC?

We are going to include it with the next transformers release (4.6.0).

If you want to use it in advance you can add a requirements.txt in your source_dir with datasets==1.6.1 in it.
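
For example, with a layout like this (paths are placeholders), the SageMaker training toolkit pip-installs the file before launching your entry point:

# scripts/requirements.txt
datasets==1.6.1

and in the estimator: entry_point="train.py", source_dir="./scripts".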

I’m facing the same issue with the following lib versions:
datasets==1.11.0
sagemaker==2.48.1

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-08-19-12-34-40-568: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 3 --model_name bert-base-uncased --train_batch_size 16"
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
KeyError: '_data'

Is datasets version 1.5.0 strictly necessary?

Hey @anikalburgi,

Since the creation of this topic we have released new DLC containers; take a look here: Hugging Face on Amazon SageMaker
You can use transformers_version=4.6.1 in your Estimator to use the image with datasets 1.6.2.
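
For reference, a minimal estimator sketch; role, paths, and hyperparameters are placeholders, and the pytorch_version/py_version pairing should be checked against the table linked above:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",        # placeholder path
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,                     # your SageMaker execution role
    transformers_version="4.6.1",  # selects the DLC that ships datasets 1.6.2
    pytorch_version="1.7.1",       # pairing listed for the 4.6.1 image
    py_version="py36",
    hyperparameters={
        "epochs": 3,
        "model_name": "bert-base-uncased",
        "train_batch_size": 16,
    },
)
huggingface_estimator.fit({"train": training_input_path})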

Perfect, thanks!
This works!
Although, I’m now running into a CUDA out-of-memory error with a batch size as low as 8 on a p3.2xlarge instance for bert-base-uncased.
Usually, when I train with torch directly outside SageMaker, a batch size of 14-16 works safely for me on both g4dn.xlarge and p3.2xlarge instances.

Can you share your HuggingFace estimator and the script you use?

Nah, it was my bad.
EOD rush: I used underscores instead of hyphens in the batch size arg.
It’s fixed and working smoothly!
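
For anyone who hits the same thing: SageMaker turns every key in hyperparameters into a --<key> <value> command-line flag, so the key has to match the flag name train.py actually defines. A sketch of the mismatch, assuming an argparse-based script like the notebook’s:

# train.py (hypothetical excerpt)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--train-batch-size", type=int, default=32)
args, _ = parser.parse_known_args()

# hyperparameters={"train-batch-size": 8} -> "--train-batch-size 8", applied
# hyperparameters={"train_batch_size": 8} -> "--train_batch_size 8", which
# parse_known_args() silently ignores, so the default of 32 is used and the
# job runs out of GPU memory despite the "batch size 8" you asked for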

Thank you @philschmid
