KeyError: '_data' when training on AWS

Hi all,

I’ve been working through adapting the getting started notebook to my particular use case. I wrote my data out to S3 and kicked off .fit(), but I’m getting this error block:

2021-04-23 04:58:40,552 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 42, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
KeyError: '_data'

What’s leaving me scratching my head is that when I look at the arrow_dataset.py file in the repo, I can’t find lines like these, which makes me think there’s a discrepancy between whatever version ships in AWS’s container and that file.

Regardless, does anyone have any advice/intuition on what may be going on here? I don’t know what the ‘_data’ key would refer to in this case, and am looking for help. Thanks!
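
In case it helps, here’s the quick sanity check I added to the top of train.py to log the library versions inside the container, so I could compare them against my notebook environment (just a debugging sketch):

# top of train.py: log library versions inside the training container
import datasets
import transformers

print(f"datasets version in container: {datasets.__version__}")
print(f"transformers version in container: {transformers.__version__}")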

Hey @cccx3,

Thank you for creating this topic! There was an error in the 01_getting_started_pytorch notebook: it installed datasets 1.6.0, which changed the saved dataset format (the state.json it writes no longer contains the _data entry the older loader looks for), while the DLC currently ships datasets 1.5.0. That version mismatch is what raises the KeyError.
This was already fixed in a PR yesterday.

To solve this you need to install datasets==1.5.0 in your notebook and pre-process the data again.

!pip install "sagemaker>=2.31.0" "transformers==4.4.2" "datasets[s3]==1.5.0" --upgrade
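
After pinning, re-run the preprocessing and upload. A minimal sketch, assuming the tokenizer, dataset, and S3 path names from the getting-started notebook (adjust the bucket placeholder to yours):

# re-tokenize and save with datasets 1.5.0 so the DLC (also 1.5.0) can read it
from datasets import load_dataset
from datasets.filesystems import S3FileSystem
from transformers import AutoTokenizer

s3 = S3FileSystem()
training_input_path = "s3://<your-bucket>/datasets/imdb/train"  # placeholder

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_dataset = load_dataset("imdb", split="train")
train_dataset = train_dataset.map(
    lambda e: tokenizer(e["text"], truncation=True, padding="max_length"),
    batched=True,
)
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
train_dataset.save_to_disk(training_input_path, fs=s3)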

I had the same error and this fixed it for me, thanks!

Do you know if/when datasets will be updated to 1.6.0 in the HuggingFace DLC?

We are going to include it with the next transformers release (4.6.0).

If you want to use it in advance you can add a requirements.txt in your source_dir with datasets==1.6.1 in it.
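
For example, with a layout like this (paths are placeholders), the SageMaker training toolkit pip-installs the file before launching your entry point:

# scripts/requirements.txt
datasets==1.6.1

and in the estimator: entry_point="train.py", source_dir="./scripts".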

I’m facing the same issue with the following lib versions:
datasets==1.11.0
sagemaker==2.48.1

UnexpectedStatusException: Error for Training job huggingface-pytorch-training-2021-08-19-12-34-40-568: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 3 --model_name bert-base-uncased --train_batch_size 16"
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
KeyError: '_data'

Is datasets version 1.5.0 strictly necessary?

Hey @anikalburgi,

Since the creation of this topic we have released new DLC containers; take a look here: Hugging Face on Amazon SageMaker
You can use transformers_version=4.6.1 in your Estimator to use the image with datasets 1.6.2.
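
For reference, a minimal estimator sketch; role, paths, and hyperparameters are placeholders, and the pytorch_version/py_version pairing should be checked against the table linked above:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",        # placeholder path
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,                     # your SageMaker execution role
    transformers_version="4.6.1",  # selects the DLC that ships datasets 1.6.2
    pytorch_version="1.7.1",       # pairing listed for the 4.6.1 image
    py_version="py36",
    hyperparameters={
        "epochs": 3,
        "model_name": "bert-base-uncased",
        "train_batch_size": 16,
    },
)
huggingface_estimator.fit({"train": training_input_path})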

Perfect, thanks!
This works!
Although, I’m now running into a CUDA out-of-memory error with a batch size as low as 8 on a p3.2xlarge instance for bert-base-uncased.
Usually, when I train with torch directly outside SageMaker, a batch size of 14-16 works safely for me on both g4dn.xlarge and p3.2xlarge instances.

Can you share your HuggingFace estimator and the script you use?

Nah, it was my bad.
EOD rush: I used underscores instead of hyphens in the batch size arg.
It’s fixed and working smoothly!
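
For anyone who hits the same thing: SageMaker turns every key in hyperparameters into a --<key> <value> command-line flag, so the key has to match the flag name train.py actually defines. A sketch of the mismatch, assuming an argparse-based script like the notebook’s:

# train.py (hypothetical excerpt)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--train-batch-size", type=int, default=32)
args, _ = parser.parse_known_args()

# hyperparameters={"train-batch-size": 8} -> "--train-batch-size 8", applied
# hyperparameters={"train_batch_size": 8} -> "--train_batch_size 8", which
# parse_known_args() silently ignores, so the default of 32 is used and the
# job runs out of GPU memory despite the "batch size 8" you asked for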

Thank you @philschmid
