KeyError: '_data' when training on AWS

Hi all,

I’ve been adapting the getting started notebook to my particular use case. I wrote my data out to S3 and kicked off .fit(), but I’m getting this error block:

2021-04-23 04:58:40,552 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "", line 42, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
KeyError: '_data'

What has me scratching my head is that when I look at the file referenced in the traceback, I can’t find these lines, which makes me think the version in AWS’ container differs from the one I’m looking at.

Regardless, does anyone have any advice/intuition on what may be going on here? I don’t know what the ‘_data’ key would refer to in this case, and am looking for help. Thanks!

Hey @cccx3,

Thank you for creating this topic! There was an error in the 01_getting_started_pytorch notebook: it installed datasets 1.6.0, which includes some changes, while the DLC currently has 1.5.0 installed.
This was already fixed in a PR yesterday.
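
To illustrate the failure mode, here is a minimal sketch in plain Python of how the dict comprehension from the traceback raises a KeyError when the saved state lacks a key the loader expects. The key names other than '_data' and the helper restore_state are hypothetical and chosen only for illustration; this is not the actual datasets code.

```python
# Hypothetical saved state, standing in for what a newer library version wrote.
# The real contents of the saved state are not reproduced here.
saved_state = {"_fingerprint": "abc123", "_format_type": None}

# Hypothetical attribute names the older loader expects to find.
expected_keys = ["_data", "_fingerprint", "_format_type"]


def restore_state(state, keys):
    # Same pattern as the line shown in the traceback:
    # every expected key must be present in the saved state.
    return {k: state[k] for k in keys}


missing_key = None
try:
    restore_state(saved_state, expected_keys)
except KeyError as err:
    missing_key = err.args[0]
    print(f"KeyError: {missing_key!r}")
```

The lookup fails on the first expected key that the saved state does not contain, which is why the error names '_data' rather than pointing at the version mismatch directly.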

To solve this, install datasets==1.5.0 in your notebook and pre-process the data again, so that the dataset saved to S3 is written by the same version that loads it in the container.

!pip install "sagemaker>=2.31.0" "transformers==4.4.2" "datasets[s3]==1.5.0" --upgrade
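
Before pre-processing again, it may help to confirm which datasets version the notebook kernel actually picked up. A minimal sketch, assuming the container ships 1.5.0 as stated above; the helper names same_version and installed_datasets_version are made up for this example.

```python
# Version of `datasets` in the training container, per the reply above.
CONTAINER_DATASETS_VERSION = "1.5.0"


def same_version(installed, expected=CONTAINER_DATASETS_VERSION):
    # Exact string comparison after trimming whitespace.
    return installed.strip() == expected.strip()


def installed_datasets_version():
    # Returns None when `datasets` is not installed in this environment.
    try:
        import datasets
    except ImportError:
        return None
    return datasets.__version__


version = installed_datasets_version()
if version is None:
    print("datasets is not installed in this kernel")
elif same_version(version):
    print(f"datasets {version} matches the container")
else:
    print(f"datasets {version} differs from container version "
          f"{CONTAINER_DATASETS_VERSION}; re-install and pre-process again")
```

If the notebook had already imported datasets before the pip install, the kernel probably needs a restart for the check to reflect the new version.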

I had the same error and this fixed it for me, thanks!

Do you know if/when datasets will be updated to 1.6.0 in the HuggingFace DLC?

We are going to include it with the next transformers release (4.6.0).

If you want to use it before then, you can add a requirements.txt to your source_dir with datasets==1.6.1 in it.
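
A minimal sketch of creating that file from the notebook. The helper write_requirements is hypothetical, and the directory you pass in should be whatever you use as source_dir in your estimator.

```python
from pathlib import Path


def write_requirements(source_dir, pins):
    # Create source_dir if needed and write one pinned package per line.
    path = Path(source_dir) / "requirements.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(pins) + "\n")
    return path
```

For example, write_requirements("./scripts", ["datasets==1.6.1"]) would write the pin into ./scripts/requirements.txt, assuming ./scripts is your source_dir. The notebook should then use the same datasets version when saving the data, so both sides agree.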