KeyError: '_data' when training on AWS

Hi all,

I’ve been working through adapting the getting started notebook to my particular use case. I wrote my data out to S3 and kicked off .fit(), but the training job fails with this traceback:

2021-04-23 04:58:40,552 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32"
Traceback (most recent call last):
  File "train.py", line 42, in <module>
    train_dataset = load_from_disk(args.training_dir)
  File "/opt/conda/lib/python3.6/site-packages/datasets/load.py", line 781, in load_from_disk
    return Dataset.load_from_disk(dataset_path, fs)
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in load_from_disk
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
  File "/opt/conda/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 684, in <dictcomp>
    state = {k: state[k] for k in dataset.__dict__.keys()}  # in case we add new fields
KeyError: '_data'

What’s leaving me scratching my head is that when I look at arrow_dataset.py in my local install, I can’t find lines like these, which makes me think there’s a version discrepancy between the datasets library in AWS’ container and the one I used to write the data.

Regardless, does anyone have any advice or intuition on what may be going on here? I don’t know what the ‘_data’ key refers to in this case, and would appreciate any help. Thanks!
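In case it helps with the version-mismatch theory: this is a small sketch of how I’m checking which datasets version is installed locally, so it can be compared against whatever the container prints in its logs (the helper function name is just something I made up for this snippet):

```python
import importlib.metadata


def installed_version(pkg: str) -> str:
    # Look up a package's installed version from its metadata;
    # return a placeholder string if it isn't installed at all.
    try:
        return importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        return "not installed"


# Compare this output against the version inside the training container.
print("datasets:", installed_version("datasets"))
```

Running the same check inside the training script (e.g. printing it at the top of train.py) would show whether the container’s datasets version differs from the one that wrote the files.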