KeyError: 'validation' when trying to use validation dataset

Hello,

I load my dataset using the following code:

dataset = load_dataset('csv', data_files={'train': ['/content/drive/data.csv'],
                                              'validation': '/content/drive/data.csv'})

I try to execute the following code:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

But then I get this error, which makes no sense to me:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-35-a5580f1a4d07> in <module>()
      3     args=training_args,
      4     train_dataset=lm_datasets["train"],
----> 5     eval_dataset=lm_datasets["validation"],
      6     data_collator=data_collator,
      7 )

/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py in __getitem__(self, k)
     35     def __getitem__(self, k) -> Dataset:
     36         if isinstance(k, (str, NamedSplit)) or len(self) == 0:
---> 37             return super().__getitem__(k)
     38         else:
     39             available_suggested_splits = [

KeyError: 'validation'

I’m getting the same error when testing this code:

train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

And it returns this error:

KeyError                                  Traceback (most recent call last)
Cell In [4], line 23
     22 train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
---> 23 eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

File C:\Python\Python310\lib\site-packages\datasets\dataset_dict.py:51, in DatasetDict.__getitem__(self, k)
     49 def __getitem__(self, k) -> Dataset:
     50     if isinstance(k, (str, NamedSplit)) or len(self) == 0:
---> 51         return super().__getitem__(k)
     52     else:
     53         available_suggested_splits = [
     54             split for split in (Split.TRAIN, Split.TEST, Split.VALIDATION) if split in self
     55         ]

KeyError: 'validation'

Hi! It looks like validation is not one of the keys in your dataset dict. What does tokenized_datasets.keys() print?
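
In case it helps with checking, here is a minimal sketch of looking up a split defensively. It uses a plain dict as a stand-in for the dataset dict (the traceback shows DatasetDict.__getitem__ deferring to the parent lookup for string keys, so a missing key fails the same way). The helper name get_split and the sample rows are made up for illustration and are not part of the datasets library.

```python
from collections.abc import Mapping


def get_split(splits: Mapping, name: str):
    """Return splits[name], or raise a KeyError listing the splits that exist.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    if name not in splits:
        available = ", ".join(sorted(splits)) or "<none>"
        raise KeyError(f"split {name!r} not found; available splits: {available}")
    return splits[name]


# Stand-in for a dataset dict that only ended up with a "train" split.
tokenized_datasets = {"train": ["row 0", "row 1"]}

# First check which splits actually exist.
print(list(tokenized_datasets.keys()))

train_split = get_split(tokenized_datasets, "train")

try:
    get_split(tokenized_datasets, "validation")
except KeyError as err:
    # err.args[0] is the message passed to KeyError above.
    message = err.args[0]
    print(message)
```

If 'validation' is missing from the printed keys, the object passed to the Trainer or DataLoader was probably built from a dataset that never had that split, for example one loaded or processed in a different cell than the load_dataset call that defines it.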

Sorry I never responded. I never fixed this. I started over with my code, took a different direction, and the error went away. Here are the two things I changed:

  1. I was using multiple libraries to wrangle the dataset. I got rid of the others and only used the datasets library.

  2. I stopped using anything that referred to validation and used train & test only.