Hello,
I load my dataset using the following code:
dataset = load_dataset('csv', data_files={'train': ['/content/drive/data.csv'],
'validation': '/content/drive/data.csv'})
Then I try to execute the following code:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)
But then I get this error for some reason, which makes no sense:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-35-a5580f1a4d07> in <module>()
3 args=training_args,
4 train_dataset=lm_datasets["train"],
----> 5 eval_dataset=lm_datasets["validation"],
6 data_collator=data_collator,
7 )
/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py in __getitem__(self, k)
35 def __getitem__(self, k) -> Dataset:
36 if isinstance(k, (str, NamedSplit)) or len(self) == 0:
---> 37 return super().__getitem__(k)
38 else:
39 available_suggested_splits = [
KeyError: 'validation'
telavir
September 24, 2022, 5:49pm
2
I’m getting the same error when testing this:
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)
And it returns this error:
KeyError Traceback (most recent call last)
Cell In [4], line 23
22 train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
---> 23 eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)
File C:\Python\Python310\lib\site-packages\datasets\dataset_dict.py:51, in DatasetDict.__getitem__(self, k)
49 def __getitem__(self, k) -> Dataset:
50 if isinstance(k, (str, NamedSplit)) or len(self) == 0:
---> 51 return super().__getitem__(k)
52 else:
53 available_suggested_splits = [
54 split for split in (Split.TRAIN, Split.TEST, Split.VALIDATION) if split in self
55 ]
KeyError: 'validation'
Hi! It looks like validation is not one of the keys in your dataset dict. What does tokenized_datasets.keys() print?
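For anyone hitting the same KeyError, here is a minimal sketch of the check suggested above. Since DatasetDict behaves like a regular dict, a plain dict stands in for tokenized_datasets here. The helper name get_split and the toy contents are made up for illustration and are not part of the datasets library.

```python
def get_split(splits, name, fallbacks=()):
    """Return splits[name], or the first fallback that exists.

    Raises a KeyError listing the available splits if none match.
    """
    for candidate in (name, *fallbacks):
        if candidate in splits:
            return splits[candidate]
    raise KeyError(
        f"split {name!r} not found; available splits: {list(splits.keys())}"
    )


# Plain-dict stand-in for tokenized_datasets (hypothetical contents).
tokenized_datasets = {"train": ["a", "b", "c"], "test": ["d"]}

# Step 1: see which split names actually exist.
print(list(tokenized_datasets.keys()))

# Step 2: ask for "validation", falling back to "test" if it is missing.
eval_split = get_split(tokenized_datasets, "validation", fallbacks=("test",))
```

If the keys printed in step 1 do not include validation, the split was most likely dropped or renamed somewhere between loading and tokenizing, so that is the place to look.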
telavir
January 25, 2023, 11:11pm
4
Sorry I never responded. I never fixed this directly. I started over with my code in a different direction, and the error went away. Here are the two things I changed:
I was using multiple libraries to wrangle the dataset. I got rid of the others and used only the datasets library.
I stopped using anything that referred to validation and used train and test only.