KeyError: "Invalid key: slice(0, 1000, None). Please first select a split

Hi,

I am trying to train a tokenizer and execute the following line of code:

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

Though, when I execute it, I get this error:

KeyError: "Invalid key: slice(0, 1000, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(0, 1000, None)]`. Available splits: ['train']"

Hi! Your batch_iterator must iterate over a Dataset object, but it looks like you are iterating over a DatasetDict (which maps split names to Dataset objects).
To fix your code, replace dataset with dataset["train"] in your definition of batch_iterator.
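
In case it helps, here is a rough sketch of what the fixed iterator could look like. I haven't seen your batch_iterator, so this is a guess at its shape: the column name "text" is an assumption, and the batch size of 1000 is taken from the slice in your error message. ToySplit is a made-up stand-in, there only so the snippet runs without downloading anything. It mimics the one behavior that matters here, which is that slicing a Dataset returns a dict of columns.

```python
def batch_iterator(split, batch_size=1000, text_column="text"):
    """Yield lists of strings, batch_size rows at a time, from a single split."""
    for start in range(0, len(split), batch_size):
        # Slicing a single split returns a dict mapping column names to lists,
        # so we pick out the text column (the name "text" is an assumption).
        yield split[start : start + batch_size][text_column]


class ToySplit:
    """Hypothetical stand-in for one split (a Dataset), only for this demo."""

    def __init__(self, columns):
        self.columns = columns

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        return {name: values[key] for name, values in self.columns.items()}


# Stand-in for a DatasetDict: split names map to split objects.
dataset = {"train": ToySplit({"text": [f"sentence {i}" for i in range(2500)]})}

# Select the split first, then iterate over it in batches.
batch_sizes = [len(batch) for batch in batch_iterator(dataset["train"])]
```

With your real objects, the call would then presumably become tokenizer.train_new_from_iterator(batch_iterator(dataset["train"]), vocab_size=25000). Passing the split in as an argument is just my preference. Keeping a global dataset and indexing it with ["train"] inside the function works as well.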

Let me know if that works or if you have other questions :wink:

Hi, thanks for that - it works! I am now coming across this error, sadly:

But doesn’t that mean the trainer is only going to use the train dataset, because the iterator only returns the train split?