KeyError: "Invalid key: slice(0, 1000, None). Please first select a split

anon58275033 · August 9, 2021, 1:11pm

Hi,

I am trying to train a tokenizer and execute the following line of code:

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

Though, when I execute it, I get this error:

KeyError: "Invalid key: slice(0, 1000, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(0, 1000, None)]`. Available splits: ['train']"

lhoestq · August 16, 2021, 8:31am

Hi! Your batch_iterator must iterate over a Dataset object, but it looks like you are iterating over a DatasetDict (which maps split names to Dataset objects).
To fix your code, just replace dataset with dataset["train"] in your definition of batch_iterator.
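
A minimal sketch of that change, assuming the original batch_iterator (not shown in this thread) slices the dataset in chunks of 1000 and reads a column named "text". The column name and the SplitStandIn class are hypothetical; the class only imitates how a Dataset answers len() and slice lookups, so the snippet runs without loading any data:

```python
class SplitStandIn:
    """Hypothetical stand-in for one split: supports len() and slicing,
    returning a dict of columns the way a Dataset does for a slice."""

    def __init__(self, texts):
        self._texts = list(texts)

    def __len__(self):
        return len(self._texts)

    def __getitem__(self, key):
        return {"text": self._texts[key]}


def batch_iterator(split, batch_size=1000, text_column="text"):
    """Yield lists of texts from a single split, batch_size rows at a time."""
    for start in range(0, len(split), batch_size):
        yield split[start : start + batch_size][text_column]


# A plain dict plays the role of the DatasetDict here.
dataset = {"train": SplitStandIn(f"sentence {i}" for i in range(2500))}

# Select the split first, then iterate: dataset["train"], not dataset.
batch_sizes = [len(batch) for batch in batch_iterator(dataset["train"])]
print(batch_sizes)  # [1000, 1000, 500]
```

With a real DatasetDict and this signature, the call from the first post would read tokenizer.train_new_from_iterator(batch_iterator(dataset["train"]), vocab_size=25000).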

Let me know if that works or if you have other questions.

anon58275033 · August 16, 2021, 9:24am

Hi, thanks for that - it works! I am now coming across this error, sadly:

hassanjbara · September 11, 2023, 6:13pm

But doesn’t that mean the trainer is only going to use the train dataset, because the iterator is only returning the train split?
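
For reference, an iterator built on dataset["train"] does only yield rows from the train split. If a DatasetDict had further splits that should also feed the tokenizer, one option is to loop over every split in turn. A rough sketch, assuming a "text" column and a hypothetical "validation" split (the dataset in this thread only lists 'train'); SplitStandIn and all_splits_batch_iterator are illustrative names, not library API:

```python
class SplitStandIn:
    """Hypothetical stand-in for one split: supports len() and slicing,
    returning a dict of columns the way a Dataset does for a slice."""

    def __init__(self, texts):
        self._texts = list(texts)

    def __len__(self):
        return len(self._texts)

    def __getitem__(self, key):
        return {"text": self._texts[key]}


def all_splits_batch_iterator(dataset_dict, batch_size=1000, text_column="text"):
    """Yield text batches from every split of a split-name -> split mapping."""
    # DatasetDict is a dict subclass, so .values() walks its splits in order.
    for split in dataset_dict.values():
        for start in range(0, len(split), batch_size):
            yield split[start : start + batch_size][text_column]


# A plain dict plays the role of a DatasetDict with two splits.
dataset = {
    "train": SplitStandIn(f"train {i}" for i in range(1500)),
    "validation": SplitStandIn(f"validation {i}" for i in range(200)),
}

batch_sizes = [len(batch) for batch in all_splits_batch_iterator(dataset)]
print(batch_sizes)  # [1000, 500, 200]
```

Whether held-out splits should be included when training a tokenizer is a separate design decision; the sketch only shows how the iterator could cover them.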
