Unable to use custom dataset when training a tokenizer

Hello,

I am following this tutorial: tokenizer_training.ipynb from the huggingface/notebooks repository on GitHub.

So, using this code, I load my custom dataset:

from datasets import load_dataset
dataset = load_dataset('csv', data_files=['/content/drive/MyDrive/mydata.csv'])

Then, I use this code to take a look at the dataset:

dataset

Access an element:

dataset['train'][1]

Access a slice directly:

dataset['train'][:5]

After executing the above code successfully, I try to execute this:

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

However, I get this error:

KeyError: "Invalid key: slice(0, 1000, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(0, 1000, None)]`. Available splits: ['train']"

How do I fix this?

I am trying to train my own tokenizer, and this seems to be an issue.

Any help would be appreciated!

When asking for help on the forum, please paste all relevant code. In this case, you did not paste the definition of batch_iterator.
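
Assuming you copied batch_iterator unchanged from the notebook, it looks roughly like this (the "text" column name comes from the notebook's dataset, so adjust it to your CSV):

def batch_iterator(batch_size=1000):
    # Yield the "text" column in chunks of 1000 rows; the slice(0, 1000, None)
    # in your traceback comes from the first dataset[0:1000] lookup here.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]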

If you are following the notebook, note that calling load_dataset without a split argument does not return a single dataset but a DatasetDict keyed by split (here just 'train'), which is why slicing it directly raises this error. You should add the split="train" argument when you load your dataset, or adapt the code of batch_iterator to index into your dataset dictionary.
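
A minimal sketch of the first option, assuming your CSV has a "text" column and that tokenizer is the pretrained tokenizer already loaded as in the notebook:

from datasets import load_dataset

# Passing split='train' returns a single Dataset instead of a DatasetDict,
# so the slicing inside batch_iterator works as the notebook expects.
dataset = load_dataset(
    'csv',
    data_files=['/content/drive/MyDrive/mydata.csv'],
    split='train',
)

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

If you prefer to keep your original load_dataset call, the alternative is to index the split inside batch_iterator instead, e.g. dataset['train'][i : i + batch_size]["text"].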

Okay, thanks for that. I have now trained my own tokenizer from scratch, so how do I use it for the masked language modeling task?