Unable to use custom dataset when training a tokenizer

Hello,

I am following this tutorial: tokenizer_training.ipynb from the huggingface/notebooks repository on GitHub.

So, using this code, I load my custom dataset:

from datasets import load_dataset
dataset = load_dataset('csv', data_files=['/content/drive/MyDrive/mydata.csv'])

Then, I use this code to take a look at the dataset:

dataset

Access an element:

dataset['train'][1]

Access a slice directly:

dataset['train'][:5]

After executing the above code successfully, I try to execute this:

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

However, I get this error:

KeyError: "Invalid key: slice(0, 1000, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(0, 1000, None)]`. Available splits: ['train']"

How do I fix this?

I am trying to train my own tokenizer, and this seems to be an issue.

Any help would be appreciated!

When asking for help on the forum, please paste all relevant code. In this case, you did not paste the definition of batch_iterator.
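
Assuming you copied batch_iterator unchanged from the notebook, it looks roughly like this (the "text" column name comes from the notebook's dataset, so adjust it to your CSV):

def batch_iterator(batch_size=1000):
    # Yield the "text" column in chunks of 1000 rows; the slice(0, 1000, None)
    # in your traceback comes from the first dataset[0:1000] lookup here.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]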

If you are following the notebook, note that calling load_dataset without a split argument does not return a single dataset but a DatasetDict keyed by split (here just 'train'), which is why slicing it directly raises this error. You should add the split="train" argument when you load your dataset, or adapt the code of batch_iterator to index into your dataset dictionary.
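
A minimal sketch of the first option, assuming your CSV has a "text" column and that tokenizer is the pretrained tokenizer already loaded as in the notebook:

from datasets import load_dataset

# Passing split='train' returns a single Dataset instead of a DatasetDict,
# so the slicing inside batch_iterator works as the notebook expects.
dataset = load_dataset(
    'csv',
    data_files=['/content/drive/MyDrive/mydata.csv'],
    split='train',
)

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)

If you prefer to keep your original load_dataset call, the alternative is to index the split inside batch_iterator instead, e.g. dataset['train'][i : i + batch_size]["text"].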

Okay, thanks for that. I have now trained my own tokenizer from scratch, so how do I use it for the masked language modeling task?