Hello,
I am following this tutorial: tokenizer_training.ipynb from the huggingface/notebooks repository on GitHub.
Using this code, I load my custom dataset:
from datasets import load_dataset
dataset = load_dataset('csv', data_files=['/content/drive/MyDrive/mydata.csv'])
Then, I use this code to take a look at the dataset:
dataset
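Since I loaded the CSV without specifying a split, this prints a DatasetDict with a single 'train' split, roughly like this (the feature names depend on my CSV columns, shown here with a placeholder 'text' column):

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: ...
    })
})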
Access an element:
dataset['train'][1]
Access a slice directly:
dataset['train'][:5]
After executing the above code successfully, I try to execute this here:
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
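For reference, batch_iterator is the helper defined earlier in the tutorial; I believe it looks roughly like this (assuming the text lives in a 'text' column):

def batch_iterator(batch_size=1000):
    # iterate over the dataset in chunks and yield lists of raw text
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]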
However, I get this error:
KeyError: "Invalid key: slice(0, 1000, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(0, 1000, None)]`. Available splits: ['train']"
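If I read the error right, it comes from batch_iterator slicing the whole DatasetDict instead of a single split, i.e. dataset[i : i + batch_size] fails, while selecting the split first seems to work:

dataset['train'][i : i + batch_size]

But I am not sure whether selecting the split is the proper fix, or where to apply it in the tutorial code.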
How do I fix this?
I am trying to train my own tokenizer, and this error is blocking me.
Any help would be appreciated!