NLP dataset for ByteLevelBPETokenizer Training

Hi, I would like to train my own ByteLevelBPETokenizer on a dataset loaded with the `datasets` library.

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=???, vocab_size=52000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

The dataset is from:

from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')

How can I process this dataset so that I can pass it to the tokenizer.train() function?



You can take a look at the example script here: