NLP dataset for ByteLevelBPETokenizer training

Hi, I would like to train my own ByteLevelBPETokenizer on a dataset loaded with the datasets library.

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=???, vocab_size=52000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

The dataset is from:

from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')

How can I process this dataset so that I can pass it to the tokenizer.train() function?

Thanks


You can take a look at the example script here:
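One way to feed the dataset to the tokenizer without going through files is to stream the text with train_from_iterator. This is only a sketch: it assumes a reasonably recent version of tokenizers (one that provides train_from_iterator) and that the wikicorpus "raw_en" configuration exposes its text in a "text" column of the "train" split.

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("wikicorpus", "raw_en", split="train")

def batch_iterator(batch_size=1000):
    # Yield lists of raw text strings so the whole corpus never has to be
    # materialized as one big Python list.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]  # assumes a "text" column

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("wikicorpus-bpe")  # writes vocab.json and merges.txt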

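If you would rather keep the tokenizer.train(files=...) call from the original snippet, another option is to dump the dataset to a plain-text file first and pass that path in. Again a sketch under the same assumption about a "text" column; the file name is just an example.

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("wikicorpus", "raw_en", split="train")

# Write one document per line; newlines inside a document are flattened
# so the file stays easy to stream.
with open("wikicorpus_raw_en.txt", "w", encoding="utf-8") as f:
    for example in dataset:
        f.write(example["text"].replace("\n", " ") + "\n")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["wikicorpus_raw_en.txt"],
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)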