Hi, I would like to train my own ByteLevelBPETokenizer
using a dataset from the datasets library.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=???, vocab_size=52000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
The dataset is from:
from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')
How can I process this dataset so that I can pass it to the
tokenizer.train() function?
Thanks!
lhoestq
You can take a look at the example script here:
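In the meantime, here is a minimal sketch of one way to do it. Recent versions of tokenizers let you train directly from a Python iterator via train_from_iterator, so you don't need to write intermediate files. This assumes the raw_en examples expose their content in a "text" column (check dataset.column_names if unsure):

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset('wikicorpus', 'raw_en', split='train')

# Yield the text column in batches so the whole corpus is never
# materialized in memory at once.
# NOTE: assumes each example has a "text" field; check dataset.column_names.
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt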
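If your version of tokenizers does not have train_from_iterator, the train(files=...) call from your post still works: just dump the text column to a plain-text file first and pass that path. The filename below is only an example:

# Dump the text column to a plain-text file, one document per line.
with open("wikicorpus_en.txt", "w", encoding="utf-8") as f:
    for example in dataset:
        f.write(example["text"] + "\n")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["wikicorpus_en.txt"],
    vocab_size=52000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

The iterator approach avoids the extra disk pass, but the file-based one is handy if you want to inspect or reuse the training text.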