The Hugging Face tokenizers documentation says to use the following:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
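That snippet only constructs the model; to actually train it, the docs pair it with a trainer and a pre-tokenizer. If I'm reading the quicktour correctly, the full recipe looks roughly like this (the file path is just a placeholder):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build the model, attach a pre-tokenizer, then train on a text file
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["path/to/train.txt"], trainer)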
However, it looks like the correct way to train a byte-level BPE is as follows:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    ["path/to/train.txt"],
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
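From what I can tell, ByteLevelBPETokenizer is a convenience wrapper around the generic pieces, so roughly the same setup seems to be expressible with the plain Tokenizer API. Here is my sketch of the equivalent construction, assuming the wrapper's defaults (the ByteLevel pre-tokenizer/decoder wiring and the initial_alphabet seeding are my reading of its internals, not documented requirements):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Compose the same components the wrapper appears to wire up internally
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)  # the wrapper's default, I believe
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=ByteLevel.alphabet(),  # seed the vocab with all 256 byte symbols
)
tokenizer.train(["path/to/train.txt"], trainer)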
Why is the ByteLevelBPETokenizer not just a normal tokenizer model?