The Hugging Face tokenizers documentation says to use the following:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
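That snippet only constructs the model; to actually train it, the docs pair it with a trainer and a pre-tokenizer. If I'm reading the quicktour correctly, the full recipe looks roughly like this (the file path is just a placeholder):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build the model, attach a pre-tokenizer, then train on a text file
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["path/to/train.txt"], trainer)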
However, it looks like the correct way to train a byte-level BPE is as follows:
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    ["path/to/train.txt"],
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
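From what I can tell, ByteLevelBPETokenizer is a convenience wrapper around the generic pieces, so roughly the same setup seems to be expressible with the plain Tokenizer API. Here is my sketch of the equivalent construction, assuming the wrapper's defaults (the ByteLevel pre-tokenizer/decoder wiring and the initial_alphabet seeding are my reading of its internals, not documented requirements):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Compose the same components the wrapper appears to wire up internally
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)  # the wrapper's default, I believe
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=ByteLevel.alphabet(),  # seed the vocab with all 256 byte symbols
)
tokenizer.train(["path/to/train.txt"], trainer)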
Why is the ByteLevelBPETokenizer not just a normal tokenizer model?