The Hugging Face tokenizers documentation says to use the following:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
However, it looks like the correct way to train a byte-level BPE is as follows:
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(["path/to/train.txt"], vocab_size=1000, min_frequency=2, special_tokens=[
Why is ByteLevelBPETokenizer a separate class, rather than just another model (like BPE) that is passed to Tokenizer?
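
For comparison, here is a minimal sketch of how I would expect a byte-level BPE to be assembled from the generic pieces. It assumes the byte-level behaviour comes from the ByteLevel pre-tokenizer and decoder wrapped around the ordinary BPE model, which is my reading of the library and may not match exactly what ByteLevelBPETokenizer does internally. The tiny in-memory corpus is made up for illustration only.

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers, trainers
from tokenizers.models import BPE

# Plain BPE model; no unk_token, since every byte is in the initial alphabet.
tokenizer = Tokenizer(BPE())

# Assumption: "byte-level" lives in the pre-tokenizer/decoder, not in the model.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    # Seed the vocabulary with the 256 byte-level symbols.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    show_progress=False,
)

# Hypothetical toy corpus, used here instead of a training file.
corpus = ["hello world", "hello there", "byte-level BPE works on bytes"] * 10
tokenizer.train_from_iterator(corpus, trainer=trainer)

text = "hello world"
encoding = tokenizer.encode(text)
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
```

If this is right, ByteLevelBPETokenizer would be a convenience wrapper around the same components rather than a different kind of model, but I would like confirmation.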