Digging in further, it looks like the difference must be between BPE and ByteLevelBPETokenizer (i.e., RoBERTa’s tokenizer). With the former, I get the 4000-item vocab I want, but the latter only gives me a 1300-item vocab (despite vocab_size being set to 4000).
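For context, the training calls look roughly like this (corpus path, special tokens, and min_frequency are placeholders, not my exact setup):

```python
from tokenizers import Tokenizer, ByteLevelBPETokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Plain BPE: learns the full 4000-item vocab
bpe_tok = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=4000, special_tokens=["[UNK]"])
bpe_tok.train(["corpus.txt"], trainer=trainer)
print(bpe_tok.get_vocab_size())  # ~4000, as expected

# ByteLevelBPETokenizer (RoBERTa-style): asked for 4000, but...
bl_tok = ByteLevelBPETokenizer()
bl_tok.train(["corpus.txt"], vocab_size=4000, min_frequency=2)
print(bl_tok.get_vocab_size())   # ~1300 in my case
```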
So to get what I’m after, I have to either:
- figure out how to get the BPE version into a tokenizer that plays nice with transformers (rough sketch below), OR
- figure out how to get the ByteLevelBPETokenizer to learn a 4000-item vocab
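For the first route, something like wrapping the trained Tokenizer object in PreTrainedTokenizerFast might do it (untested sketch; the special-token name is just an example):

```python
from transformers import PreTrainedTokenizerFast

# `bpe_tok` is the plain-BPE Tokenizer from the earlier sketch.
# Wrapping it lets transformers treat it like any other fast tokenizer.
hf_tok = PreTrainedTokenizerFast(
    tokenizer_object=bpe_tok,
    unk_token="[UNK]",
)
print(hf_tok.vocab_size)  # should report the full 4000
```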