Tokenized sequence lengths

Digging in further, it looks like the difference must be between BPE and ByteLevelBPETokenizer (i.e., RoBERTa’s tokenizer). With the former, I get the 4000-item vocab I want, but the latter only gives me a 1300-item vocab (despite my passing 4000 as the vocab_size).
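
For reference, this is roughly what I’m doing with each (a sketch, not my exact script — `corpus.txt` stands in for my training file, and the exact counts obviously depend on the corpus):

```python
from tokenizers import Tokenizer, ByteLevelBPETokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

files = ["corpus.txt"]  # placeholder for my training data

# Plain BPE: this one learns the full 4000-item vocab
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()
bpe_tokenizer.train(files, BpeTrainer(vocab_size=4000, special_tokens=["[UNK]"]))
print(bpe_tokenizer.get_vocab_size())  # ~4000

# ByteLevelBPETokenizer (RoBERTa-style): asked for 4000,
# but only ~1300 tokens actually end up in the vocab in my case
bl_tokenizer = ByteLevelBPETokenizer()
bl_tokenizer.train(files, vocab_size=4000, min_frequency=2)
print(bl_tokenizer.get_vocab_size())  # ~1300

# save the plain-BPE one for later use
bpe_tokenizer.save("bpe-tokenizer.json")
```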

So to get what I’m after, I have to either:

  1. figure out how to get the BPE version into a tokenizer that plays nice with transformers (sketched after this list) OR
  2. figure out how to get the ByteLevelBPETokenizer to learn a 4000 item vocab
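
For option 1, a minimal sketch of what I have in mind, assuming the plain-BPE tokenizer from above was saved to `bpe-tokenizer.json` (placeholder path) — `PreTrainedTokenizerFast` can wrap a `tokenizers.Tokenizer` object directly:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# load the plain-BPE tokenizer trained with the tokenizers library
tok = Tokenizer.from_file("bpe-tokenizer.json")

# wrap it so it exposes the usual transformers tokenizer interface
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

print(hf_tokenizer.vocab_size)                      # should be ~4000
print(hf_tokenizer("some example text")["input_ids"])
```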