Tokenized sequence lengths

btw, Hugging Face people, I’m still wondering: is there any way to force a larger vocabulary during tokenizer training? Presumably that would just mean doing more merges, no? Shouldn’t there be a parameter to force a larger vocab if you want one?

EDIT: I notice I was apparently getting the 4,000-token vocab when I posted this, but that’s not the case now… I request vocab_size=4000 and I get 2026. Hmm…
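
For context, here’s a minimal sketch of the kind of setup I mean, training a BPE tokenizer with the `tokenizers` library (the corpus path and special tokens are just placeholders, not my actual setup). My understanding is that vocab_size acts as an upper bound: if the corpus doesn’t yield enough merges, the final vocab comes out smaller than requested, which might be what’s happening with the 2026.

```python
# Minimal sketch (placeholder corpus path and special tokens), just to show
# where vocab_size goes when training a BPE tokenizer with the `tokenizers` library.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is requested here, but (as far as I can tell) it's an upper bound:
# if the corpus runs out of useful merges, training stops early.
trainer = BpeTrainer(vocab_size=4000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # "corpus.txt" is a placeholder

print(tokenizer.get_vocab_size())  # on a small corpus this can come back well under 4000
```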