Hi. I have successfully trained a BPE tokenizer using:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token=unk_token))
trainer = BpeTrainer(special_tokens=special_tokens)
But I am missing things like
tokenizer.pad_token to access the padding token, or
len(tokenizer) to get the vocabulary size. Is this the best way to train a BPE tokenizer? How can I get the same API as PreTrainedTokenizer? Thanks in advance for any help you can provide.
I'm not quite sure which problems you ran into, so I'll just explain how to handle padding and how to get the vocab size:
- Padding: the reason we include
[PAD] among the special tokens during training is to reserve a slot for it in the vocabulary; at that point the tokenizer does not yet know what each special token is for. Once training has produced a tokenizer instance, you can call
tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]')) to assign
[PAD] the role of padding token, using the entry we already set up in the vocab. After calling
.enable_padding(), tokenized sentences may contain padding.
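The padding flow above can be sketched end to end. This is a minimal runnable example, assuming the `tokenizers` library; the toy training corpus and the `[UNK]`/`[PAD]` token names are illustrative choices, not taken from your setup:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Train a tiny BPE tokenizer on toy data (stand-in for a real corpus)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(["hello world", "hello there"], trainer)

# After training, look up [PAD]'s id and register it for padding
pad_id = tokenizer.token_to_id("[PAD]")
tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")

# Batches are now padded to the longest sequence in the batch
encodings = tokenizer.encode_batch(["hello", "hello world there"])
lengths = {len(e.ids) for e in encodings}
print(len(lengths))  # 1: all sequences share one padded length
```

Note that `enable_padding` only affects batch encoding; a single `encode` call is not padded unless you also pass a fixed `length`.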
- Vocab size: call tokenizer.get_vocab_size() on the trained tokenizer; it plays the role of len(tokenizer) in the transformers API.