Trained tokenizer API as PreTrainedTokenizer

Hi. I have trained a BPE Tokenizer successfully using:

 from tokenizers import Tokenizer
 from tokenizers.models import BPE
 from tokenizers.trainers import BpeTrainer

 tokenizer = Tokenizer(BPE(unk_token=unk_token))
 trainer = BpeTrainer(special_tokens=special_tokens)
 tokenizer.train(text_files, trainer)

But I'm missing things like tokenizer.pad_token to access the padding token, or len(tokenizer) to get the vocabulary size. Is this the best way to train a BPE tokenizer? How can I get the same API as PreTrainedTokenizer? Thanks in advance for any help you can provide.

I'm not quite sure which problems you ran into, so I'll just explain how to handle padding and how to get the vocabulary size:

  1. Padding: the reason we put [PAD] in the special_tokens during training is to tell the trainer that this token must be in the vocabulary, but at that point the tokenizer doesn't yet know what each special token is for. Once training is done and you have a trained tokenizer instance, call tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]'), pad_token='[PAD]') to give [PAD] the role of padding token, reusing the entry that is already in the vocab. After .enable_padding() is set, tokenized sentences in a batch will be padded (see the sketch after this list).
  2. Vocab size: tokenizer.get_vocab_size()
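
For concreteness, here is a minimal sketch of both steps, continuing from the trained tokenizer above. It assumes '[PAD]' and '[UNK]' were among the special tokens passed to the trainer, and (for the last part) that the transformers library is installed; wrapping the trained tokenizer in PreTrainedTokenizerFast is one way to get the pad_token / len(tokenizer) style API asked about in the original question.

 from transformers import PreTrainedTokenizerFast  # assumption: transformers is installed

 # 'tokenizer' is the trained tokenizers.Tokenizer instance from the snippet above.
 # Assumption: '[PAD]' and '[UNK]' were in the special_tokens given to BpeTrainer.
 pad_id = tokenizer.token_to_id("[PAD]")
 tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")

 # Vocabulary size
 print(tokenizer.get_vocab_size())

 # Shorter sequences in a batch are now padded up to the longest one
 batch = tokenizer.encode_batch(["a short sentence", "a somewhat longer sentence here"])
 print(batch[0].ids)

 # One way to get the PreTrainedTokenizer-style API (pad_token, len(tokenizer), ...):
 fast_tokenizer = PreTrainedTokenizerFast(
     tokenizer_object=tokenizer, pad_token="[PAD]", unk_token="[UNK]"
 )
 print(fast_tokenizer.pad_token, len(fast_tokenizer))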