Hi. I have trained a BPE Tokenizer successfully using:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token=unk_token))
trainer = BpeTrainer(special_tokens=special_tokens)
tokenizer.train(text_files, trainer)
But I lack things like tokenizer.pad_token to access the padding token, or len(tokenizer) to get the vocabulary size. Is this the best way to train a BPE tokenizer? How can I get the same API as PreTrainedTokenizer? Thanks in advance for any help you can provide.
I'm not quite sure which problems you ran into, so I'll just explain how to handle padding and how to get the vocab size here:
- Padding: the reason we add [PAD] to the special tokens in the training phase is to reserve an entry for it in the vocabulary; at that point we have not yet told the tokenizer what each special token is for. After training, once you have a trained tokenizer instance, you can call tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]'), pad_token='[PAD]') to tell the tokenizer that [PAD], which is already in the vocab, should play the role of the padding token. Once .enable_padding() is set, batch encodings will be padded to a common length; see the sketch after this list.
- Vocab size: use tokenizer.get_vocab_size() (the len(tokenizer) shortcut you mention belongs to the transformers tokenizer API, not to a plain tokenizers.Tokenizer).
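For concreteness, here is a minimal end-to-end sketch of both points, assuming the same tokenizers library as in your snippet; the corpus path and the special-token list are made-up placeholders:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Placeholder training setup: "corpus.txt" and the token list are illustrative only.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["corpus.txt"], trainer)

# Point the padding machinery at the [PAD] entry that training put in the vocab.
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
)

# encode_batch pads every sequence to the longest one in the batch.
encodings = tokenizer.encode_batch(["a short sentence", "a noticeably longer example sentence"])
for enc in encodings:
    print(enc.tokens)          # the shorter entry ends with [PAD] tokens
    print(enc.attention_mask)  # 0 marks the padded positions

# Vocabulary size of the trained tokenizer.
print(tokenizer.get_vocab_size())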