I have a custom BPE tokenizer with a BPEDecoder (to fix the extra spaces in the decoded output), but my decoded outputs have no spaces after special tokens.
I train the tokenizer as follows:
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.decoder = decoders.BPEDecoder()
optional_specials = ["[BL]", "[BT1]", "[BT2]", "[BT3]", "[BT4]", "[BT5]", "[BT6]", "[BT7]", "[BT8]", "[BT9]", "[BT10]", "[BT11]", "[BT12]"]
special_tokens = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"] + optional_specials
trainer = BpeTrainer(special_tokens=special_tokens, end_of_word_suffix="</w>", vocab_size=4000)
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(corpus_files, trainer=trainer)  # corpus_files: paths to my training files (omitted here)
tokenizer.save(f'./tokenizer/{filter}_bpe_tokenizer/tokenizer.json')
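(As far as I can tell, decoders.BPEDecoder defaults to the same "</w>" suffix I pass as end_of_word_suffix to the trainer, so I believe the decoder line above is equivalent to spelling it out explicitly:)

# Same as the decoder line above, with the end-of-word suffix written out
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")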
(I was originally using BERT-like special tokens ("[CLS]", "[SEP]", etc.), but switched to the GPT-like ones as a test; either way the spaces are missing.)
I load it into a Transformers tokenizer using PreTrainedTokenizerFast:
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"./tokenizer/{filter}_bpe_tokenizer/tokenizer.json")
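I don't pass any special-token arguments when constructing it. If it matters, I assume the wrapper's own view of the special tokens can be inspected with something like the following (attribute names from the Transformers docs; I'm not certain they reflect the added tokens stored in tokenizer.json):

# Sanity check: which tokens does the fast wrapper itself treat as special?
print(fast_tokenizer.all_special_tokens)
print(fast_tokenizer.additional_special_tokens)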
When I try to decode:
decode_test = fast_tokenizer.decode(test)
the result is correct except that there are no spaces after any of my optional_specials, e.g.:
... [BL]MD3a6 [BT1]CCHHr6 DM100 ST27 ST34 ST36 ...
instead of:
... [BL] MD3a6 [BT1] CCHHr6 DM100 ST27 ST34 ST36 ...
This looks like it might be related to stripping special tokens from the decoded output, but I'm not sure. How do I fix it? I want the optional_specials in the output because I have another process that uses them to segment the result.
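For reference, here is a minimal snippet that reproduces it for me (the input string is just lifted from the example above, so the content itself isn't meaningful):

# Encode a short sequence containing two of the optional specials, then decode it back
ids = fast_tokenizer("[BL] MD3a6 [BT1] CCHHr6 DM100")["input_ids"]
print(fast_tokenizer.decode(ids))
# prints "[BL]MD3a6 [BT1]CCHHr6 DM100", i.e. the spaces after [BL] and [BT1] are gone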