BPEDecoder no spaces after special tokens

I have a custom BPE tokenizer with a BPEDecoder (to fix the extra spaces in the decoded output), but my decoded outputs have no spaces after special tokens.
I train the tokenizer as follows:

from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.decoder = decoders.BPEDecoder()

optional_specials = ["[BL]", "[BT1]", "[BT2]", "[BT3]", "[BT4]", "[BT5]", "[BT6]", "[BT7]", "[BT8]", "[BT9]", "[BT10]", "[BT11]", "[BT12]"]
special_tokens = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"] + optional_specials

trainer = BpeTrainer(special_tokens=special_tokens, end_of_word_suffix="</w>", vocab_size=4000)
tokenizer.pre_tokenizer = Whitespace()

tokenizer.save(f'./tokenizer/{filter}_bpe_tokenizer/tokenizer.json')
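The actual training call isn't shown above; roughly it looks like this, with my_texts standing in for my real corpus, plus a quick round trip at the tokenizers level to see what the decoder produces before Transformers is involved:

# my_texts is a placeholder for my real corpus iterator
tokenizer.train_from_iterator(my_texts, trainer=trainer)

# quick check directly against the tokenizers library, bypassing Transformers
ids = tokenizer.encode("[BL] MD3a6 [BT1] CCHHr6").ids
print(tokenizer.decode(ids, skip_special_tokens=False))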

(I was originally using BERT-like special tokens ("[CLS]", "[SEP]", etc.), but switched to the GPT-like ones as a test. Either way the spaces are missing.)

I load it into a Transformers tokenizer using PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"./tokenizer/{filter}_bpe_tokenizer/tokenizer.json")
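(As far as I understand, the wrapper only learns which tokens play the unk/bos/eos/pad/mask roles if you pass them explicitly via the standard special-token kwargs. I'm not sure it matters for the spacing, but for completeness, something like this; optional_specials is the same list as in the training script above.)

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=f"./tokenizer/{filter}_bpe_tokenizer/tokenizer.json",
    unk_token="[UNK]",
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
    mask_token="<mask>",
    additional_special_tokens=optional_specials,  # same list as used for training
)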

When I try to decode:

decode_test = fast_tokenizer.decode(test)

the result is correct except that there are no spaces after any of my optional_specials, e.g.:

... [BL]MD3a6 [BT1]CCHHr6 DM100 ST27 ST34 ST36 ...

instead of:

... [BL] MD3a6 [BT1] CCHHr6 DM100 ST27 ST34 ST36 ...

It looks like the decoder might be trying to strip or clean up around the special tokens in the decoded output, but I'm not sure. How do I fix this? I need the optional_specials in the output because another process uses them to segment the result.
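In case the problem is the Transformers-side cleanup rather than the decoder itself, decode does take explicit flags for both behaviours, so that's one variable to isolate (I haven't confirmed it makes a difference here):

decode_test = fast_tokenizer.decode(
    test,
    skip_special_tokens=False,           # keep the [BL]/[BTn] tokens in the output
    clean_up_tokenization_spaces=False,  # disable the wrapper's whitespace cleanup
)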

I’ve worked around it by removing the Tokenizers decoder and instead using:

tokenizer = Tokenizer(BPE(unk_token="[UNK]", continuing_subword_prefix="##"))

That seems to be giving me what I'd expect. Maybe there's something funky about how the tokenizers and transformers libraries handle decoders; anyway, this seems like it will work for what I need.
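For anyone checking the same thing, the model setting and the saved file can be inspected like this (the JSON path assumes the save location from the first snippet):

import json

print(tokenizer.model.continuing_subword_prefix)  # expecting "##"
print(tokenizer.decoder)                          # expecting None, since I removed it

with open(f'./tokenizer/{filter}_bpe_tokenizer/tokenizer.json') as f:
    print(json.load(f)["model"]["continuing_subword_prefix"])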

But I'm noticing that the saved tokenizer.json shows "continuing_subword_prefix": null, and I have no idea why.

If this doesn't end up avoiding a bunch of whitespace inside my output words, then I'll go back to the decoder-based version and just post-process to restore the spaces between the specials and the words that follow them.
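If I do fall back to that, the post-processing would be roughly a regex that re-inserts a space after any special token glued to the next word, something like:

import re

# rough sketch: add a space after any of the optional specials that is
# immediately followed by a non-space character
pattern = "|".join(re.escape(tok) for tok in optional_specials)
decode_test = re.sub(rf"({pattern})(?=\S)", r"\1 ", decode_test)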

I’m noticing too that tokenizer.train_from_iterator seems to erase the continuing_subword_prefix setting. Did you ever solve this?

For the HF devs, to replicate:

import datasets
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

wiki103 = datasets.load_dataset(
    path="wikitext",
    name="wikitext-103-raw-v1",
    split="train",
)

tokenizer = Tokenizer(
    model=BPE(
        unk_token="[UNK]",
        continuing_subword_prefix="##",
    )
)
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.pre_tokenizer = Whitespace()

assert tokenizer.model.continuing_subword_prefix == "##"  # passes
tokenizer.train_from_iterator(
    iterator=wiki103["text"],
    trainer=trainer,
)
assert tokenizer.model.continuing_subword_prefix == "##"  # AssertionError: the prefix is now None
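One thing that might be worth trying (I haven't verified that it survives training): BpeTrainer also accepts a continuing_subword_prefix argument, so the setting can be passed to the trainer as well as to the model:

trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##",  # mirror the prefix set on the BPE model
)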

I'm having the same issue after tokenizer.train_from_iterator().