I have a custom BPE tokenizer with a BPEDecoder (to fix the extra spaces in the decoded output), but my decoded outputs have no spaces after special tokens.
I train the tokenizer as follows:
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.decoder = decoders.BPEDecoder()
optional_specials = ["[BL]", "[BT1]", "[BT2]", "[BT3]", "[BT4]", "[BT5]", "[BT6]", "[BT7]", "[BT8]", "[BT9]", "[BT10]", "[BT11]", "[BT12]"]
special_tokens = ["<s>", "</s>", "<unk>", "<pad>", "<mask>"] + optional_specials
trainer = BpeTrainer(special_tokens=special_tokens, end_of_word_suffix="</w>", vocab_size=4000)
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(corpus_files, trainer=trainer)  # corpus_files: paths to my training files (omitted here)
tokenizer.save(f'./tokenizer/{filter}_bpe_tokenizer/tokenizer.json')
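(As far as I can tell, decoders.BPEDecoder defaults to the same "</w>" suffix I pass as end_of_word_suffix to the trainer, so I believe the decoder line above is equivalent to spelling it out explicitly:)

# Same as the decoder line above, with the end-of-word suffix written out
tokenizer.decoder = decoders.BPEDecoder(suffix="</w>")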
(I was originally using BERT-like special tokens ("[CLS]", "[SEP]", etc.), but switched to the GPT-like ones as a test; either way the spaces are missing.)
I load it into a Transformers tokenizer using PreTrainedTokenizerFast:
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"./tokenizer/{filter}_bpe_tokenizer/tokenizer.json")
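I don't pass any special-token arguments when constructing it. If it matters, I assume the wrapper's own view of the special tokens can be inspected with something like the following (attribute names from the Transformers docs; I'm not certain they reflect the added tokens stored in tokenizer.json):

# Sanity check: which tokens does the fast wrapper itself treat as special?
print(fast_tokenizer.all_special_tokens)
print(fast_tokenizer.additional_special_tokens)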
When I try to decode:
decode_test = fast_tokenizer.decode(test)
the result is correct except that there are no spaces after any of my optional_specials, e.g.:
... [BL]MD3a6 [BT1]CCHHr6 DM100 ST27 ST34 ST36 ...
instead of:
... [BL] MD3a6 [BT1] CCHHr6 DM100 ST27 ST34 ST36 ...
This looks like it might be related to stripping special tokens from the decoded output, but I'm not sure. How do I fix it? I want the optional_specials in the output because I have another process that uses them to segment the result.
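For reference, here is a minimal snippet that reproduces it for me (the input string is just lifted from the example above, so the content itself isn't meaningful):

# Encode a short sequence containing two of the optional specials, then decode it back
ids = fast_tokenizer("[BL] MD3a6 [BT1] CCHHr6 DM100")["input_ids"]
print(fast_tokenizer.decode(ids))
# prints "[BL]MD3a6 [BT1]CCHHr6 DM100", i.e. the spaces after [BL] and [BT1] are gone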