Added Tokens Not Decoding with Spaces

Hi All,

My goal is to add a set of word-initial tokens to a pre-trained AlbertTokenizerFast.

In the pre-trained ALBERT vocabulary (a SentencePiece model), every word-initial token is prefixed with the meta-symbol ▁ (e.g. ▁hamburger).

I tried adding tokens, prefixed with the meta symbol:

from tokenizers import AddedToken

new_tokens = [AddedToken("▁hamburger"), AddedToken("▁pizza")]
num_added_tokens = tokenizer.add_tokens(new_tokens)

However, as this forum post shows, the text passed to AddedToken is treated literally, so manually adding the meta-symbol prefix doesn’t achieve the desired effect.
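
Here is a toy illustration of why the prefixed form never matches. It assumes added tokens are searched for literally in the raw text, before any step has replaced spaces with ▁. The helper name find_added_token is hypothetical, and this is not the library's actual matching code:

```python
def find_added_token(raw_text: str, token: str) -> int:
    """Return the index of a literal occurrence of `token` in `raw_text`, or -1."""
    return raw_text.find(token)


raw_text = "This hamburger tastes great"

# The raw text contains ordinary spaces, never the meta-symbol,
# so the prefixed token cannot be found.
with_meta = find_added_token(raw_text, "\u2581hamburger")  # "▁hamburger"

# The bare form does occur literally in the raw text.
without_meta = find_added_token(raw_text, "hamburger")

print(with_meta, without_meta)
```

Under that assumption, only the un-prefixed form can ever be picked up from the input text.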

Instead, I tried using the single_word parameter:

new_tokens = [AddedToken("hamburger", single_word=True, lstrip=True), AddedToken("pizza", single_word=True, lstrip=True)]
num_added_tokens = tokenizer.add_tokens(new_tokens)

With this approach the new tokens are encoded correctly; here hamburger is mapped to token id 30001:

tokenizer('This hamburger tastes great')['input_ids']
>> [2, 15, 30001, 53, 8, 345, 3]

However, when I decode these ids, no space appears between “This” and “hamburger”:

tokenizer.decode([2, 15, 30001, 53, 8, 345, 3])
>> ('Thishamburger tastes great')

I was wondering if anybody had any thoughts about how to fix this.

Does the same occur when setting lstrip=False when defining the new tokens?

Thank you for the response!

Yup, if I set lstrip=False, I see the same behavior:

tokenizer('This hamburger tastes great')['input_ids']
>> [2, 15, 30001, 53, 8, 345, 3]
tokenizer.decode([2, 15, 30001, 53, 8, 345, 3])
>> ('Thishamburger tastes great')

Digging through the code, my hypothesis is:

  • The pre-tokenizer replaces each space character with the meta-symbol when applied to raw text.
  • The decoder reverses the effect of the pre-tokenizer.
  • When the decoder sees a token with a meta-symbol, it inserts a space (for correct viewing).
  • Since the AddedTokens don’t go through the same pipeline (i.e. no meta-symbol is added), I’m not sure whether the pre-tokenizer is applied to them, or whether decoding can work as expected.
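
To make the hypothesis concrete, here is a minimal sketch in plain Python. It is a toy model of a Metaspace-style pre-tokenizer and decoder, not the library's implementation; toy_pre_tokenize and toy_decode are hypothetical names, and the assumption is that an added token is stored without the meta-symbol:

```python
from typing import List

META = "\u2581"  # the SentencePiece meta-symbol "▁"


def toy_pre_tokenize(text: str) -> List[str]:
    """Mark the start of every word with the meta-symbol."""
    return [META + word for word in text.split()]


def toy_decode(tokens: List[str]) -> str:
    """Concatenate tokens, turn meta-symbols back into spaces, drop the leading space."""
    return "".join(tokens).replace(META, " ").lstrip(" ")


regular = toy_pre_tokenize("This hamburger tastes great")

# Assumption: the added token bypasses the pre-tokenizer, so it carries
# no word-start marker when it reaches the decoder.
with_added = [regular[0], "hamburger", regular[2], regular[3]]

decoded_regular = toy_decode(regular)
decoded_added = toy_decode(with_added)

print(decoded_regular)  # This hamburger tastes great
print(decoded_added)    # Thishamburger tastes great
```

In this toy model the missing space comes purely from the missing meta-symbol on the added token, which matches the behavior I see when decoding.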

Any thoughts on what could be going wrong? Or how one might approach this?