Added Tokens Not Decoding with Spaces

mcrchopra · October 19, 2021, 3:21am

Hi All,

My goal is to add a set of starting tokens to a pre-trained AlbertTokenizerFast.

In the pre-trained ALBERT vocab (SentencePiece model), all word-start tokens are prefixed with the meta-symbol ▁ (e.g. ▁hamburger).

I tried adding tokens, prefixed with the meta symbol:

from tokenizers import AddedToken
new_tokens = [AddedToken("▁hamburger"), AddedToken("▁pizza")]
num_added_tokens = tokenizer.add_tokens(new_tokens)

However, as this forum post shows, the text passed to AddedToken is treated literally, so manually adding the meta-symbol prefix doesn’t achieve the desired effect.

Instead, I tried using the single_word parameter:

new_tokens = [AddedToken("hamburger", single_word=True, lstrip=True), AddedToken("pizza", single_word=True, lstrip=True)]
num_added_tokens = tokenizer.add_tokens(new_tokens)

This approach encodes the new tokens successfully; here, hamburger is encoded as id 30001:

tokenizer('This hamburger tastes great')['input_ids']
>> [2, 15, 30001, 53, 8, 345, 3]

However, when I try to decode these ids, no space appears between “This” and “hamburger”:

tokenizer.decode([2, 15, 30001, 53, 8, 345,3]) 
>> ('Thishamburger tastes great')

I was wondering if anybody had any thoughts about how to fix this.

nielsr · October 19, 2021, 12:07pm

Does the same occur when setting lstrip=False when defining the new tokens?

mcrchopra · October 19, 2021, 4:19pm

Thank you for the response!

Yup, if I set lstrip=False, I see the same behavior:

tokenizer('This hamburger tastes great')['input_ids']
>> [2, 15, 30001, 53, 8, 345, 3]

tokenizer.decode([2, 15, 30001, 53, 8, 345,3]) 
>> ('Thishamburger tastes great')

Digging through the code, my hypothesis is:

The pre-tokenizer replaces each space with the meta-symbol when applied to raw text.
The decoder reverses the effect of the pre-tokenizer.
When the decoder sees a token with a meta-symbol, it inserts a space (so the text displays correctly).
Since AddedTokens don’t go through the same pipeline (i.e. no meta-symbol is added), I’m not sure whether the pre-tokenizer is applied or whether decoding works as expected.
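
To make the hypothesis concrete, here is a minimal pure-Python sketch of the round trip. It is a simplified model of the metaspace convention, not the actual tokenizers implementation: the function names pre_tokenize and decode are hypothetical, and the model assumes spaces in raw text become the meta-symbol at pre-tokenization time and are restored at decode time.

```python
# Simplified model of the SentencePiece-style metaspace convention.
# This is NOT the tokenizers library code; all names here are hypothetical.
META = "\u2581"  # the "▁" meta-symbol


def pre_tokenize(text):
    """Split on spaces and mark the start of each word with the meta-symbol."""
    return [META + word for word in text.split(" ") if word]


def decode(tokens):
    """Join tokens, turn meta-symbols back into spaces, drop the leading space."""
    return "".join(tokens).replace(META, " ").lstrip(" ")


pieces = pre_tokenize("This hamburger tastes great")
print(pieces)          # ['▁This', '▁hamburger', '▁tastes', '▁great']
print(decode(pieces))  # This hamburger tastes great

# An added token stored literally as "hamburger" carries no meta-symbol,
# so this decoder has nothing to turn back into a space before it.
with_added = [META + "This", "hamburger", META + "tastes", META + "great"]
print(decode(with_added))  # Thishamburger tastes great
```

If the real decoder behaves like this model, the missing space would come from the added token's stored text lacking the meta-symbol, rather than from the lstrip or single_word settings.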

Any thoughts on what could be going wrong? Or how one might approach this?

louisowen6 · January 19, 2024, 5:49am

Hi @mcrchopra, I’m also facing the same issue. Were you able to solve it?
