Hi, I trained a tokenizer whose tokens contain spaces. When I decode, the decode method adds a space between tokens, which makes the result wrong; I need to avoid those extra spaces. How can I do that?
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer")
test_text = """
Ez ĂȘ di vĂȘ gotarĂȘ da qala ĂȘn ku ez guhdar Ă» temaĆe dikim bikim
"""
tokens = tokenizer.tokenize(test_text)
print(f"Tokens: {tokens}")
# Tokens: ['\n', 'Ez ĂȘ ', 'di vĂȘ ', 'got', 'arĂȘ ', 'da ', 'qala ', 'ĂȘn ku ', 'ez ', 'guh', 'dar Ă» ', 'temaĆe ', 'dikim ', 'bikim', '\n']
ids = tokenizer.encode(test_text)
print(f"IDs: {ids}")
# IDs: [6, 6271, 1323, 452, 462, 396, 2409, 566, 654, 1204, 3278, 4543, 7880, 7595, 6]
text = tokenizer.decode(ids)
print(f"text: {text}")
# text:
# Ez ĂȘ di vĂȘ got arĂȘ da qala ĂȘn ku ez guh dar Ă» temaĆe dikim bikim
As you can see, it adds extra spaces between tokens when decoding. I know I can do something like the snippet below, but I am curious whether transformers supports something like this built-in:
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]
"".join(individual_tokens)