How to prevent PreTrainedTokenizerFast.decode from adding spaces between tokens

Hi, I trained a tokenizer whose tokens contain spaces themselves. When I decode, the decode method adds a space between tokens, which corrupts the output, and I need to avoid that. How can I do it?

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer")

test_text = """
Ez ĂȘ di vĂȘ gotarĂȘ da qala ĂȘn ku ez guhdar Ă» temaße dikim bikim
"""

tokens = tokenizer.tokenize(test_text)

print(f"Tokens: {tokens}")
# Tokens: ['\n', 'Ez ĂȘ ', 'di vĂȘ ', 'got', 'arĂȘ ', 'da ', 'qala ', 'ĂȘn ku ', 'ez ', 'guh', 'dar Ă» ', 'temaße ', 'dikim ', 'bikim', '\n']


ids = tokenizer.encode(test_text)
print(f"IDs: {ids}")

# IDs: [6, 6271, 1323, 452, 462, 396, 2409, 566, 654, 1204, 3278, 4543, 7880, 7595, 6]

text = tokenizer.decode(ids)

print(f"text: {text}")
# text: 
# Ez ĂȘ  di vĂȘ  got arĂȘ  da  qala  ĂȘn ku  ez  guh dar Ă»  temaße  dikim  bikim

As you can see, it adds extra spaces between tokens when decoding. I know I can do something like the snippet below, but I am curious whether transformers supports something like this built-in:

individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]

"".join(individual_tokens)


Hmm
 clean_up_tokenization_spaces?
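
In case it helps, the flag can be passed straight to decode. Untested on this particular tokenizer, and as far as I know it mainly removes spaces before punctuation, so it may not address spaces between whole tokens:

# reusing the ids from the snippet above; this flag is a real decode() argument,
# but whether it helps here is just a guess
text = tokenizer.decode(ids, clean_up_tokenization_spaces=True)
print(f"text: {text}")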

I do not understand exactly what clean_up_tokenization_spaces does, but it does not prevent the extra spaces between tokens.


Hmm
 does it need add_prefix_space, or is PreTrainedTokenizerFast buggy?
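
Another thing worth checking: if the backend tokenizer has no decoder set, the Rust tokenizers library falls back to joining tokens with a single space when decoding, which would explain exactly this behaviour. A minimal sketch, untested on this specific model, assuming the tokenizers package is installed alongside transformers:

from transformers import PreTrainedTokenizerFast
from tokenizers import decoders

tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer")

# If this prints None, decode is joining tokens with spaces,
# which would explain the extra spaces in the output above.
print(tokenizer.backend_tokenizer.decoder)

# Fuse() concatenates the tokens without inserting anything between them.
tokenizer.backend_tokenizer.decoder = decoders.Fuse()

ids = tokenizer.encode("Ez ĂȘ di vĂȘ gotarĂȘ da qala ĂȘn ku ez guhdar Ă» temaße dikim bikim")
print(tokenizer.decode(ids))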