Hi, I trained a tokenizer whose tokens contain spaces. When I decode, the decode method adds a space between tokens, which makes the result wrong; I need to avoid those extra spaces. How can I do that?
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer")
test_text = """
Ez ĂȘ di vĂȘ gotarĂȘ da qala ĂȘn ku ez guhdar Ă» temaĆe dikim bikim
"""
tokens = tokenizer.tokenize(test_text)
print(f"Tokens: {tokens}")
# Tokens: ['\n', 'Ez ĂȘ ', 'di vĂȘ ', 'got', 'arĂȘ ', 'da ', 'qala ', 'ĂȘn ku ', 'ez ', 'guh', 'dar Ă» ', 'temaĆe ', 'dikim ', 'bikim', '\n']
ids = tokenizer.encode(test_text)
print(f"IDs: {ids}")
# IDs: [6, 6271, 1323, 452, 462, 396, 2409, 566, 654, 1204, 3278, 4543, 7880, 7595, 6]
text = tokenizer.decode(ids)
print(f"text: {text}")
# text:
# Ez ĂȘ di vĂȘ got arĂȘ da qala ĂȘn ku ez guh dar Ă» temaĆe dikim bikim
As you can see, it adds extra spaces between tokens when decoding. I know I can do something like the snippet below, but I am curious whether transformers supports something like this built-in:
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]
"".join(individual_tokens)