Slow Tokenizer adds whitespace after special token

@ArthurZ
Sorry I misssed some information, but I鈥檓 still confused after reading your reply at Slow Tokenizer adds whitespace after special token 路 Issue #25073 路 huggingface/transformers 路 GitHub.
As mentioned in https://github.com/huggingface/transformers/pull/24565#issuecomment-1656719744
How could a word that comes after a special token be considered part of that special token?

Suppose we add a special token <bot> into LlamaTokenizer.

    txt = "<bot>" + "How are you" 
    tokenizer1 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
    )
    tokenizer2 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )

    t1 = tokenizer1.tokenize(txt)
    t2 = tokenizer2.tokenize(txt)

The result is as follws:

t1:[ '<bot>','鈻丠ow', '鈻乤re', '鈻亂ou']
t2:[ '<bot>', 'How', '鈻乤re', '鈻亂ou']

If we consider <bot> and How to be part of the same word, tokenizer1 performs correctly. However, if we consider them to be two different words, then tokenizer2 would be better. Although the original text is <bot>How, I don鈥檛 think it鈥檚 a legitimate word. Hence, tokenizer2 seems to correct it."

If there are any inaccuracies about my points, please feel free to correct them.

Hey! Glad you pinged me here :wink: !
So I totally agree with you, they are different words. I don鈥檛 know why your question implies that I meant that a word should be part of a special token, but no indeed it is not.

Even if we consider <bot> and How to be part of the same word, [ '<bot>','鈻丠ow'] is still wrong. The '鈻' token is not there to split words, it鈥檚 a space. So for example '鈻乤re' is a different token from are. '鈻乤re' is a beginning of a word, ( meaning there is always a space before it) while 'are' is in a word like 'care' or 'aware' etc (depending on the merges and vocab).

1 Like

In the discussion related to my comment in #25073, we鈥檝e observed distinct behaviors in the case of words like <bot>traditionally when setting legacy to different values.

# when setting it to True
['<bot>', '鈻乼radition', 'ally']
# when setting it to False
['<bot>', 'tradition', 'ally']

The setting True adds an unintended space between <bot> and tradition , possibly due to a mistake or bug. However, it does correct the spelling as these two parts are not typically perceived as a single word. It鈥檚 likely a typographical error, and the intended input should be <bot> traditionally rather than <bot>traditionally . I initially thought of this as a peculiar 鈥渇eature鈥 that didn鈥檛 necessitate any changes.
However, upon further consideration, encoding text such as <bot>.. should ideally tokenize to ['<bot>', '.']. Without any modification, it could result in ['<bot>', '鈻.'].

In conclusion, it鈥檚 crucial and worthy of our attention and effort.

Thanks again for your quickly reply and great work(about this PR).

Could you reming me why you need legacy=True?
Because

# when setting it to False
['<bot>', 'tradition', 'ally']

seems correct no?
If you have an input that is <bot> Hey and you want to force this to encode to <bot>Hey either removing prefix space, or setting rstrip to True could fix this no?

When setting legacy=True: If I forgot to add a space between <bot> and traditionally , It can be fixed by adding an 鈥渆xtra鈥 space between them.

['<bot>', '鈻乼radition', 'ally']

The premise here is that the token after the special token is treated as an individual token, even though we don鈥檛 add a space before it.