Slow Tokenizer adds whitespace after special token

@ArthurZ
Sorry, I missed some information, but I’m still confused after reading your reply at Slow Tokenizer adds whitespace after special token · Issue #25073 · huggingface/transformers · GitHub.
As mentioned in https://github.com/huggingface/transformers/pull/24565#issuecomment-1656719744: how could a word that comes after a special token be considered part of that special token?

Suppose we add a special token <bot> to LlamaTokenizer.

    from transformers import LlamaTokenizer

    txt = "<bot>" + "How are you"
    tokenizer1 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
    )
    tokenizer2 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )

    # register <bot> as an added special token on both tokenizers
    tokenizer1.add_tokens(["<bot>"], special_tokens=True)
    tokenizer2.add_tokens(["<bot>"], special_tokens=True)

    t1 = tokenizer1.tokenize(txt)
    t2 = tokenizer2.tokenize(txt)

The result is as follows:

t1: ['<bot>', '▁How', '▁are', '▁you']
t2: ['<bot>', 'How', '▁are', '▁you']

If we consider <bot> and How to be parts of the same word, tokenizer1 performs correctly. However, if we consider them to be two different words, then tokenizer2 is better. Although the original text is <bot>How, I don’t think it’s a legitimate word, so tokenizer2 seems to correct it.

If there are any inaccuracies about my points, please feel free to correct them.

Hey! Glad you pinged me here :wink: !
So I totally agree with you: they are different words. I’m not sure why your question implies that I meant a word should be part of a special token; indeed, it is not.

Even if we consider <bot> and How to be part of the same word, ['<bot>', '▁How'] is still wrong. The '▁' token is not there to split words, it’s a space. So, for example, '▁are' is a different token from 'are'. '▁are' marks the beginning of a word (meaning there is always a space before it), while 'are' appears inside a word such as 'care' or 'aware', etc. (depending on the merges and vocab).
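
If it helps, here is a quick way to check that they are two separate vocab entries (a small sketch, assuming the same local checkpoint path as in your snippet):

    from transformers import LlamaTokenizer

    tok = LlamaTokenizer.from_pretrained("./resources/models/llama-2-7b-hf", use_fast=False)

    # '▁are' (word-initial, i.e. preceded by a space) and 'are' (word-internal, as in
    # 'care' or 'aware') are looked up as two distinct pieces in the SentencePiece vocab,
    # so they map to different ids (or to the unk id if a piece is missing).
    print(tok.convert_tokens_to_ids("▁are"))
    print(tok.convert_tokens_to_ids("are"))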


In the discussion related to my comment in #25073, we’ve observed distinct behaviors for inputs like <bot>traditionally when setting legacy to different values.

# when setting it to True
['<bot>', '▁tradition', 'ally']
# when setting it to False
['<bot>', 'tradition', 'ally']

Setting legacy=True adds an unintended space between <bot> and tradition, possibly due to a mistake or bug. However, it does "correct" the input, since these two parts are not typically perceived as a single word: it’s likely a typographical error, and the intended input should be <bot> traditionally rather than <bot>traditionally. I initially thought of this as a peculiar “feature” that didn’t necessitate any changes.
However, upon further consideration, encoding text such as <bot>. should ideally tokenize to ['<bot>', '.']. Without any modification, it could result in ['<bot>', '▁.'].
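
For reference, here is that comparison as a small sketch, reusing tokenizer1 (legacy=True) and tokenizer2 (legacy=False) from my first snippet; the expected outputs are the ones quoted above, not re-verified results:

    # legacy=True: an extra space sneaks in after the special token
    print(tokenizer1.tokenize("<bot>."))  # expected: ['<bot>', '▁.']
    # legacy=False: the following token is left untouched
    print(tokenizer2.tokenize("<bot>."))  # expected: ['<bot>', '.']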

In conclusion, I think this behavior is worth our attention and effort.

Thanks again for your quick reply and great work on this PR.

Could you remind me why you need legacy=True?
Because

# when setting it to False
['<bot>', 'tradition', 'ally']

seems correct, no?
If you have an input that is <bot> Hey and you want to force it to encode as <bot>Hey, either removing the prefix space or setting rstrip to True could fix this, no?
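
Something like the following should do it (an untested sketch, assuming the same local checkpoint path as in your snippet): registering <bot> as an AddedToken with rstrip=True strips the space that follows it before tokenization.

    from transformers import AddedToken, LlamaTokenizer

    tok = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )
    # rstrip=True removes whitespace to the right of the special token
    tok.add_tokens([AddedToken("<bot>", rstrip=True, lstrip=False)], special_tokens=True)
    print(tok.tokenize("<bot> Hey"))  # the space after <bot> should be stripped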

When setting legacy=True: if I forget to add a space between <bot> and traditionally, it gets fixed by the “extra” space added between them.

['<bot>', '▁tradition', 'ally']

The premise here is that the text after the special token is treated as a separate word, even though we didn’t add a space before it.
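
Here is that premise as a small sketch with tokenizer1 (legacy=True) from my first snippet; the expected outputs are the ones discussed above:

    # With legacy=True the missing space is effectively added back, so both inputs
    # should come out the same:
    print(tokenizer1.tokenize("<bot>traditionally"))   # expected: ['<bot>', '▁tradition', 'ally']
    print(tokenizer1.tokenize("<bot> traditionally"))  # expected: ['<bot>', '▁tradition', 'ally']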