Slow Tokenizer adds whitespace after special token

Sorry I missed some information, but I'm still confused after reading your reply at Slow Tokenizer adds whitespace after special token · Issue #25073 · huggingface/transformers · GitHub.
As mentioned there:
How could a word that comes after a special token be considered part of that special token?

Suppose we add a special token <bot> into LlamaTokenizer.

    from transformers import LlamaTokenizer

    txt = "<bot>" + "How are you"
    tokenizer1 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=True, use_fast=False
    )
    tokenizer2 = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )

    # Register <bot> as a special token on both tokenizers.
    tokenizer1.add_tokens(["<bot>"], special_tokens=True)
    tokenizer2.add_tokens(["<bot>"], special_tokens=True)

    t1 = tokenizer1.tokenize(txt)
    t2 = tokenizer2.tokenize(txt)

The result is as follows:

t1: ['<bot>', '▁How', '▁are', '▁you']
t2: ['<bot>', 'How', '▁are', '▁you']

If we consider <bot> and How to be part of the same word, tokenizer1 performs correctly. However, if we consider them to be two different words, then tokenizer2 would be better. Although the original text is <bot>How, I don't think it's a legitimate word; hence, tokenizer2 seems to correct it.

If there are any inaccuracies about my points, please feel free to correct them.

Hey! Glad you pinged me here :wink: !
So I totally agree with you, they are different words. I don't know why your question implies that I meant a word should be part of a special token, but no, indeed it is not.

Even if we consider <bot> and How to be part of the same word, ['<bot>', '▁How'] is still wrong. The '▁' is not there to split words, it's a space. So for example '▁are' is a different token from 'are': '▁are' is the beginning of a word (meaning there is always a space before it), while 'are' occurs inside a word like 'care' or 'aware' etc. (depending on the merges and vocab).
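
To make the distinction concrete, here is a minimal sketch, reusing the hypothetical local checkpoint path from the snippet above. The two strings map to two different vocabulary entries (the exact ids depend on the vocab):

    from transformers import LlamaTokenizer

    tok = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )

    # '▁are' (word-initial, preceded by a space) and 'are' (word-internal,
    # as in 'care' or 'aware') are distinct entries in the vocabulary.
    print(tok.convert_tokens_to_ids("▁are"))  # one id
    print(tok.convert_tokens_to_ids("are"))   # a different id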


In the discussion related to my comment in #25073, we've observed distinct behaviors for words like <bot>traditionally when setting legacy to different values.

# when setting it to True
['<bot>', '▁tradition', 'ally']
# when setting it to False
['<bot>', 'tradition', 'ally']

Setting it to True adds an unintended space between <bot> and tradition, possibly due to a mistake or bug. However, it effectively corrects the spelling, since these two parts are not typically perceived as a single word: it's likely a typographical error, and the intended input should be <bot> traditionally rather than <bot>traditionally. I initially thought of this as a peculiar "feature" that didn't necessitate any changes.
However, upon further consideration, encoding text such as <bot>. should ideally tokenize to ['<bot>', '.']. Without any modification, it could result in ['<bot>', '▁.'].
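
A quick way to check both cases, reusing tokenizer1 (legacy=True) and tokenizer2 (legacy=False) from the first snippet, might look like this (the checkpoint path is hypothetical, so the exact pieces may vary with the vocab):

    # Expected per the outputs discussed above:
    print(tokenizer1.tokenize("<bot>traditionally"))  # ['<bot>', '▁tradition', 'ally']
    print(tokenizer2.tokenize("<bot>traditionally"))  # ['<bot>', 'tradition', 'ally']
    print(tokenizer1.tokenize("<bot>."))              # something like ['<bot>', '▁.']
    print(tokenizer2.tokenize("<bot>."))              # ['<bot>', '.']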

In conclusion, it's an issue worthy of our attention and effort.

Thanks again for your quick reply and great work on this PR.

Could you remind me why you need legacy=True?

# when setting it to False
['<bot>', 'tradition', 'ally']

seems correct, no?
If you have an input that is <bot> Hey and you want to force this to encode to <bot>Hey, either removing the prefix space or setting rstrip to True could fix this, no?
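
For the rstrip route, a minimal sketch (assuming the same hypothetical local checkpoint as above) could be:

    from transformers import AddedToken, LlamaTokenizer

    tok = LlamaTokenizer.from_pretrained(
        "./resources/models/llama-2-7b-hf", legacy=False, use_fast=False
    )
    # rstrip=True tells the tokenizer to strip whitespace to the right
    # of the special token before tokenizing the rest of the text.
    tok.add_tokens([AddedToken("<bot>", rstrip=True)], special_tokens=True)

    print(tok.tokenize("<bot> Hey"))  # the space after <bot> is consumed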

When setting legacy=True: if I forgot to add a space between <bot> and traditionally, it gets fixed by the "extra" space being added between them.

['<bot>', '▁tradition', 'ally']

The premise here is that the token after the special token is treated as an individual token, even though we don't add a space before it.