Tokenizer mapping the same token to multiple token_ids

Hey! This is because of the special ▁ character (U+2581) that SentencePiece uses to mark whitespace, i.e. the beginning of a word. You can check here for more about how the SentencePiece algorithm works.

In the case of "token", those are actually two different tokens at the string level, and you can see that by converting the ids to tokens. When you use the .decode() method, it drops the leading ▁ marker, which is why both ids appear to decode to the same string.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')
>>> tokenizer.convert_ids_to_tokens(6029)
'▁token'
>>> tokenizer.convert_ids_to_tokens(5263)
'token'
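
Going the other way round confirms they are distinct vocabulary entries, and decoding shows why they look identical. The decode outputs below are what a SentencePiece-based tokenizer typically returns, since decoding turns ▁ back into whitespace and strips it at the start of the string:

>>> tokenizer.convert_tokens_to_ids('▁token')
6029
>>> tokenizer.convert_tokens_to_ids('token')
5263
>>> tokenizer.decode(6029)  # ▁ becomes whitespace, stripped at the start
'token'
>>> tokenizer.decode(5263)
'token'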
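
You can also see when each variant is produced by tokenizing a sentence: a word-initial "token" gets the ▁ prefix, while the same characters inside a longer word can map to the bare piece. The split below is illustrative only; the exact pieces depend on this model's vocabulary:

>>> tokenizer.tokenize('a token to detokenize')  # illustrative split
['▁a', '▁token', '▁to', '▁de', 'token', 'ize']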