Tokenizer mapping the same token to multiple token_ids

Hey! This is because of the special ▁ character (U+2581) that SentencePiece uses to mark whitespace, i.e. the beginning of a word. You can check here for more about how the SentencePiece algorithm works.

In the case of "token", those are actually two different tokens at the string level, and you can see that by converting the ids to tokens. When you use the .decode() method, it drops the leading ▁ marker, which is why both ids appear to decode to the same string.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')
>>> tokenizer.convert_ids_to_tokens(6029)
'▁token'
>>> tokenizer.convert_ids_to_tokens(5263)
'token'
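
Going the other way round confirms they are distinct vocabulary entries, and decoding shows why they look identical. The decode outputs below are what a SentencePiece-based tokenizer typically returns, since decoding turns ▁ back into whitespace and strips it at the start of the string:

>>> tokenizer.convert_tokens_to_ids('▁token')
6029
>>> tokenizer.convert_tokens_to_ids('token')
5263
>>> tokenizer.decode(6029)  # ▁ becomes whitespace, stripped at the start
'token'
>>> tokenizer.decode(5263)
'token'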
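
You can also see when each variant is produced by tokenizing a sentence: a word-initial "token" gets the ▁ prefix, while the same characters inside a longer word can map to the bare piece. The split below is illustrative only; the exact pieces depend on this model's vocabulary:

>>> tokenizer.tokenize('a token to detokenize')  # illustrative split
['▁a', '▁token', '▁to', '▁de', 'token', 'ize']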