Tokenizer mapping the same token to multiple token_ids

Why do tokenizers sometimes map the same text to different tokens?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')

short_text = "tokenize"
long_text = "tokenizer.tokenize() vs tokenizer()"
encoded_short = tokenizer.encode(short_text, add_special_tokens=False)
encoded_long = tokenizer.encode(long_text, add_special_tokens=False)

print("===== SHORT TEXT =====")
for token_id in encoded_short:
    print(f"{token_id} : '{tokenizer.decode(token_id)}'")

print("===== LONG TEXT =====")
for token_id in encoded_long:
    print(f"{token_id} : '{tokenizer.decode(token_id)}'")

If you run the code I've provided above, you'll see the following result:

[screenshot of the script's output]

What confuses me about this result is that the same word ("tokenize" in this case) is tokenized differently at different points of the text. Specifically, the token "token" seems to have two corresponding token ids.


Here are some questions I have regarding the above:

  1. Is it normal that the same token is mapped to multiple token_ids?
  2. In the case that it IS normal, is there any way to know which token_id the token will be mapped to without having to encode the text?
  3. In the case that it IS NOT normal, would you recommend that I use a different model?

The problem is that I'm trying to create "labels" for a particular dataset by masking up to a specific index of the "input_ids" returned by the tokenizer. The code I am working on determines that index by searching for a specific sequence of tokens (the token_ids that correspond to a specific part of my input prompt). If the tokenizer I am using is not consistent in how it encodes the same tokens, this approach would probably fail.
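For context, here is a minimal sketch of the kind of index search I mean, using made-up token ids (the ids and the -100 masking convention are purely illustrative, not the actual Komodo vocabulary):

```python
def find_subsequence(haystack, needle):
    """Return the start index of `needle` inside `haystack`, or -1 if absent."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

# Toy ids standing in for tokenizer output (hypothetical values):
input_ids = [1, 6029, 675, 5263, 9, 42]

# Mask everything up to and including the marker sequence [5263, 9]:
marker = [5263, 9]
start = find_subsequence(input_ids, marker)
labels = [-100] * (start + len(marker)) + input_ids[start + len(marker):]
# labels == [-100, -100, -100, -100, -100, 42]
```

If the marker sequence is encoded differently in context than it was standalone, `find_subsequence` returns -1, which is exactly the failure mode I'm worried about.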

I would appreciate any guidance, and thank you for reading!

Hello,

I'm hoping to re-raise this question because I haven't made any progress toward figuring out this particular phenomenon.

The HuggingFace guide on tokenizers seems to imply that tokenizers are expected to be consistent, which is not what I am experiencing.


If anyone knows whether the phenomenon pointed out in the initial post is normal or abnormal, please let me know!

Hey! This is because of the special token used by SentencePiece to mark whitespace (or the beginning of a word). You can check here for more about how the SentencePiece algorithm works.

In the case of "token", those are actually two different tokens in string form, and you can see that by converting the ids to tokens. The .decode() method drops the '▁' marker, which is why they print identically.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')
>>> tokenizer.convert_ids_to_tokens(6029)
'ā–token'
>>> tokenizer.convert_ids_to_tokens(5263)
'token'

Thank you for that! I hadn't taken that into consideration, but that makes a lot of sense. I assume that is also part of why the same word may be tokenized differently at different points of a text.
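For future readers, the position-dependence can be simulated in plain Python. This is only a sketch of SentencePiece's whitespace ("metaspace") pre-tokenization step, assuming the default behavior of prepending a dummy space; the real algorithm then segments the marked text using its learned vocabulary:

```python
META = "\u2581"  # '▁', the SentencePiece whitespace marker

def mark_whitespace(text):
    # A dummy leading space is added first, so a word at the start of
    # the text is marked the same way as a word following a space.
    return (" " + text).replace(" ", META)

print(mark_whitespace("tokenize"))              # prints: ▁tokenize
print(mark_whitespace("tokenizer.tokenize()"))  # prints: ▁tokenizer.tokenize()
```

After this step, a word-initial piece like '▁token' and a word-internal piece like 'token' (e.g. the one following the '.') are simply different strings, so they end up as different vocabulary entries with different ids.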

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.