Tokenizer mapping the same token to multiple token_ids

Why do tokenizers sometimes map the same text to different tokens?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')

short_text = "tokenize"
long_text = "tokenizer.tokenize() vs tokenizer()"
encoded_short = tokenizer.encode(short_text, add_special_tokens=False)
encoded_long = tokenizer.encode(long_text, add_special_tokens=False)

print("===== SHORT TEXT =====")
for token_id in encoded_short:
    print(f"{token_id} : '{tokenizer.decode(token_id)}'")

print("===== LONG TEXT =====")
for token_id in encoded_long:
    print(f"{token_id} : '{tokenizer.decode(token_id)}'")

If you run the code I've provided above, you'll see the following result:

[screenshot of the script's output]

What confuses me about this result is that the same word ("tokenize" in this case) is tokenized differently at different points of the text. Specifically, the token "token" seems to have two corresponding token ids.


Here are some questions I have regarding the above:

  1. Is it normal that the same token is mapped to multiple token_ids?
  2. In the case that it IS normal, is there any way to know which token_id the token will be mapped to without having to encode the text?
  3. In the case that it IS NOT normal, would you recommend that I use a different model?

The problem is that I'm trying to create "labels" for a particular dataset by masking up to a specific index of the "input_ids" returned by the tokenizer. The code I am working on determines that index by searching for a specific sequence of tokens (the token_ids that correspond to a specific part of my input prompt). If the tokenizer I am using is not consistent in how it encodes the same tokens, this approach would probably fail.
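For context, here is a minimal sketch of the kind of index search I mean, using made-up token ids (the ids and the -100 masking convention are purely illustrative, not the actual Komodo vocabulary):

```python
def find_subsequence(haystack, needle):
    """Return the start index of `needle` inside `haystack`, or -1 if absent."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

# Toy ids standing in for tokenizer output (hypothetical values):
input_ids = [1, 6029, 675, 5263, 9, 42]

# Mask everything up to and including the marker sequence [5263, 9]:
marker = [5263, 9]
start = find_subsequence(input_ids, marker)
labels = [-100] * (start + len(marker)) + input_ids[start + len(marker):]
# labels == [-100, -100, -100, -100, -100, 42]
```

If the marker sequence is encoded differently in context than it was standalone, `find_subsequence` returns -1, which is exactly the failure mode I'm worried about.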

I would appreciate any guidance, and thank you for reading!

Hello,

I'm hoping to re-raise this question because I haven't made any progress toward figuring out this particular phenomenon.

The HuggingFace guide on tokenizers seems to imply that tokenizers are expected to be consistent, which is not what I am experiencing.


If anyone knows whether the phenomenon pointed out in the initial post is normal or abnormal, please let me know!

Hey! This is because of the special token used by SentencePiece to mark whitespace (or the beginning of a word). You can check here for more about how the SentencePiece algorithm works.

In the case of "token", those are actually two different tokens in string form, and you can see that by converting the ids to tokens. The .decode() method drops the '▁' marker, which is why they print identically.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')
>>> tokenizer.convert_ids_to_tokens(6029)
'ā–token'
>>> tokenizer.convert_ids_to_tokens(5263)
'token'

Thank you for that! I hadn't taken that into consideration, but that makes a lot of sense. I assume that is also part of why the same word may be tokenized differently at different points of a text.
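For future readers, the position-dependence can be simulated in plain Python. This is only a sketch of SentencePiece's whitespace ("metaspace") pre-tokenization step, assuming the default behavior of prepending a dummy space; the real algorithm then segments the marked text using its learned vocabulary:

```python
META = "\u2581"  # '▁', the SentencePiece whitespace marker

def mark_whitespace(text):
    # A dummy leading space is added first, so a word at the start of
    # the text is marked the same way as a word following a space.
    return (" " + text).replace(" ", META)

print(mark_whitespace("tokenize"))              # prints: ▁tokenize
print(mark_whitespace("tokenizer.tokenize()"))  # prints: ▁tokenizer.tokenize()
```

After this step, a word-initial piece like '▁token' and a word-internal piece like 'token' (e.g. the one following the '.') are simply different strings, so they end up as different vocabulary entries with different ids.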

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.