Why do tokenizers sometimes map the same text to different tokens?
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')

short_text = "tokenize"
long_text = "tokenizer.tokenize() vs tokenizer()"

# Encode both texts without special tokens so only the raw pieces show up
encoded_short = tokenizer.encode(short_text, add_special_tokens=False)
encoded_long = tokenizer.encode(long_text, add_special_tokens=False)

print("===== SHORT TEXT =====")
for token_id in encoded_short:
    print(f"{token_id} : '{tokenizer.decode(token_id)}'")

print("===== LONG TEXT =====")
for token_id in encoded_long:
    print(f"{token_id} : '{tokenizer.decode(token_id)}'")
If you run the code I've provided above, you'll see the following result:
What confuses me about this result is that the same word ("tokenize" in this case) is tokenized differently at different points in the text. Specifically, the token "token" seems to have two different token ids.
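To see the raw token strings rather than the decoded text, I also tried the small snippet below (as far as I understand, convert_ids_to_tokens is the standard transformers method for this, so please correct me if I'm using it wrong):

# Print the raw token strings for each encoding to compare them side by side
print(tokenizer.convert_ids_to_tokens(encoded_short))
print(tokenizer.convert_ids_to_tokens(encoded_long))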
Here are some questions I have regarding the above:
- Is it normal that the same token is mapped to multiple token_ids?
- In the case that it IS normal, is there any way to know which token_id the token will be mapped to without having to encode the text?
- In the case that it IS NOT normal, would you recommend that I use a different model?
The problem is that I'm trying to create "labels" for a particular dataset by masking the "input_ids" returned by the tokenizer up to a specific index. The code I'm working on determines that index by searching for a specific sequence of tokens (the token_ids that correspond to a specific part of my input prompt). If the tokenizer I'm using is not consistent in how it encodes the same tokens, this approach will probably fail.
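For context, here is a simplified sketch of what that code does (names like prompt_part, full_example, find_subsequence, and IGNORE_INDEX are placeholders for illustration, not my exact implementation):

def find_subsequence(haystack, needle):
    # Return the index right after the first occurrence of `needle` in `haystack`, or -1 if absent
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i + len(needle)
    return -1

IGNORE_INDEX = -100  # the value PyTorch's CrossEntropyLoss ignores by default

# Hypothetical example strings, just to illustrate the structure of my data
prompt_part = "### Instruction: tokenize this"
full_example = prompt_part + " ### Response: some answer"

input_ids = tokenizer.encode(full_example, add_special_tokens=False)
prompt_ids = tokenizer.encode(prompt_part, add_special_tokens=False)  # the sequence I search for

labels = list(input_ids)
mask_until = find_subsequence(input_ids, prompt_ids)
if mask_until == -1:
    # This is exactly the failure mode I'm worried about: the prompt encoded on its own
    # may not produce the same token ids it gets when encoded as part of the full text
    raise ValueError("Prompt token sequence not found in input_ids")
labels[:mask_until] = [IGNORE_INDEX] * mask_until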
I would appreciate any guidance, and thank you for reading!