I have a problem with emojis “poisoning” tokenizers, and I wonder if there is an existing solution for it. So far, I have not been able to find one through various web searches. I have some ideas for solving it myself, but I’d like to know if somebody has already dealt with this.
The problem is the following: if an emoji is not separated from a word, the whole word is mapped to the UNK token. Example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/distiluse-base-multilingual-cased-v1")
text = "sky"
print(tokenizer(text).input_ids)
text = "sky🙂"
print(tokenizer(text).input_ids)
text = "sky"
print(tokenizer(text).input_ids)
Output:
[101, 62368, 102]
[101, 100, 102]
[101, 100, 102]
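For context, the kind of fix I had in mind is a pre-processing step that inserts spaces around emojis before tokenizing, roughly like the sketch below (the regex ranges are my own rough approximation, not a complete emoji list):

import re
from transformers import AutoTokenizer

# Rough approximation of common emoji code-point ranges; not exhaustive.
EMOJI_PATTERN = re.compile(
    "([\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF])"
)

def detach_emojis(text: str) -> str:
    # Put spaces around every matched emoji so the adjacent word keeps its own token.
    return EMOJI_PATTERN.sub(r" \1 ", text).strip()

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/distiluse-base-multilingual-cased-v1")
print(tokenizer(detach_emojis("sky🙂")).input_ids)
# "sky" now gets its own token; the emoji itself would presumably still map to UNK.

This keeps the word tokenizable, but the emoji information is still lost as UNK unless the tokenizer’s vocabulary is extended, so I’m wondering whether a more established solution already exists.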