I have a problem with emojis “poisoning” tokenizers, and I wonder if there is an existing solution for it. So far, I have not been able to find one through various web searches. I have some ideas for solving it myself, but I’d like to know if somebody has already dealt with this.
The problem is the following: if an emoji is not separated from a word, the whole word is mapped to the UNK token. Example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/distiluse-base-multilingual-cased-v1")
text = "sky"
print(tokenizer(text).input_ids)
text = "sky🙂"
print(tokenizer(text).input_ids)
text = "sky"
print(tokenizer(text).input_ids)
Output:
[101, 62368, 102]
[101, 100, 102]
[101, 100, 102]
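For context, the kind of fix I had in mind is a pre-processing step that inserts spaces around emojis before tokenizing, roughly like the sketch below (the regex ranges are my own rough approximation, not a complete emoji list):

import re
from transformers import AutoTokenizer

# Rough approximation of common emoji code-point ranges; not exhaustive.
EMOJI_PATTERN = re.compile(
    "([\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF])"
)

def detach_emojis(text: str) -> str:
    # Put spaces around every matched emoji so the adjacent word keeps its own token.
    return EMOJI_PATTERN.sub(r" \1 ", text).strip()

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/distiluse-base-multilingual-cased-v1")
print(tokenizer(detach_emojis("sky🙂")).input_ids)
# "sky" now gets its own token; the emoji itself would presumably still map to UNK.

This keeps the word tokenizable, but the emoji information is still lost as UNK unless the tokenizer’s vocabulary is extended, so I’m wondering whether a more established solution already exists.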