Urgent! Unexpected behavior of CLIPTokenizer when encoding out-of-vocabulary / non-English text with openai/clip-vit-base-patch32, and a question about merges.txt


The easiest way to reproduce the behavior is to open this Colab notebook.

Expected behavior

I’m working with a multilingual text dataset and want to tell which sentences are non-English by counting the percentage of unk_tokens after tokenization.

Based on my understanding, tokenizer.encode(string) is equivalent to tokenizer.convert_tokens_to_ids(tokenizer.tokenize(string)) and should map tokens that are not in the vocab to tokenizer.unk_token. Also, spaces are ignored during encoding, and decoding adds spaces back between tokens.
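Under that assumption, the detection scheme would look like the toy sketch below. This is a hypothetical word-level tokenizer written just to illustrate the behavior I expected (map out-of-vocab tokens to unk, then flag sentences by unk ratio); it is not how CLIPTokenizer actually works.

```python
# Toy word-level tokenizer illustrating the *expected* behavior:
# out-of-vocab tokens map to unk_token, so the unk ratio flags non-English text.
class ToyTokenizer:
    def __init__(self, vocab, unk_token="<unk>"):
        self.unk_token = unk_token
        self.vocab = {unk_token: 0, **{w: i + 1 for i, w in enumerate(vocab)}}

    def tokenize(self, text):
        # Spaces only separate tokens and are dropped.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, self.vocab[self.unk_token]) for t in tokens]

    def encode(self, text):
        # encode == convert_tokens_to_ids(tokenize(text))
        return self.convert_tokens_to_ids(self.tokenize(text))


def unk_ratio(tokenizer, text):
    ids = tokenizer.encode(text)
    unk_id = tokenizer.vocab[tokenizer.unk_token]
    return ids.count(unk_id) / max(len(ids), 1)


tok = ToyTokenizer(["a", "photo", "of", "cat"])
print(unk_ratio(tok, "a photo of cat"))    # 0.0 -> looks English
print(unk_ratio(tok, "一 张 猫 的 照片"))   # 1.0 -> flagged as non-English
```

With a tokenizer that behaved like this, the unk ratio would cleanly separate English from non-English sentences.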

However, this is not the case: tokenize() instead splits the text into sub-tokens according to merges.txt (https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/merges.txt) and then maps each sub-token to an id via vocab.json.
Sometimes a single word is split into several sub-tokens this way, and decoding then inserts spaces between them.
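If I understand correctly, the reason no unk_token ever appears is that CLIP's tokenizer is a byte-level BPE: before any merging, every input byte is mapped to a printable unicode symbol, so any UTF-8 string, English or not, can be represented with in-vocab symbols. Below is a minimal sketch of just that byte-to-symbol stage, modeled on the GPT-2-style bytes_to_unicode table (the BPE merge step from merges.txt is omitted):

```python
def bytes_to_unicode():
    # Map each of the 256 possible byte values to a printable unicode
    # character: visible ASCII/latin ranges stay as-is, everything else
    # is shifted up past 256 so no byte is ever "unknown".
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


BYTE_ENCODER = bytes_to_unicode()


def to_byte_symbols(text):
    # Every UTF-8 byte has a symbol, so nothing falls out of the vocab.
    return "".join(BYTE_ENCODER[b] for b in text.encode("utf-8"))


print(to_byte_symbols("猫"))  # three symbols, one per UTF-8 byte of the character
```

So a Chinese or other non-English character is simply turned into its UTF-8 bytes, each byte becomes an in-vocab symbol, and merges.txt then greedily merges those symbols into whatever sub-tokens it can, which is why unk_token never shows up.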

This behavior is problematic for my use case because it treats non-English text exactly like English text: the sentence decodes back to (almost) its original form, but with spaces appearing seemingly at random inside these non-English words.

Thanks very much for your patience and help.