The easiest way to reproduce or see the behavior is to open this Colab notebook.
I’m working with a multilingual text dataset and want to detect which sentences are non-English by measuring the percentage of unk_tokens produced after tokenizing.
Based on my understanding, tokenizer.encode(string) is equivalent to tokenizer.convert_tokens_to_ids(tokenizer.tokenize(string)) and should map tokens that are not in the vocab to tokenizer.unk_token. I also expected spaces to be ignored during encoding, with decoding re-inserting spaces between tokens.
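To make the expectation concrete, here is a toy sketch of the behavior I expected (the vocab, the whitespace tokenizer, and the function names are made up for illustration; this is not the real CLIP tokenizer):

```python
# Toy vocab and tokenizer illustrating the expected contract:
# out-of-vocab tokens map to an unk id, and encode() is just
# convert_tokens_to_ids() composed with tokenize().
VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
UNK = "<unk>"

def tokenize(text):
    # naive whitespace tokenizer standing in for tokenizer.tokenize()
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    # unknown tokens fall back to the unk id
    return [VOCAB.get(t, VOCAB[UNK]) for t in tokens]

def encode(text):
    return convert_tokens_to_ids(tokenize(text))

def unk_ratio(text):
    # the heuristic I want: fraction of unk ids in the encoded sentence
    ids = encode(text)
    return ids.count(VOCAB[UNK]) / len(ids)

print(encode("hello bonjour"))    # -> [1, 0]
print(unk_ratio("hello bonjour")) # -> 0.5
```

Under this contract, a non-English sentence would show a high unk ratio and could be filtered out directly.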
However, this is not the case: tokenize() seems to map them to entries in merges.txt (https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/merges.txt) and then to ids according to vocab.json.
Sometimes this splits the text further into sub-tokens according to merges.txt, and spaces then get inserted between those sub-tokens when decoding.
This behavior is very annoying because it treats non-English and English text the same way: the sentence decodes back to roughly its original form, but with spaces inserted at seemingly arbitrary positions inside the non-English words.
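My current guess at why no unk_token ever appears: CLIP’s tokenizer seems to use GPT-2-style byte-level BPE, where every one of the 256 possible byte values has a printable stand-in symbol, so any input (including non-English text) decomposes into known byte-symbols instead of falling out of the vocab. A self-contained sketch of that byte-to-unicode mapping (adapted from my reading of the GPT-2 scheme; treat it as an assumption, not the exact CLIP source):

```python
# Sketch of a GPT-2-style byte-to-unicode table: every byte value 0..255
# is assigned a printable unicode character, so no byte is ever "unknown".
def bytes_to_unicode():
    # printable bytes keep their own character...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ...and the remaining bytes are shifted into unused code points
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
print(len(byte_encoder))  # 256: full byte coverage, nothing maps to unk

# A non-English word is still fully representable at the byte level:
word = "日本語"
symbols = "".join(byte_encoder[b] for b in word.encode("utf-8"))
print(len(symbols))  # 9 byte-symbols, all valid BPE merge candidates
```

If this is right, the unk-ratio heuristic can never work with this tokenizer, which is exactly the problem I’m running into.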
Thanks very much for your patience and help.