The easiest way to reproduce or see the behavior is to open this Colab notebook.
I’m working with a multilingual text dataset and want to detect which sentences are non-English by measuring the percentage of unk_tokens produced after tokenizing.
Based on my understanding, tokenizer.encode(string) is equivalent to tokenizer.convert_tokens_to_ids(tokenizer.tokenize(string)) and should map tokens that are not in the vocab to tokenizer.unk_token. I also expected spaces to be ignored during encoding, with decoding re-inserting spaces between tokens.
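To make the expectation concrete, here is a toy sketch of the behavior I expected (the vocab, the whitespace tokenizer, and the function names are made up for illustration; this is not the real CLIP tokenizer):

```python
# Toy vocab and tokenizer illustrating the expected contract:
# out-of-vocab tokens map to an unk id, and encode() is just
# convert_tokens_to_ids() composed with tokenize().
VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
UNK = "<unk>"

def tokenize(text):
    # naive whitespace tokenizer standing in for tokenizer.tokenize()
    return text.lower().split()

def convert_tokens_to_ids(tokens):
    # unknown tokens fall back to the unk id
    return [VOCAB.get(t, VOCAB[UNK]) for t in tokens]

def encode(text):
    return convert_tokens_to_ids(tokenize(text))

def unk_ratio(text):
    # the heuristic I want: fraction of unk ids in the encoded sentence
    ids = encode(text)
    return ids.count(VOCAB[UNK]) / len(ids)

print(encode("hello bonjour"))    # -> [1, 0]
print(unk_ratio("hello bonjour")) # -> 0.5
```

Under this contract, a non-English sentence would show a high unk ratio and could be filtered out directly.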
However, this is not the case: tokenize() seems to map them to entries in merges.txt (https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/merges.txt) and then to ids according to vocab.json.
Sometimes this splits the text further into sub-tokens according to merges.txt, and spaces then get inserted between those sub-tokens when decoding.
This behavior is very annoying because it treats non-English and English text the same way: the sentence decodes back to roughly its original form, but with spaces inserted at seemingly arbitrary positions inside the non-English words.
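My current guess at why no unk_token ever appears: CLIP’s tokenizer seems to use GPT-2-style byte-level BPE, where every one of the 256 possible byte values has a printable stand-in symbol, so any input (including non-English text) decomposes into known byte-symbols instead of falling out of the vocab. A self-contained sketch of that byte-to-unicode mapping (adapted from my reading of the GPT-2 scheme; treat it as an assumption, not the exact CLIP source):

```python
# Sketch of a GPT-2-style byte-to-unicode table: every byte value 0..255
# is assigned a printable unicode character, so no byte is ever "unknown".
def bytes_to_unicode():
    # printable bytes keep their own character...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # ...and the remaining bytes are shifted into unused code points
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
print(len(byte_encoder))  # 256: full byte coverage, nothing maps to unk

# A non-English word is still fully representable at the byte level:
word = "日本語"
symbols = "".join(byte_encoder[b] for b in word.encode("utf-8"))
print(len(symbols))  # 9 byte-symbols, all valid BPE merge candidates
```

If this is right, the unk-ratio heuristic can never work with this tokenizer, which is exactly the problem I’m running into.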
Thanks very much for your patience and help.