I found an inconsistency in CLIPTokenizer. How should we fix it?

# only reproducible when ftfy is NOT installed
from transformers import CLIPTokenizer, CLIPTokenizerFast

model_name = 'openai/clip-vit-large-patch14'

tokenizer_s = CLIPTokenizer.from_pretrained(model_name)
tokenizer_f = CLIPTokenizerFast.from_pretrained(model_name)

tokenizer_s("--") # {'input_ids': [49406, 268, 268, 49407], ...}
tokenizer_f("--") # {'input_ids': [49406, 2432, 49407], ...}

tokenizer_s("résumé") # {'input_ids': [49406, 15077, 49407], ...}
tokenizer_f("résumé") # {'input_ids': [49406, 29106, 7054, 4166, 49407], ...}

This behavior happens because CLIPTokenizer falls back to cleaning text with BasicTokenizer when ftfy is not installed. By default, BasicTokenizer strips accents, treats consecutive punctuation marks as separate tokens, and squeezes whitespace, while OpenAI’s implementation only fixes mojibake, normalizes the string to NFC (both done by ftfy), and squeezes whitespace.
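
For illustration, here is a minimal sketch of the two preprocessing paths side by side. It assumes transformers’ BasicTokenizer and the ftfy package are importable; the whitespace squeezing mirrors OpenAI’s cleanup:

import re
import ftfy
from transformers.models.bert.tokenization_bert import BasicTokenizer

text = "résumé --"

# Fallback path of CLIPTokenizer without ftfy: BasicTokenizer lowercases,
# strips accents, and splits runs of punctuation into separate tokens.
basic = BasicTokenizer(do_lower_case=True)
print(" ".join(basic.tokenize(text)))  # resume - -

# OpenAI-style path: fix mojibake and NFC-normalize via ftfy, then only
# squeeze whitespace. The text itself is left intact.
fixed = ftfy.fix_text(text)  # ftfy applies NFC normalization by default
fixed = re.sub(r"\s+", " ", fixed).strip()
print(fixed)  # résumé --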

You can see in the vocabulary that OpenAI’s tokenizer keeps consecutive punctuation marks and accented words as single tokens:

tokenizer_s.get_vocab()
# {
#   ...
#     '--': 2154,
#   ...
#     'rÃ©': 29106, (this is 'ré' in the vocab's byte-level spelling)
#   ...
# }
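
Cross-checking with convert_tokens_to_ids ties the two observations together (</w> is CLIP’s end-of-word marker in the BPE vocabulary):

tokenizer_s.convert_tokens_to_ids('--')      # 2154, the mid-word variant
tokenizer_s.convert_tokens_to_ids('--</w>')  # 2432, the id the fast tokenizer produced above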

The easy fix I thought of first was to simply remove BasicTokenizer’s behavior from CLIPTokenizer. However, I worry that this may hurt the performance of fine-tuned models that were trained with the old tokenizer’s tokens.

So I came up with the idea of adding this behavior as an option, but how? What should the option be named: basictokenizer_behavior? old_tokenizer? do_strip_and_split_punctuations? And what should its default value be?
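
As one possible shape for such an option, here is a hypothetical sketch; the flag name use_basic_tokenizer is purely illustrative, not an existing parameter:

import re
import unicodedata
from transformers.models.bert.tokenization_bert import BasicTokenizer

def clean_text(text: str, use_basic_tokenizer: bool = True) -> str:
    """Hypothetical pre-BPE cleanup, switchable between the two behaviors."""
    if use_basic_tokenizer:
        # Legacy fallback: strip accents and split punctuation, matching
        # what models fine-tuned with the old slow tokenizer have seen.
        return " ".join(BasicTokenizer(do_lower_case=True).tokenize(text))
    # Faithful path: NFC-normalize and squeeze whitespace only (this skips
    # ftfy's mojibake fixing, so it is an approximation of the ftfy path).
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

Defaulting to the legacy behavior would keep existing fine-tuned checkpoints reproducing their training-time token ids, while letting new users opt into the faithful behavior.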
