```python
# only reproducible when ftfy is NOT installed
from transformers import CLIPTokenizer, CLIPTokenizerFast

model_name = 'openai/clip-vit-large-patch14'
tokenizer_s = CLIPTokenizer.from_pretrained(model_name)
tokenizer_f = CLIPTokenizerFast.from_pretrained(model_name)

tokenizer_s("--")      # {'input_ids': [49406, 268, 268, 49407], ...}
tokenizer_f("--")      # {'input_ids': [49406, 2432, 49407], ...}

tokenizer_s("résumé")  # {'input_ids': [49406, 15077, 49407], ...}
tokenizer_f("résumé")  # {'input_ids': [49406, 29106, 7054, 4166, 49407], ...}
```
This happens because `CLIPTokenizer` falls back to `BasicTokenizer` to clean the text when `ftfy` is not installed. By default, `BasicTokenizer` strips accents, treats consecutive punctuation marks as separate tokens, and squeezes whitespace, while OpenAI's implementation only fixes mojibake and normalizes the string to NFC (both done by `ftfy`), then squeezes whitespace.
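
To make the difference concrete, here is a minimal sketch contrasting the two cleanup paths. The `openai_style_clean` helper is my hypothetical approximation of OpenAI's preprocessing, not the actual implementation:

```python
import re

import ftfy
from transformers import BasicTokenizer

# Path 1: BasicTokenizer (what the slow tokenizer falls back to without ftfy).
# It strips accents and splits punctuation into separate tokens.
bt = BasicTokenizer(do_lower_case=True)
print(bt.tokenize("résumé --"))  # ['resume', '-', '-']

# Path 2: hypothetical approximation of OpenAI's cleanup:
# fix mojibake + NFC-normalize (via ftfy), then squeeze whitespace.
def openai_style_clean(text: str) -> str:
    text = ftfy.fix_text(text)  # mojibake fixes, NFC normalization
    return re.sub(r"\s+", " ", text).strip()

print(openai_style_clean("résumé --"))  # 'résumé --' (accents preserved)
```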
You can see that OpenAI's vocabulary includes consecutive punctuation marks and accented subwords as single tokens:
```python
tokenizer_s.get_vocab()
# {
#     ...
#     '--': 2154,
#     ...
#     'rÃ©': 29106,  (this is 'ré')
#     ...
# }
```
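
As a quick sanity check (reusing the tokenizers instantiated above), the ID maps back to that token and the encoded IDs decode back to the original string; note the raw vocab key appears in GPT-2-style byte-level form, which is an assumption on my part about why it prints as `'rÃ©'`:

```python
# Byte-level BPE stores 'ré' as 'rÃ©' in the vocab; it decodes back correctly.
print(tokenizer_f.convert_ids_to_tokens(29106))  # 'rÃ©'
ids = tokenizer_f("résumé")["input_ids"]
print(tokenizer_f.decode(ids, skip_special_tokens=True))  # 'résumé'
```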
The easy fix I first thought of was to simply remove `BasicTokenizer`'s behavior from `CLIPTokenizer`. However, I worry that this may harm the performance of fine-tuned models that were trained on the old tokenizer's tokens. So I came up with the idea of adding this behavior as an option, but how? What should the option be called? `basictokenizer_behavior`? `old_tokenizer`? `do_strip_and_split_punctuations`? And what should its default value be? A rough sketch of one possible shape follows.
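
Here is a minimal, hypothetical sketch of how such an option could be wired in. The name `do_strip_and_split_punctuations` and the default of `True` are assumptions chosen for backward compatibility; this is not the actual `CLIPTokenizer` code:

```python
import re

import ftfy
from transformers import BasicTokenizer

class CleanTextDemo:
    """Hypothetical illustration of the proposed option, not transformers code."""

    def __init__(self, do_strip_and_split_punctuations: bool = True):
        # Default True keeps the legacy behavior, so existing fine-tuned
        # checkpoints tokenize exactly as they did during training.
        self.do_strip_and_split_punctuations = do_strip_and_split_punctuations
        self._basic = BasicTokenizer(do_lower_case=True)

    def clean(self, text: str) -> str:
        if self.do_strip_and_split_punctuations:
            # Legacy path: strip accents, split punctuation, squeeze whitespace.
            return " ".join(self._basic.tokenize(text))
        # New path: mimic OpenAI (ftfy mojibake fix + NFC, squeeze whitespace).
        return re.sub(r"\s+", " ", ftfy.fix_text(text)).strip()

print(CleanTextDemo().clean("résumé --"))                                       # 'resume - -'
print(CleanTextDemo(do_strip_and_split_punctuations=False).clean("résumé --"))  # 'résumé --'
```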