I found an inconsistency in CLIPTokenizer. How should we fix it?

# only reproducible when ftfy is NOT installed
from transformers import CLIPTokenizer, CLIPTokenizerFast

model_name = 'openai/clip-vit-large-patch14'

tokenizer_s = CLIPTokenizer.from_pretrained(model_name)
tokenizer_f = CLIPTokenizerFast.from_pretrained(model_name)

tokenizer_s("--") # {'input_ids': [49406, 268, 268, 49407], ...}
tokenizer_f("--") # {'input_ids': [49406, 2432, 49407], ...}

tokenizer_s("résumé") # {'input_ids': [49406, 15077, 49407], ...}
tokenizer_f("résumé") # {'input_ids': [49406, 29106, 7054, 4166, 49407], ...}

This behavior happens because CLIPTokenizer falls back to cleaning text with BasicTokenizer when ftfy is not installed. By default, BasicTokenizer strips accents, treats consecutive punctuation marks as separate tokens, and squeezes whitespace, while OpenAI’s implementation only fixes mojibake, normalizes the string to NFC (both done by ftfy), and squeezes whitespace.
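
For illustration, here is a minimal sketch of the two preprocessing paths side by side. It assumes transformers’ BasicTokenizer and the ftfy package are importable; the whitespace squeezing mirrors OpenAI’s cleanup:

import re
import ftfy
from transformers.models.bert.tokenization_bert import BasicTokenizer

text = "résumé --"

# Fallback path of CLIPTokenizer without ftfy: BasicTokenizer lowercases,
# strips accents, and splits runs of punctuation into separate tokens.
basic = BasicTokenizer(do_lower_case=True)
print(" ".join(basic.tokenize(text)))  # resume - -

# OpenAI-style path: fix mojibake and NFC-normalize via ftfy, then only
# squeeze whitespace. The text itself is left intact.
fixed = ftfy.fix_text(text)  # ftfy applies NFC normalization by default
fixed = re.sub(r"\s+", " ", fixed).strip()
print(fixed)  # résumé --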

You can see in the vocabulary that OpenAI’s tokenizer keeps consecutive punctuation marks and accented words as single tokens:

tokenizer_s.get_vocab()
# {
#   ...
#     '--': 2154,
#   ...
#     'rÃ©': 29106, (this is 'ré' in the vocab's byte-level spelling)
#   ...
# }
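
Cross-checking with convert_tokens_to_ids ties the two observations together (</w> is CLIP’s end-of-word marker in the BPE vocabulary):

tokenizer_s.convert_tokens_to_ids('--')      # 2154, the mid-word variant
tokenizer_s.convert_tokens_to_ids('--</w>')  # 2432, the id the fast tokenizer produced above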

The easy fix I thought of first was to simply remove BasicTokenizer’s behavior from CLIPTokenizer. However, I worry that this may hurt the performance of fine-tuned models that were trained with the old tokenizer’s tokens.

So I came up with the idea of adding this behavior as an option, but how? What should the option be named: basictokenizer_behavior? old_tokenizer? do_strip_and_split_punctuations? And what should its default value be?
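
As one possible shape for such an option, here is a hypothetical sketch; the flag name use_basic_tokenizer is purely illustrative, not an existing parameter:

import re
import unicodedata
from transformers.models.bert.tokenization_bert import BasicTokenizer

def clean_text(text: str, use_basic_tokenizer: bool = True) -> str:
    """Hypothetical pre-BPE cleanup, switchable between the two behaviors."""
    if use_basic_tokenizer:
        # Legacy fallback: strip accents and split punctuation, matching
        # what models fine-tuned with the old slow tokenizer have seen.
        return " ".join(BasicTokenizer(do_lower_case=True).tokenize(text))
    # Faithful path: NFC-normalize and squeeze whitespace only (this skips
    # ftfy's mojibake fixing, so it is an approximation of the ftfy path).
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

Defaulting to the legacy behavior would keep existing fine-tuned checkpoints reproducing their training-time token ids, while letting new users opt into the faithful behavior.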
