CLIP: The `backend_tokenizer` provided does not match the expected format

I built a tokenizer and trained an LM from scratch following this link.
Then I trained a CLIP model using this tokenizer. The training went fine.

Now, when I load this CLIP model for evaluation, I get this error:
ValueError: The backend_tokenizer provided does not match the expected format. The CLIP tokenizer has been heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using to be compatible with this version. The easiest way to do so is CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True). If you want to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of transformers.

I get the same error when I load the tokenizer directly:
tokenizer = CLIPTokenizerFast.from_pretrained("/home/user/ckpt10k")
tokenizer = CLIPTokenizerFast.from_pretrained("/home/user/ckpt10k", from_slow=True)

Or

tokenizer = CLIPTokenizerFast.from_pretrained("/home/user/ckpt10k", from_slow=False)


The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'CLIPTokenizer'.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2123, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/models/clip/tokenization_clip.py", line 306, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

Can you please suggest the right way to load CLIP?

Hi @prabhatkr - what exactly are the contents of "/home/user/ckpt10k"? It seems that your checkpoint does not contain a vocab file, for instance. Do you have a minimal example of your training script, including the tokenizer operations?
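One quick way to check is to compare the checkpoint's contents against the files the slow CLIPTokenizer expects (vocab.json and merges.txt); a fast tokenizer may ship only a tokenizer.json. The helper below, `missing_slow_files`, is a hypothetical sketch, not part of any library:

```python
import os

def missing_slow_files(ckpt_dir):
    """Report which files the slow CLIPTokenizer needs but the
    checkpoint directory lacks. A fast tokenizer often saves only
    tokenizer.json, which the slow tokenizer cannot read."""
    required = ["vocab.json", "merges.txt"]
    present = set(os.listdir(ckpt_dir))
    return [f for f in required if f not in present]
```

If this returns a non-empty list, that would explain the `TypeError: expected str, bytes or os.PathLike object, not NoneType` when the slow tokenizer tries to open a vocab file that was never saved.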


I took some time to study and explore CLIP. I was using the fast tokenizer, whose checkpoint contained only a tokenizer.json file. I converted it to a slow tokenizer and now it works fine. Thanks @Molbap


This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.