CLIP: The `backend_tokenizer` provided does not match the expected format

prabhatkr · May 7, 2024, 9:16pm

I built a tokenizer and trained an LM from scratch following this link.
Then I trained a CLIP model with this tokenizer. The training went fine.

Now, when I load this CLIP model for evaluation, I get this error:
ValueError: The `backend_tokenizer` provided does not match the expected format. The CLIP tokenizer has been heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using to be compatible with this version.The easiest way to do so is `CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo, from_slow=True)`. If you want to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of transformers.

When I load the tokenizer I get the same error:
tokenizer = CLIPTokenizerFast.from_pretrained("/home/user/ckpt10k")
tokenizer = CLIPTokenizerFast.from_pretrained("/home/user/ckpt10k", from_slow=True)

Or

tokenizer = CLIPTokenizerFast.from_pretrained("/home/user/ckpt10k", from_slow=False)


The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'CLIPTokenizer'.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2123, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniforge-pypy3/envs/clip/lib/python3.12/site-packages/transformers/models/clip/tokenization_clip.py", line 306, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

Can you please suggest the right way to load CLIP?

Molbap · May 9, 2024, 2:23pm

Hi @prabhatkr - what exactly does `/home/user/ckpt10k` contain? The traceback suggests that your checkpoint does not include a vocab file. Do you have a minimal example of your training script, including the tokenizer operations?
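
One quick way to answer that question is to list the folder and check which tokenizer files are present. This is a minimal sketch, assuming the usual transformers file names (`vocab.json` and `merges.txt` for the slow `CLIPTokenizer`, `tokenizer.json` for fast tokenizers). The helper name `inspect_checkpoint` is hypothetical and not part of any library.

```python
from pathlib import Path

# Assumed file names, based on the usual transformers checkpoint layout.
SLOW_FILES = ("vocab.json", "merges.txt")  # read by the slow CLIPTokenizer
FAST_FILE = "tokenizer.json"               # read by fast tokenizers


def inspect_checkpoint(folder):
    """Report which tokenizer files exist in a checkpoint folder."""
    folder = Path(folder)
    present = sorted(p.name for p in folder.iterdir() if p.is_file())
    return {
        "files": present,
        "has_fast_file": FAST_FILE in present,
        "missing_slow_files": [name for name in SLOW_FILES if name not in present],
    }


# Example call (path taken from the question):
# print(inspect_checkpoint("/home/user/ckpt10k"))
```

If `missing_slow_files` lists `vocab.json`, that would be consistent with the `TypeError` above, where `vocab_file` is `None` when the slow tokenizer tries to open it.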

prabhatkr · May 9, 2024, 4:52pm

I took some time to study CLIP. I was using a fast tokenizer with only a tokenizer.json file. I converted it to a slow tokenizer and now it works fine. Thanks @Molbap
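
The post does not say how the conversion was done. One possible approach, sketched here under assumptions, is to extract the slow-format files from `tokenizer.json` directly. The sketch assumes the file holds a BPE model with `model.vocab` and `model.merges` entries, and that merges are stored either as `"a b"` strings or as `[a, b]` pairs. The function name `export_slow_files` is hypothetical. Whether the slow `CLIPTokenizer` then tokenizes exactly like the original depends on the tokenizer having been built in the CLIP style (special tokens, end-of-word suffix, normalization), which this sketch does not check.

```python
import json
from pathlib import Path


def export_slow_files(tokenizer_json, out_dir):
    """Write vocab.json and merges.txt from a BPE tokenizer.json (sketch)."""
    data = json.loads(Path(tokenizer_json).read_text(encoding="utf-8"))
    model = data["model"]
    # Assumption: only BPE models carry both a vocab and a merge list.
    if model.get("type", "BPE") != "BPE":
        raise ValueError("this sketch only handles BPE models")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # vocab.json: the token -> id mapping, copied unchanged.
    (out_dir / "vocab.json").write_text(
        json.dumps(model["vocab"], ensure_ascii=False), encoding="utf-8"
    )

    # merges.txt: a header line, then one "left right" pair per line.
    lines = ["#version: 0.2"]
    for merge in model["merges"]:
        lines.append(merge if isinstance(merge, str) else " ".join(merge))
    (out_dir / "merges.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_dir


# Example call (path taken from the question):
# export_slow_files("/home/user/ckpt10k/tokenizer.json", "/home/user/ckpt10k")
```

With both files in the checkpoint folder, the `from_slow=True` call suggested by the error message should have the slow-format files it needs to read.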

