I’m encountering an issue when trying to load my custom tokenizer from a model repository on the Hugging Face Hub, despite following the documentation for custom tokenizers.
To load the tokenizer, I’m using:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("antoshka1608/wordpiece-tokenizer-v1", use_fast=False, trust_remote_code=True)
When loading the tokenizer, it downloads tokenizer_config.json and vocab.json, but then fails with the error: Tokenizer class BaseTokenizer does not exist or is not currently imported.
Has anyone else encountered this issue or have suggestions on what might be going wrong? Any guidance on troubleshooting this would be greatly appreciated!
Thank you for your help!
That error usually occurs when the transformers library is out of date, but it’s hard to imagine that you’re using a version so old that BaseTokenizer isn’t defined.
pip install -U transformers
There could be some other cause, such as the manual now being out of date.
Sorry for probably misleading you: it’s not a Transformers tokenizer, but my own custom one (I wrote it myself, inheriting from PreTrainedTokenizer).
And yes, I reinstalled transformers many times; it doesn’t work.
After your reply, I thought this might be an issue with name duplication, but no. With a new name for my tokenizer I get the same error (just with the new tokenizer name in the message).
Oh, I see. So the class name is the same as an existing one, and the tokenizer is also a custom one. Have you encountered this bug?
Also, with HF in general, there are cases where information that was true at the time is now outdated in the manual, so I think the easiest thing to do is to refer to the .py or .json files of someone else’s model that is working. It’s tough if there is no similar model…
The next best thing is to refer to the code of the library itself.
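For reference, the pattern that working custom-tokenizer repos usually follow looks roughly like the sketch below. It is only a sketch, not your code: the module name tokenization_custom.py, the class body, the constructor arguments, and the vocab.json format (token → id) are all assumptions; the class and repo names are just reused from this thread. The key part is register_for_auto_class, which makes save_pretrained / push_to_hub write an auto_map entry into tokenizer_config.json; without that entry, AutoTokenizer only sees the bare class name and raises exactly the "Tokenizer class ... does not exist or is not currently imported" error.

```python
# tokenization_custom.py -- hypothetical module that must be shipped inside the
# model repo; trust_remote_code=True imports the custom class from this file.
import json
import os

from transformers import PreTrainedTokenizer


class CustomBaseTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "vocab.json"}

    def __init__(self, vocab_file, unk_token="[UNK]", **kwargs):
        # Load the vocab before calling super().__init__, which may already
        # need token<->id conversions for the special tokens.
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = json.load(f)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        return text.split()  # stand-in for the real WordPiece logic

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, self.vocab.get(str(self.unk_token)))

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, str(self.unk_token))

    def save_vocabulary(self, save_directory, filename_prefix=None):
        prefix = filename_prefix + "-" if filename_prefix else ""
        path = os.path.join(save_directory, prefix + "vocab.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.vocab, f, ensure_ascii=False)
        return (path,)


# Registering the class makes save_pretrained / push_to_hub copy this .py file
# into the repo and write an "auto_map" entry into tokenizer_config.json, e.g.
#   "auto_map": {"AutoTokenizer": ["tokenization_custom.CustomBaseTokenizer", null]}
# which is what AutoTokenizer needs to locate the class on the Hub.
CustomBaseTokenizer.register_for_auto_class("AutoTokenizer")
tokenizer = CustomBaseTokenizer("vocab.json")  # assumes a local vocab.json
tokenizer.push_to_hub("antoshka1608/wordpiece-tokenizer-v1")
```

If your repo’s tokenizer_config.json has a tokenizer_class field but no auto_map, that alone would explain the behaviour you’re seeing.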
Yeah, thank you for your response anyway!!
Actually, if I import my tokenizer class directly from my project directory and call:
tokenizer = CustomBaseTokenizer.from_pretrained(hugging_face_name)
it works fine, but it probably only takes the vocab file and doesn’t use the other files, so we can’t check anything here.
I’ll try your idea about looking at others’ working models later.
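One way to check which repo files that direct call actually reads is to raise the logging verbosity before loading; from_pretrained then logs a "loading file ..." line for each file it resolves. A small sketch, reusing CustomBaseTokenizer and hugging_face_name from the snippet above:

```python
import transformers

# Print "loading file ..." messages for every file from_pretrained resolves,
# so you can see whether tokenizer_config.json etc. are actually being read.
transformers.logging.set_verbosity_info()

tokenizer = CustomBaseTokenizer.from_pretrained(hugging_face_name)
```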
Many of the HF libraries use hard-coded file names, so sometimes they work and sometimes they don’t. If it works locally but not online, the problem is often with the file names, their placement, or the YAML part of README.md (the de facto repository configuration file).
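To rule out naming and placement problems, it’s easy to list exactly what the repo contains (a sketch, assuming the huggingface_hub package is installed):

```python
from huggingface_hub import list_repo_files

# The .py file defining the custom tokenizer class must sit at the repo root,
# next to tokenizer_config.json, for trust_remote_code=True to find it.
print(list_repo_files("antoshka1608/wordpiece-tokenizer-v1"))
```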
Also, in your case, since you’re using a gated model, there’s a chance that the error is occurring because you’re failing to pass the token.
Even when just loading the tokenizer, you need a token to read the repo.
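If that’s the case, passing the token explicitly is the quickest check (a sketch; recent transformers versions take token=, older ones used use_auth_token=):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "antoshka1608/wordpiece-tokenizer-v1",
    use_fast=False,
    trust_remote_code=True,
    token="hf_xxx",  # placeholder for your actual access token
)
```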
If it seems like a bug in the library, you’ll need to find a way to make the error more visible and identify the bug itself. Or you could find the part that’s causing the problem and bypass it.
In this case, I’ll go read the GitHub source…
Yep, going to do a deep dive later.