“OSError: Model name './XX' was not found in tokenizers model name list” - cannot load custom tokenizer in Transformers

tlqnguyen · December 9, 2020, 7:46am

I’m trying to create tokenizer with my own dataset/vocabulary using Sentencepiece and then use it with AlbertTokenizer transformers.

I followed really closely the tutorial on how to train a model from scratch: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=hO5M3vrAhcuj

    # import relevant libraries   
    from pathlib import Path
    from tokenizers import SentencePieceBPETokenizer
    from tokenizers.implementations import SentencePieceBPETokenizer
    from tokenizers.processors import BertProcessing
    from transformers import AlbertTokenizer
    

    paths = [str(x) for x in Path("./data").glob("**/*.txt")]
    
    # Initialize a tokenizer
    tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)
    
    # Customize training
    tokenizer.train(files=paths, 
                    vocab_size=32000, 
                    min_frequency=2, 
                    show_progress=True,
                    special_tokens=['<unk>'],)

    # Saving model
    tokenizer.save_model("Sent-AlBERT")

    tokenizer = SentencePieceBPETokenizer(
        "./Sent-AlBERT/vocab.json",
        "./Sent-AlBERT/merges.txt",)

    tokenizer.enable_truncation(max_length=512)

Everything is fine up until this point when I tried to re-create the tokenizer in transformers

    # Re-create our tokenizer in transformers
        tokenizer = AlbertTokenizer.from_pretrained("./Sent-AlBERT", do_lower_case=True)

This is the error message I kept receiving:

OSError: Model name './Sent-AlBERT' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed './Sent-AlBERT' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

For some reason, it works with RobertaTokenizerFast but not with AlbertTokenzier.

If anyone could give me a suggestion or any sort of direction on how to use Sentencepiece with AlberTokenizer I would really appreciate it.

P.S: I also tried to use ByteLevelBPETokenizer with DistilBertTokenizer, but it couldn’t recognize the tokenizer in the transformer either. I’m not sure what I am missing here.

sgugger · December 9, 2020, 12:50pm

You can’t directly use this tokenizer with a “slow” tokenizer (not backed by rust) there is a conversion step to do (not super versed in it but maybe @thomwolf can chime in?).
It should work with AlbertTokenizerFast (which has more functionality and is faster so it should be a win-win over all!)

tlqnguyen · December 9, 2020, 1:24pm

Thank you so much for your comment but I don’t think there is AlbertTokenizerFast? I didn’t see it in the official documentation: https://huggingface.co/transformers/model_doc/albert.html#alberttokenizer

sgugger · December 9, 2020, 1:56pm

Oh that’s a mistake in the docs. There is definitely one (as you can see on this table).

tlqnguyen · December 9, 2020, 2:33pm

I tried but received the following error: ImportError: cannot import name 'AlbertTokenizerFast'

BramVanroy · December 9, 2020, 2:59pm

That should work if tokenizers is installed. You can also try directly importing from transformers.models.albert.

sgugger · December 9, 2020, 3:19pm

No this should always work since the object is always present in the init (if tokenziers is not installed, an error is raised when you try to actually use that object). If it does not work, it means you don’t have v4 of transformers in your environment.

tlqnguyen · December 9, 2020, 3:47pm

Ah, it is because I was using transformers == 3.3.1
After upgrading to v4 and import AlbertTokenizerFast, I received the following error:

from transformers import AlbertTokenizerFast

# Re-create our tokenizer in transformers
tokenizer = AlbertTokenizerFast.from_pretrained("./Sent-AlBERT")

OSError: Can't load tokenizer for './Sent-AlBERT'. Make sure that:

- './Sent-AlBERT' is a correct model identifier listed on 'https://huggingface.co/models'

- or './Sent-AlBERT' is the correct path to a directory containing relevant tokenizer files

BramVanroy · December 22, 2020, 9:04am

Isn’t AlbertTokenizerFast only available in from transformers if tokenizers is installed? I don’t see where else it is imported in transformers.init.

https://github.com/huggingface/transformers/blob/e977ed2142a022aa969c03836340edcff4f479b2/src/transformers/init.py#L241-L242

@tlqnguyen Make sure the path is correct. You are using a relative path now, so make sure it is relative to your current script. If you can’t get it to work, try an absolute path.

sgugger · December 22, 2020, 2:28pm

There is an else here that imports a dummy object with the same name and will tell you to install tokenizers if not available

This is a script I added so that the init always has the same objects.

BramVanroy · December 22, 2020, 3:45pm

Aha, that makes sense. Thanks for the clarification.

ningmoufubi · January 25, 2021, 7:21am

Is your question solved？What is the solution to this problem? The reason why the pre-training model failed to load have nothing to do with the Transformers version. I also hive this question, when i use pre-training model ,named ’ uer/chinese_roberta_L-8_H-512’. my Transformers version is 3.4.0.

tlqnguyen · January 25, 2021, 8:41am

Actually, I think it might have to do with the Transformers version. As I upgraded Transformers to 4.0.0, it worked and I was able to load the tokenizer

Sumit6597 · March 27, 2022, 7:33pm

I am trying to use the below pre trained model but its giving the same error : (OSError: Can’t load tokenizer for ‘copenlu/citebert-cite-only’. If you were trying to load it from ‘Models - Hugging Face’, make sure you don’t have a local directory with the same name. Otherwise, make sure ‘copenlu/citebert-cite-only’ is the correct path to a directory containing all relevant files for a BertTokenizerFast tokenizer.) My transformer version is 4.17.0. Please let me know.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(“copenlu/citebert-cite-only”)

model = AutoModel.from_pretrained(“copenlu/citebert-cite-only”)

Unspoiled-Egg · April 25, 2023, 12:18pm

Is your question solved?

Topic		Replies	Views
Error with new tokenizers (URGENT!) 🤗Tokenizers	16	51291	July 22, 2024
OSError: Model name 'gpt2' was not found in tokenizers model name list (gpt2,...) 🤗Tokenizers	8	7414	August 10, 2023
How to change default tokenizer name in Use in Transformers Model cards	0	2253	May 4, 2022
Cant load deberta tokenizer Beginners	0	678	March 27, 2021
Couldn't instantiate the backend tokenizer 🤗Tokenizers	0	2298	December 7, 2020

“OSError: Model name './XX' was not found in tokenizers model name list” - cannot load custom tokenizer in Transformers

Related topics