“OSError: Model name './XX' was not found in tokenizers model name list” - cannot load custom tokenizer in Transformers

I’m trying to train a tokenizer on my own dataset/vocabulary using SentencePiece and then use it with AlbertTokenizer from transformers.

I followed really closely the tutorial on how to train a model from scratch: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=hO5M3vrAhcuj

    # Import relevant libraries
    from pathlib import Path
    from tokenizers import SentencePieceBPETokenizer
    from tokenizers.processors import BertProcessing
    from transformers import AlbertTokenizer
    

    paths = [str(x) for x in Path("./data").glob("**/*.txt")]
    
    # Initialize a tokenizer
    tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)
    
    # Customize training
    tokenizer.train(files=paths, 
                    vocab_size=32000, 
                    min_frequency=2, 
                    show_progress=True,
                    special_tokens=['<unk>'],)

    # Saving model
    tokenizer.save_model("Sent-AlBERT")

    tokenizer = SentencePieceBPETokenizer(
        "./Sent-AlBERT/vocab.json",
        "./Sent-AlBERT/merges.txt",)

    tokenizer.enable_truncation(max_length=512)

Everything works up to this point. The problem appears when I try to re-create the tokenizer in transformers:

    # Re-create our tokenizer in transformers
    tokenizer = AlbertTokenizer.from_pretrained("./Sent-AlBERT", do_lower_case=True)

This is the error message I kept receiving:

OSError: Model name './Sent-AlBERT' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed './Sent-AlBERT' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

For some reason, it works with RobertaTokenizerFast but not with AlbertTokenizer.

If anyone could give me a suggestion or any direction on how to use SentencePiece with AlbertTokenizer, I would really appreciate it.

P.S.: I also tried to use ByteLevelBPETokenizer with DistilBertTokenizer, but transformers couldn’t load that tokenizer either. I’m not sure what I am missing here.

You can’t directly use this tokenizer with a “slow” tokenizer (one not backed by Rust); there is a conversion step to do first (I’m not super versed in it, but maybe @thomwolf can chime in?).
It should work with AlbertTokenizerFast, which has more functionality and is faster, so it should be a win-win overall!
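A quick way to see the mismatch, as a minimal sketch using only the standard library: the OSError in the question says the slow AlbertTokenizer looks for a file named spiece.model, while the training snippet reloads vocab.json and merges.txt from the same folder. The helper name `missing_files` is made up for illustration; the file names are taken from the error message and the question.

```python
from pathlib import Path


def missing_files(directory, expected_names):
    """Return the expected file names that are absent from `directory`."""
    folder = Path(directory)
    return [name for name in expected_names if not (folder / name).is_file()]


# 'spiece.model' is the file name quoted in the OSError from the question.
print("missing for slow AlbertTokenizer:", missing_files("./Sent-AlBERT", ["spiece.model"]))

# vocab.json and merges.txt are the files the question reloads after save_model().
print("missing from save_model output:  ", missing_files("./Sent-AlBERT", ["vocab.json", "merges.txt"]))
```

If the first list is non-empty and the second is empty, the folder holds a valid trained tokenizer, just not in the file layout the slow class searches for.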

Thank you so much for your comment, but I don’t think there is an AlbertTokenizerFast? I didn’t see it in the official documentation: https://huggingface.co/transformers/model_doc/albert.html#alberttokenizer

Oh that’s a mistake in the docs. There is definitely one (as you can see on this table).

I tried but received the following error: ImportError: cannot import name 'AlbertTokenizerFast'

That should work if tokenizers is installed. You can also try directly importing from transformers.models.albert.

No, this should always work, since the object is always present in the init (if tokenizers is not installed, an error is raised when you try to actually use that object). If it does not work, it means you don’t have v4 of transformers in your environment.
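To check which versions are actually installed in the active environment, something like the following sketch may help. It uses only `importlib.metadata` from the standard library (Python 3.8+); the helper names `major_version` and `installed_major` are invented for this example.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '4.0.0'."""
    return int(version_string.split(".")[0])


def installed_major(package):
    """Return the installed major version of `package`, or None if it is absent."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


# Report what the current interpreter can see; None means "not installed here".
for name in ("transformers", "tokenizers"):
    print(name, "major version:", installed_major(name))
```

Running this with the same interpreter that runs the failing script matters, since a notebook kernel and a shell can point at different environments.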

Ah, it is because I was using transformers==3.3.1.
After upgrading to v4 and importing AlbertTokenizerFast, I received the following error:

    from transformers import AlbertTokenizerFast

    # Re-create our tokenizer in transformers
    tokenizer = AlbertTokenizerFast.from_pretrained("./Sent-AlBERT")

OSError: Can't load tokenizer for './Sent-AlBERT'. Make sure that:

- './Sent-AlBERT' is a correct model identifier listed on 'https://huggingface.co/models'

- or './Sent-AlBERT' is the correct path to a directory containing relevant tokenizer files

Isn’t AlbertTokenizerFast only importable from transformers if tokenizers is installed? I don’t see where else it is imported in transformers/__init__.py.

https://github.com/huggingface/transformers/blob/e977ed2142a022aa969c03836340edcff4f479b2/src/transformers/__init__.py#L241-L242

@tlqnguyen Make sure the path is correct. You are using a relative path now, so make sure it resolves correctly from the directory you run your script in. If you can’t get it to work, try an absolute path.
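One way to rule the path out, sketched with the standard library only: resolve the string the same way Python does (against the current working directory) and list what is in the folder. The function name `describe_tokenizer_dir` is hypothetical.

```python
from pathlib import Path


def describe_tokenizer_dir(path_string):
    """Resolve a path against the current working directory and list its files."""
    resolved = Path(path_string).resolve()
    is_dir = resolved.is_dir()
    files = sorted(p.name for p in resolved.iterdir()) if is_dir else []
    return resolved, is_dir, files


resolved, is_dir, files = describe_tokenizer_dir("./Sent-AlBERT")
print("working directory:", Path.cwd())
print("resolved path:    ", resolved)
print("is a directory:   ", is_dir)
print("files found:      ", files)
```

If the resolved path is not the folder you expected, passing the printed absolute path to from_pretrained is a simple way to test the suggestion above.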

There is an else here that imports a dummy object with the same name, which will tell you to install tokenizers if it is not available :)

This is a script I added so that the init always has the same objects.
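For readers unfamiliar with the pattern, here is a rough illustration of the idea, not the actual transformers code: if an optional backend cannot be imported, a placeholder class with the same name is exported instead, and it only complains when someone tries to use it. The function `load_or_placeholder` and the module name `backend_that_is_not_installed` are invented for this sketch.

```python
import importlib


def load_or_placeholder(module_name, attribute):
    """Return `module_name.attribute`, or a same-named placeholder if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        class Placeholder:
            def __init__(self, *args, **kwargs):
                # Fail only on use, with a message that names the missing backend.
                raise ImportError(
                    f"{attribute} requires the '{module_name}' library; please install it."
                )

        Placeholder.__name__ = attribute
        return Placeholder
    return getattr(module, attribute)


# Both names exist after these lines, so importing them elsewhere never fails.
RealClass = load_or_placeholder("collections", "OrderedDict")
FastTokenizer = load_or_placeholder("backend_that_is_not_installed", "FastTokenizer")

try:
    FastTokenizer()
except ImportError as err:
    print("raised on use:", err)
```

With this design the package namespace looks the same whether or not the optional dependency is present, which is the point made above about the init always having the same objects.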

Aha, that makes sense. Thanks for the clarification.

Is your question solved? What was the solution to this problem? The reason the pretrained model failed to load has nothing to do with the Transformers version. I have the same problem when I use the pretrained model named 'uer/chinese_roberta_L-8_H-512'. My Transformers version is 3.4.0.

Actually, I think it might have to do with the Transformers version. After I upgraded Transformers to 4.0.0, it worked and I was able to load the tokenizer :)