How to save my tokenizer using save_pretrained?

I have just followed this tutorial on how to train my own tokenizer.

After training my tokenizer, I wrapped it in a Transformers object so that I can use it with the transformers library:

from transformers import BertTokenizerFast

new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

Then, I try to save my tokenizer using this code:

tokenizer.save_pretrained('/content/drive/MyDrive/Tokenzier')

However, executing the code above gives this error:

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'save_pretrained'

Am I saving the tokenizer wrong?

If so, what is the correct approach to save it to my local files, so I can use it later?

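You are saving the wrong tokenizer ;-). tokenizer is the raw tokenizers.Tokenizer object, which has no save_pretrained method. The wrapped object does, so new_tokenizer.save_pretrained(xxx) should work.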
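Here is a minimal sketch of the full round trip: train, wrap, save, reload. The toy corpus, the WordPiece model with BERT-style normalizer and pre-tokenizer, the vocabulary size, and the temporary directory (standing in for the Drive path) are all illustrative assumptions, not taken from the tutorial.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
from transformers import BertTokenizerFast

# Toy corpus, purely illustrative (assumption: your real data goes here).
corpus = ["hello world", "saving a tokenizer", "hello tokenizer"]

# Train a small WordPiece tokenizer with the tokenizers library.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
trainer = trainers.WordPieceTrainer(
    vocab_size=100,  # assumed value, far too small for real use
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap the raw tokenizers.Tokenizer in a Transformers tokenizer.
new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

# save_pretrained lives on the wrapper, not on the raw Tokenizer.
save_dir = tempfile.mkdtemp()  # stand-in for your own directory
new_tokenizer.save_pretrained(save_dir)
print(sorted(os.listdir(save_dir)))

# Later, load the tokenizer back from the same directory.
reloaded = BertTokenizerFast.from_pretrained(save_dir)
print(reloaded.tokenize("hello world"))
```

The raw tokenizer object also has its own save method that writes a single JSON file, but save_pretrained on the wrapper is what produces a directory that from_pretrained can load.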

Thank you very much for that! And, one more thing… When I want to use my tokenizer for masked language modelling, do I use the pretrained model notebook?

I’m not sure which notebook you are referencing. If you want to train a language model from scratch on masked language modeling, it’s in this notebook.

I see, I will take a look at that. So, after training my tokenizer, how do I use it for a masked language modelling task?

@sgugger Do I replace the following with the path where I saved my trained tokenizer?

model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"