"How to train a new language model from scratch using Transformers and Tokenizers" possibly requiring an update

Hello there

I was trying to run the code in the Colab notebook provided in the tutorial "How to train a new language model from scratch using Transformers and Tokenizers".

I ran into a problem. When I reached the snippet
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
I consistently got the error
file ./EsperBERTo/config.json not found

So I did some research and found out that this line of code is possibly outdated:
tokenizer.save_model("EsperBERTo")
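For what it's worth, here is a minimal sketch of what I observed, assuming a recent release of the tokenizers library; the tiny in-memory corpus and the temporary directory are just stand-ins for the tutorial's Esperanto data and its EsperBERTo folder:

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Train a small byte-level BPE tokenizer on a throwaway corpus
# (stand-in for the tutorial's Esperanto OSCAR data).
bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    ["saluton mondo kiel vi fartas"] * 50,
    vocab_size=300,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

out_dir = tempfile.mkdtemp(prefix="EsperBERTo-")
saved = bpe.save_model(out_dir)

# save_model writes only the raw vocab/merges files...
print(sorted(os.path.basename(p) for p in saved))  # ['merges.txt', 'vocab.json']

# ...so no config.json (or tokenizer_config.json) ends up in the
# directory, which matches the missing-config error I was seeing.
print("config.json" in os.listdir(out_dir))  # False
```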

I then changed the code to
tokenizer.save_pretrained("EsperBERTo")
since, according to the documentation, the save_pretrained method lets you specify the save directory:

Directory where the configuration JSON file will be saved (will be created if it does not exist).

This made the previous error disappear.
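For reference, this is roughly the flow that worked for me, sketched with a throwaway corpus; the wrapping step via the tokenizer_object argument and the temporary directory are my assumptions based on the current transformers and tokenizers APIs, not the tutorial's exact code:

```python
import tempfile

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

# Train a small byte-level BPE tokenizer (stand-in for the tutorial's
# Esperanto OSCAR corpus).
bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    ["saluton mondo kiel vi fartas"] * 50,
    vocab_size=300,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

out_dir = tempfile.mkdtemp(prefix="EsperBERTo-")

# Wrap the underlying tokenizers.Tokenizer (a private attribute, since
# the implementation wrapper has no public accessor) in a transformers
# fast tokenizer. Unlike save_model, save_pretrained also writes
# tokenizer.json, tokenizer_config.json and special_tokens_map.json,
# which from_pretrained can load back.
wrapped = RobertaTokenizerFast(
    tokenizer_object=bpe._tokenizer,
    model_max_length=512,  # current name for the old max_len argument
)
wrapped.save_pretrained(out_dir)

# Reloading now succeeds without the missing-config error.
tokenizer = RobertaTokenizerFast.from_pretrained(out_dir)
```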

I guess the tutorial may require an update.


Yes, we have a new version of that blog post coming soon!
