"How to train a new language model from scratch using Transformers and Tokenizers" possibly requiring an update

Hello there

I was trying to run the code in the Colab notebook provided in the tutorial "How to train a new language model from scratch using Transformers and Tokenizers".

I ran into a problem. When I reached the snippet
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
I consistently got the error
file ./EsperBERTo/config.json not found

So I did some research and found out that this line of code is possibly outdated:
tokenizer.save_model("EsperBERTo")
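For what it's worth, here is a minimal sketch of what I observed, assuming a recent release of the tokenizers library; the tiny in-memory corpus and the temporary directory are just stand-ins for the tutorial's Esperanto data and its EsperBERTo folder:

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Train a small byte-level BPE tokenizer on a throwaway corpus
# (stand-in for the tutorial's Esperanto OSCAR data).
bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    ["saluton mondo kiel vi fartas"] * 50,
    vocab_size=300,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

out_dir = tempfile.mkdtemp(prefix="EsperBERTo-")
saved = bpe.save_model(out_dir)

# save_model writes only the raw vocab/merges files...
print(sorted(os.path.basename(p) for p in saved))  # ['merges.txt', 'vocab.json']

# ...so no config.json (or tokenizer_config.json) ends up in the
# directory, which matches the missing-config error I was seeing.
print("config.json" in os.listdir(out_dir))  # False
```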

I then changed the code to
tokenizer.save_pretrained("EsperBERTo")
since, according to the documentation, the save_pretrained method lets you specify the save directory:

Directory where the configuration JSON file will be saved (will be created if it does not exist).

This made the previous error disappear.
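For reference, this is roughly the flow that worked for me, sketched with a throwaway corpus; the wrapping step via the tokenizer_object argument and the temporary directory are my assumptions based on the current transformers and tokenizers APIs, not the tutorial's exact code:

```python
import tempfile

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast

# Train a small byte-level BPE tokenizer (stand-in for the tutorial's
# Esperanto OSCAR corpus).
bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    ["saluton mondo kiel vi fartas"] * 50,
    vocab_size=300,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

out_dir = tempfile.mkdtemp(prefix="EsperBERTo-")

# Wrap the underlying tokenizers.Tokenizer (a private attribute, since
# the implementation wrapper has no public accessor) in a transformers
# fast tokenizer. Unlike save_model, save_pretrained also writes
# tokenizer.json, tokenizer_config.json and special_tokens_map.json,
# which from_pretrained can load back.
wrapped = RobertaTokenizerFast(
    tokenizer_object=bpe._tokenizer,
    model_max_length=512,  # current name for the old max_len argument
)
wrapped.save_pretrained(out_dir)

# Reloading now succeeds without the missing-config error.
tokenizer = RobertaTokenizerFast.from_pretrained(out_dir)
```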

I guess the tutorial may require an update.


Yes, we have a new version of that blog post coming soon!
