I am fine-tuning a pretrained DistilBERT model for sequence classification, using a pretrained tokenizer.
I currently save the model like this:
> model.save_pretrained(dir)
> tokenizer.save_pretrained(dir)
And load like this:
> model = DistilBertForSequenceClassification.from_pretrained(dir)
> tokenizer = DistilBertTokenizer.from_pretrained(dir)
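For context, here is the full round trip as a self-contained sketch (a minimal version of my setup; the directory path is illustrative):
> from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
>
> save_dir = './distilbert-finetuned'  # illustrative path
>
> model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')
> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
>
> # ... fine-tune the model ...
>
> # save the model weights/config and the tokenizer files into the same directory
> model.save_pretrained(save_dir)
> tokenizer.save_pretrained(save_dir)
>
> # later: reload both from that directory
> model = DistilBertForSequenceClassification.from_pretrained(save_dir)
> tokenizer = DistilBertTokenizer.from_pretrained(save_dir)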
Weirdly, this produces noticeably worse results (accuracy drops by over 10%) because the tokenizer has somehow changed. This, on the other hand, works much better:
> model = DistilBertForSequenceClassification.from_pretrained(dir)
> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
There must be something simple I'm missing here. Why doesn't it work properly if I just load the tokenizer directly from dir?
PROGRESS
Upon further investigation I noticed that
> DistilBertTokenizer.from_pretrained(dir)
defaults to
> do_lower_case=True
Manually setting this to False fixes my issue. However, this doesn't make sense to me, as the tokenizer had do_lower_case=False before saving.
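For reference, here is the explicit override that restores the original behaviour (a sketch, assuming the same save directory as above):
> # force the cased behaviour when reloading from the save directory
> tokenizer = DistilBertTokenizer.from_pretrained(save_dir, do_lower_case=False)
>
> print(tokenizer.tokenize('Paris'))  # keeps the capital P instead of lower-casing it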
Is there any way to save the tokenizer so that this option is preserved?