Simple Save/Load of tokenizer not working

I am fine-tuning a pretrained DistilBert model for sequence classification, together with its pretrained tokenizer.

I currently save the model like this:

> model.save_pretrained(dir)
> tokenizer.save_pretrained(dir)

And load like this:

> model = DistilBertForSequenceClassification.from_pretrained(dir)
> tokenizer = DistilBertTokenizer.from_pretrained(dir)

Weirdly, this produces results that are over 10% worse, because the tokenizer has somehow changed. This works much better instead:

> model = DistilBertForSequenceClassification.from_pretrained(dir)
> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

There must be something simple I'm missing here. Why won't it work properly if I just load the tokenizer directly from `dir`?

PROGRESS

Upon further investigation I noticed that

> tokenizer.from_pretrained(dir)

defaults to

> do_lower_case=True

Manually setting `do_lower_case=False` fixes my issue. However, this doesn't make sense to me, as the tokenizer had `do_lower_case=False` before saving. Is there any way to save it so that this option is preserved?
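As a stopgap, one workaround is to write the flag into `tokenizer_config.json` in the save directory yourself, since `from_pretrained` reads its init kwargs from that file. A minimal sketch using only the standard library (the directory is a hypothetical stand-in for `dir` above):

```python
import json
import os
import tempfile

# Hypothetical save directory standing in for `dir` in the post.
save_dir = tempfile.mkdtemp()

# from_pretrained() reads init kwargs from tokenizer_config.json, so
# writing do_lower_case there makes the reload pick up the right value
# instead of falling back to the default (True).
config_path = os.path.join(save_dir, "tokenizer_config.json")
with open(config_path, "w") as f:
    json.dump({"do_lower_case": False}, f)

# On reload, the tokenizer reads this file and honors the flag.
with open(config_path) as f:
    print(json.load(f)["do_lower_case"])  # → False
```

This is essentially what the later library fix automates: persisting the tokenizer's init kwargs at save time so they round-trip through `save_pretrained`/`from_pretrained`.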


Yes, this was fixed in https://github.com/huggingface/transformers/pull/8006 and the fix will be in the next release :slight_smile:


Great, thanks!