I am fine-tuning a pretrained DistilBERT model for sequence classification, using a pretrained tokenizer.
I currently save the model like this:
> model.save_pretrained(dir)
> tokenizer.save_pretrained(dir)
And load like this:
> model = DistilBertForSequenceClassification.from_pretrained(dir)
> tokenizer = DistilBertTokenizer.from_pretrained(dir)
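For context, here is the full round trip as a self-contained sketch (a minimal version of my setup; the directory path is illustrative):
> from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
>
> save_dir = './distilbert-finetuned'  # illustrative path
>
> model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')
> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
>
> # ... fine-tune the model ...
>
> # save the model weights/config and the tokenizer files into the same directory
> model.save_pretrained(save_dir)
> tokenizer.save_pretrained(save_dir)
>
> # later: reload both from that directory
> model = DistilBertForSequenceClassification.from_pretrained(save_dir)
> tokenizer = DistilBertTokenizer.from_pretrained(save_dir)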
Weirdly, this produces noticeably worse results (accuracy drops by over 10%) because the tokenizer has somehow changed. This, on the other hand, works much better:
> model = DistilBertForSequenceClassification.from_pretrained(dir)
> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
There must be something simple I'm missing here. Why doesn't it work properly if I just load the tokenizer directly from dir?
PROGRESS
Upon further investigation I noticed that
> DistilBertTokenizer.from_pretrained(dir)
defaults to
> do_lower_case=True
Manually setting this to False fixes my issue. However, this doesn't make sense to me, as the tokenizer had do_lower_case=False before saving.
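For reference, here is the explicit override that restores the original behaviour (a sketch, assuming the same save directory as above):
> # force the cased behaviour when reloading from the save directory
> tokenizer = DistilBertTokenizer.from_pretrained(save_dir, do_lower_case=False)
>
> print(tokenizer.tokenize('Paris'))  # keeps the capital P instead of lower-casing it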
Is there any way to save the tokenizer so that this option is preserved?