Missing keys in RobertaForMaskedLM state dict


I am training my model that is based on RobertaForMaskedLM and after saving it I get a pytorch_model.bin file from which I want to retrieve certain layer parameters.

After I do
state_dict = torch.load("path_to_pytorch_model.bin", map_location="cpu")
and inspect state_dict.keys(), I observe that two parameters are missing: lm_head.decoder.weight and lm_head.decoder.bias.

From what I understand, lm_head.decoder.weight = roberta.embeddings.word_embeddings.weight, which is due to the tied weights.
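As a minimal sketch of what this tying means in plain PyTorch (a toy embedding/decoder pair, not the actual RoBERTa modules): assigning one Parameter to both modules makes the two state_dict entries point at the same storage, so saving logic can safely drop one of the keys.

```python
import torch
import torch.nn as nn

# Toy illustration of weight tying (not the real RoBERTa code):
# the decoder projects hidden states back to the vocabulary,
# reusing the embedding matrix as its weight.
emb = nn.Embedding(10, 4)      # vocab_size=10, hidden_size=4
decoder = nn.Linear(4, 10)
decoder.weight = emb.weight    # tie: both names now share one Parameter

# Both keys still exist locally, but they reference the same storage,
# so only one copy actually needs to be serialized.
print(decoder.weight.data_ptr() == emb.weight.data_ptr())  # True
```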

What I have not figured out is where the lm_head.decoder.bias parameters are stored. Are these unused parameters?

Thanks for your help!

Those are not saved since they are tied weights. You should use strict=False when loading the state dict, then retie the weights with roberta.tie_weights() (or just use from_pretrained, which will do all of that for you).
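A small self-contained sketch of that load-then-retie pattern, using a toy module in place of the real model (TinyMLMHead and its tie_weights method are made up for illustration; they only mimic the idea):

```python
import torch
import torch.nn as nn

class TinyMLMHead(nn.Module):
    """Toy stand-in for a model with a tied decoder weight."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.decoder = nn.Linear(4, 10)
        self.tie_weights()

    def tie_weights(self):
        # same idea as roberta.tie_weights(): share the embedding matrix
        self.decoder.weight = self.emb.weight

# Simulate a checkpoint that was saved without the tied key
saved = TinyMLMHead()
state_dict = {k: v for k, v in saved.state_dict().items()
              if k != "decoder.weight"}

model = TinyMLMHead()
result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys)   # ['decoder.weight']
model.tie_weights()          # retie after loading
assert model.decoder.weight.data_ptr() == model.emb.weight.data_ptr()
```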

Thank you for your reply. The two ways you describe indeed give me the same results:
calling from_pretrained,
or calling load_state_dict with strict=False followed by tie_weights().
In both cases the lm_head.decoder parameters are the same, both in terms of weight (same as embeddings.word_embeddings.weight) and bias.

What still puzzles me is what the bias parameter is tied to (or where it is stored, since it is not in the state_dict keys); I could not figure this out from the source code.

If, for example, I set lm_head.decoder.bias to zeros, the model output is no longer the same.
And if I check gradients during the backward pass, the param.grad of lm_head.decoder.bias is neither None nor zero, which would mean it is also trained.
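That gradient check can be reproduced on any bias parameter; here is a toy sketch with a standalone nn.Linear standing in for lm_head.decoder (not the actual model):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 10)               # toy stand-in for lm_head.decoder
out = layer(torch.randn(2, 4)).sum()
out.backward()

# A parameter that receives a real (non-None, nonzero) gradient
# is indeed updated during training.
print(layer.bias.grad is None)                 # False
print(bool(torch.all(layer.bias.grad == 0)))   # False
```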

Going through all parameters with torch.allclose against lm_head.decoder.weight and lm_head.decoder.bias seems to have given me the answer:

torch.allclose(roberta.embeddings.word_embeddings.weight, lm_head.decoder.weight) = True
torch.allclose(lm_head.bias, lm_head.decoder.bias) = True

So it seems that lm_head.decoder.bias and lm_head.bias are tied … is that right?
Do you know why this bias parameter is replicated?

This is a very hacky way of ensuring lm_head.bias gets resized when we add new tokens, since only lm_head.decoder gets passed to that function, coupled with some technical debt from having created that weight in the first place instead of just relying on lm_head.decoder. Not the nicest part of the library :sweat_smile:

Ahah, sorry for digging this up!
But I needed to clarify that, so thank you very much for helping.
