Missing keys in RobertaForMaskedLM state dict


I am training my model that is based on RobertaForMaskedLM and after saving it I get a pytorch_model.bin file from which I want to retrieve certain layer parameters.

After I do
state_dict = torch.load("path_to_pytorch_model.bin", map_location="cpu")
and inspect state_dict.keys(), I observe that two parameters are missing: lm_head.decoder.weight and lm_head.decoder.bias.

From what I understand, lm_head.decoder.weight = roberta.embeddings.word_embeddings.weight, which is due to the tied weights.
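As a minimal sketch of what this tying means in plain PyTorch (a toy embedding/decoder pair, not the actual RoBERTa modules): assigning one Parameter to both modules makes the two state_dict entries point at the same storage, so saving logic can safely drop one of the keys.

```python
import torch
import torch.nn as nn

# Toy illustration of weight tying (not the real RoBERTa code):
# the decoder projects hidden states back to the vocabulary,
# reusing the embedding matrix as its weight.
emb = nn.Embedding(10, 4)      # vocab_size=10, hidden_size=4
decoder = nn.Linear(4, 10)
decoder.weight = emb.weight    # tie: both names now share one Parameter

# Both keys still exist locally, but they reference the same storage,
# so only one copy actually needs to be serialized.
print(decoder.weight.data_ptr() == emb.weight.data_ptr())  # True
```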

What I have not figured out is where the lm_head.decoder.bias parameters are stored. Are these unused parameters?

Thanks for your help!

Those are not saved since they are tied weights. You should use strict=False when loading the state dict, then retie the weights with roberta.tie_weights() (or just use from_pretrained, which will do all of that for you).
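A small self-contained sketch of that load-then-retie pattern, using a toy module in place of the real model (TinyMLMHead and its tie_weights method are made up for illustration; they only mimic the idea):

```python
import torch
import torch.nn as nn

class TinyMLMHead(nn.Module):
    """Toy stand-in for a model with a tied decoder weight."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.decoder = nn.Linear(4, 10)
        self.tie_weights()

    def tie_weights(self):
        # same idea as roberta.tie_weights(): share the embedding matrix
        self.decoder.weight = self.emb.weight

# Simulate a checkpoint that was saved without the tied key
saved = TinyMLMHead()
state_dict = {k: v for k, v in saved.state_dict().items()
              if k != "decoder.weight"}

model = TinyMLMHead()
result = model.load_state_dict(state_dict, strict=False)
print(result.missing_keys)   # ['decoder.weight']
model.tie_weights()          # retie after loading
assert model.decoder.weight.data_ptr() == model.emb.weight.data_ptr()
```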

Thank you for your reply. The two ways you describe indeed give me the same results:
calling from_pretrained,
or calling load_state_dict with strict=False followed by tie_weights().
In both cases the lm_head.decoder parameters are the same, both in terms of weight (same as embeddings.word_embeddings.weight) and bias.

What still puzzles me is what the bias parameter is tied to (or where it is stored, since it is not in the state_dict keys); I could not figure this out from the source code.

If, for example, I set lm_head.decoder.bias to zeros, the model output is no longer the same.
And if I check gradients during the backward pass, the param.grad of lm_head.decoder.bias is neither None nor zero, which would mean it is also trained.
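That gradient check can be reproduced on any bias parameter; here is a toy sketch with a standalone nn.Linear standing in for lm_head.decoder (not the actual model):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 10)               # toy stand-in for lm_head.decoder
out = layer(torch.randn(2, 4)).sum()
out.backward()

# A parameter that receives a real (non-None, nonzero) gradient
# is indeed updated during training.
print(layer.bias.grad is None)                 # False
print(bool(torch.all(layer.bias.grad == 0)))   # False
```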

Going through all parameters with torch.allclose against lm_head.decoder.weight and lm_head.decoder.bias seems to have given me the answer:

torch.allclose(roberta.embeddings.word_embeddings.weight, lm_head.decoder.weight) = True
torch.allclose(lm_head.bias, lm_head.decoder.bias) = True

So it seems that lm_head.decoder.bias and lm_head.bias are tied … is that right?
Do you know why this bias parameter is replicated?

This is a very hacky way of ensuring lm_head.bias gets resized when we add new tokens, since only lm_head.decoder gets passed to that function, coupled with some technical debt from having created that weight in the first place instead of just relying on lm_head.decoder. Not the nicest part of the library :sweat_smile:

Ahah, sorry for digging this up!
But I needed to clarify that, so thank you very much for helping.
