Hello,

I am training a model based on RobertaForMaskedLM, and after saving it I get a pytorch_model.bin file from which I want to retrieve certain layer parameters.

After I do

`state_dict = torch.load(path_to_pytorch_model.bin, map_location='cpu')`

and inspect `state_dict.keys()`, I observe that two parameters are missing: `lm_head.decoder.weight` and `lm_head.decoder.bias`.
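For reference, this is roughly how I am inspecting the checkpoint (the path here is just a placeholder for my own file):

```python
import torch

# Load the raw checkpoint on CPU and look at which lm_head / embedding
# parameters were actually serialized
state_dict = torch.load("path/to/pytorch_model.bin", map_location="cpu")

for key in state_dict.keys():
    if "lm_head" in key or "word_embeddings" in key:
        print(key, tuple(state_dict[key].shape))

# In my case, lm_head.decoder.weight and lm_head.decoder.bias do not appear,
# while lm_head.bias and roberta.embeddings.word_embeddings.weight do
```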

From what I understand, `lm_head.decoder.weight = roberta.embeddings.word_embeddings.weight`, which is due to the tied weights.
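As a sanity check on the tying side, a freshly loaded model (using roberta-base here only as an example) shows that the two tensors share the same storage:

```python
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")

# The decoder weight is literally the same tensor as the input embedding matrix,
# so only one copy needs to live in the checkpoint
print(
    model.lm_head.decoder.weight.data_ptr()
    == model.roberta.embeddings.word_embeddings.weight.data_ptr()
)  # True
```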

What I have not figured out is where the `lm_head.decoder.bias` parameters are stored. Are they unused parameters?

Thanks for your help!

Those are not saved since they are tied weights. You should use `strict=False` when loading the model, then retie the weights with `roberta.tie_weights()` (or just use `from_pretrained`, which will do all of that for you).
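Something along these lines (the paths are placeholders for your own checkpoint):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Option 1: load the state dict manually, then retie the shared weights
config = RobertaConfig.from_pretrained("path/to/checkpoint")
model = RobertaForMaskedLM(config)
state_dict = torch.load("path/to/checkpoint/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # ignore the missing lm_head.decoder.* keys
model.tie_weights()  # point lm_head.decoder.weight back at the input embeddings

# Option 2: let from_pretrained do the loading and the tying for you
model = RobertaForMaskedLM.from_pretrained("path/to/checkpoint")
```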

Thank you for your reply. The two ways you describe indeed give me the same results: calling `from_pretrained`, or `load_state_dict` with `strict=False` followed by `tie_weights()`. In both cases the parameters of `lm_head.decoder` are the same, both in terms of weight (same as `embeddings.word_embeddings.weight`) and bias.

What still puzzles me is what the bias parameter is tied to (or where it is stored, since it is not in the `state_dict` keys); I could not figure this out from the source code.

If, for example, I set `lm_head.decoder.bias` to zeros, the model output is no longer the same.

And if I check the gradients during the backward pass, `param.grad` for `lm_head.decoder.bias` is neither None nor zero, which would mean it is also trained.
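Roughly what I did, using roberta-base as a stand-in for my own checkpoint and simply reusing the input ids as labels to get a loss:

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

model = RobertaForMaskedLM.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")

# Zeroing the decoder bias changes the output, so it is definitely used
with torch.no_grad():
    logits_before = model(**inputs).logits.clone()
    model.lm_head.decoder.bias.zero_()
    logits_after = model(**inputs).logits
print(torch.allclose(logits_before, logits_after))  # False

# The bias also receives a gradient, so it is trained
model = RobertaForMaskedLM.from_pretrained("roberta-base")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()
print(model.lm_head.decoder.bias.grad is not None)        # True
print(model.lm_head.decoder.bias.grad.abs().sum())        # non-zero
```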

Going through all parameters with `torch.allclose` against `lm_head.decoder.weight` and `lm_head.decoder.bias` seems to have given me the answer:

`torch.allclose(roberta.embeddings.word_embeddings.weight, lm_head.decoder.weight)` returns True

`torch.allclose(lm_head.bias, lm_head.decoder.bias)` returns True

So it seems that `lm_head.decoder.bias` and `lm_head.bias` are tied … is that right?
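For reference, the check I ran looked more or less like this (sketched from memory, so details may differ slightly):

```python
import torch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
decoder_weight = model.lm_head.decoder.weight
decoder_bias = model.lm_head.decoder.bias

# Look for every named parameter holding the same values as the decoder weight/bias
for name, param in model.named_parameters():
    if param.shape == decoder_weight.shape and torch.allclose(param, decoder_weight):
        print("same as decoder.weight:", name)  # roberta.embeddings.word_embeddings.weight
    if param.shape == decoder_bias.shape and torch.allclose(param, decoder_bias):
        print("same as decoder.bias:", name)    # lm_head.bias
```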

Do you know why this bias parameter is replicated?

This is a very hacky way of ensuring `lm_head.bias` gets resized when we add new tokens, since only `lm_head.decoder` gets passed to that function, coupled with some technical debt of having created that weight in the first place instead of just relying on `lm_head.decoder`. Not the nicest part of the library.
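Concretely, something like this, using roberta-base as an example (exact behaviour may vary a bit between versions of the library):

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

model = RobertaForMaskedLM.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Add a token: the resize goes through lm_head.decoder, and because
# lm_head.bias is the decoder's bias, it grows along with it
tokenizer.add_tokens(["<new_token>"])
model.resize_token_embeddings(len(tokenizer))

print(model.roberta.embeddings.word_embeddings.weight.shape)  # vocab dimension grew by one
print(model.lm_head.decoder.weight.shape)
print(model.lm_head.bias.shape)
```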

Haha, sorry for digging this up!

But I needed to clarify that, so thank you very much for your help.
