Hello,
I am training a model based on RobertaForMaskedLM. After saving it I get a pytorch_model.bin file from which I want to retrieve certain layer parameters.
After I do
state_dict = torch.load(path_to_pytorch_model.bin, map_location='cpu')
and inspect state_dict.keys(), I observe that two parameters are missing: lm_head.decoder.weight and lm_head.decoder.bias.
From what I understand, lm_head.decoder.weight = roberta.embeddings.word_embeddings.weight, which is due to the tied weights.
What I have not figured out is where the parameters of lm_head.decoder.bias are stored. Are these unused parameters?
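For reference, here is roughly what I run to see which keys are missing (the checkpoint path is a placeholder, and the fresh model is only there to get the expected key names):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Load the raw checkpoint file (placeholder path).
state_dict = torch.load("path_to/pytorch_model.bin", map_location="cpu")

# A freshly initialized model, just to list the keys it expects.
reference = RobertaForMaskedLM(RobertaConfig())

missing = set(reference.state_dict().keys()) - set(state_dict.keys())
print(sorted(missing))
# I see ['lm_head.decoder.bias', 'lm_head.decoder.weight'] here
```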
Thanks for your help!
Those are not used since they are tied weights. You should use strict=False when loading the model, then retie the weights with roberta.tie_weights() (or just use from_pretrained, which will do all of that for you).
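In code, both options look roughly like this (the directory path is a placeholder for wherever you saved your model):

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

# Option 1: from_pretrained handles the missing tied keys and re-ties everything.
model = RobertaForMaskedLM.from_pretrained("path_to/your_model_dir")

# Option 2: load the state dict manually, ignore the two missing keys, then retie.
config = RobertaConfig.from_pretrained("path_to/your_model_dir")
model = RobertaForMaskedLM(config)
state_dict = torch.load("path_to/your_model_dir/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.tie_weights()  # reties lm_head.decoder.weight to the word embeddings
```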
Thank you for your reply. The two ways you describe indeed give me the same results: whether I call from_pretrained, or load_state_dict with strict=False followed by tie_weights(), the parameters for lm_head.decoder are the same, both in terms of weight (same as that of embeddings.word_embeddings) and bias.
What still puzzles me is what the bias parameter is tied to (or where it is stored, since it is not in the state_dict keys); I could not figure this out from the source code.
If, for example, I set lm_head.decoder.bias to zeros, the model output is no longer the same.
And if I check the gradients during the backward pass, param.grad for lm_head.decoder.bias is neither None nor zero, which would mean it is also trained.
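Roughly what I checked, with a randomly initialized model just for illustration:

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

model = RobertaForMaskedLM(RobertaConfig())  # random init, enough for a gradient check

# Dummy batch: use the inputs themselves as MLM labels so every position contributes.
input_ids = torch.randint(0, model.config.vocab_size, (1, 16))
outputs = model(input_ids=input_ids, labels=input_ids)
outputs.loss.backward()

grad = model.lm_head.decoder.bias.grad
print(grad is None)                 # False
print(torch.all(grad == 0).item())  # False, so the bias does receive gradients
```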
Going through all parameters with torch.allclose against lm_head.decoder.weight and lm_head.decoder.bias seems to have given me the answer:
torch.allclose(roberta.embeddings.word_embeddings.weight, lm_head.decoder.weight) = True
torch.allclose(lm_head.bias, lm_head.decoder.bias) = True
So it seems that lm_head.decoder.bias and lm_head.bias are tied… is that right?
Do you know why this bias parameter is replicated?
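For reference, the check that gave me this looks roughly like the following (using roberta-base just for illustration); the identity test at the end seems to confirm they are literally the same Parameter objects:

```python
import torch
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")

target_w = model.lm_head.decoder.weight
target_b = model.lm_head.decoder.bias
for name, param in model.named_parameters():
    if param.shape == target_w.shape and torch.allclose(param, target_w):
        print(name, "matches lm_head.decoder.weight")
    if param.shape == target_b.shape and torch.allclose(param, target_b):
        print(name, "matches lm_head.decoder.bias")

# The tie is by object identity, not just equal values:
print(model.lm_head.decoder.weight is model.roberta.embeddings.word_embeddings.weight)  # True
print(model.lm_head.decoder.bias is model.lm_head.bias)                                 # True
```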
This is a very hacky way of ensuring lm_head.bias gets resized when we add new tokens, since only lm_head.decoder gets passed to that function, coupled with some technical debt of having created that weight in the first place instead of just relying on lm_head.decoder. Not the nicest part of the library.
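You can see the effect by resizing the embeddings and looking at the bias shapes (a rough sketch; the exact mechanics may differ between versions of the library):

```python
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")
print(model.lm_head.bias.shape, model.lm_head.decoder.bias.shape)  # both vocab_size = 50265

model.resize_token_embeddings(model.config.vocab_size + 10)  # pretend we added 10 tokens
print(model.lm_head.bias.shape, model.lm_head.decoder.bias.shape)  # both should now be 50275
```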
Ahah, sorry for digging into this!
But I needed to clarify that, so thank you very much for helping.