Why is the lm_head layer in GPT2LMHeadModel not a parameter?

Hi all, I recently raised an issue asking why the lm_head layer is not a parameter. The response I got was that “It is tied to the input layer”. After reading the docs as suggested, I found that the lm_head layer is a “linear layer with weights tied to the input embeddings”.

I still don’t understand what that means, as I thought the lm_head layer would output a tensor shaped (*, vocab_size) whereas the embedding is shaped (vocab_size, embedding_size)?

Does that mean if I want to fine-tune the lm_head layer, I would need to fine-tune the embedding layer (wte)?

The embedding matrix has shape (vocab_size, embedding_size). The lm_head linear layer maps embedding_size inputs to vocab_size outputs, so mathematically its weight matrix has shape (embedding_size, vocab_size), which is the transpose of the embedding matrix. In PyTorch, the weights of a linear layer are stored transposed, as (out_features, in_features), so here they have shape (vocab_size, embedding_size) and you can just use the same matrix for both.
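
To make the shapes concrete, here is a minimal sketch of weight tying in plain PyTorch. It is not the actual GPT2LMHeadModel code: the toy sizes and the `nn.ModuleDict` wrapper are assumptions chosen only for illustration, while the names `wte` and `lm_head` follow the thread.

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 10, 4  # toy sizes, illustration only

wte = nn.Embedding(vocab_size, embedding_size)
lm_head = nn.Linear(embedding_size, vocab_size, bias=False)

# nn.Linear stores its weight as (out_features, in_features),
# i.e. (vocab_size, embedding_size): the same shape as the embedding matrix.
assert lm_head.weight.shape == wte.weight.shape

# Tie the weights: both modules now hold the same Parameter object.
lm_head.weight = wte.weight

# The shared tensor is listed only once among the parameters.
model = nn.ModuleDict({"wte": wte, "lm_head": lm_head})
param_names = [name for name, _ in model.named_parameters()]

# A change made through one module is visible through the other.
with torch.no_grad():
    wte.weight[0, 0] = 1.0
shared = lm_head.weight[0, 0].item() == 1.0

# The head still maps hidden states to vocabulary-sized outputs.
hidden = torch.randn(2, embedding_size)
logits = lm_head(hidden)  # shape (2, vocab_size)
```

In this sketch the only parameter reported is `wte.weight`, which would be consistent with lm_head not showing up as a separate parameter. It also suggests that, with tied weights, updating lm_head and updating wte are the same thing, since there is only one tensor to update.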

As for the why: if the word “the” is encoded as [0.1, -0.3, 0.2] (for instance) and the model outputs [0.099, -0.302, 0.18], you probably want to predict something very close to “the”, which is why we use the same weights. That way the model only learns one representation of embedding vectors. This trick was first introduced for LSTMs a while ago.

Hi thanks for your explanation, I understand the first part now but still a bit uncertain about why this is the case.

If I understood your example correctly, if the word “the” has the embedding [0.1, -0.3, 0.2] in the embedding matrix, and the output of the decoder (before feeding it into the lm head) is the vector [0.099, -0.302, 0.18], we would want to predict something close to “the”. What I don’t understand, however, is why using the transposed embedding matrix would be better than initialising another matrix to achieve this. What do you mean by only having to learn one representation of embedding vectors?

Thanks in advance!

Well, another matrix would give you random results. When you use the embedding matrix itself, applying it to a hidden state h gives a vector of size vocab_size whose coordinate i is highest when h is similar to e_{i}, the i-th vector of the embedding matrix. Why? If you do the math, the coordinates of the result are the dot products h . e_{i}, and, as long as every vector is bounded, h . e_{i} is highest when the cosine between h and e_{i} is close to 1.
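
The dot-product argument can be checked by hand on the example vectors from earlier in the thread. This is only a sketch in plain Python: the embedding for “the” and the hidden state h come from the posts above, while the “cat” and “sat” rows are made-up values added so there is something to compare against.

```python
import math

# Toy embedding matrix, one row per token.
# "the" is taken from the thread; "cat" and "sat" are invented for illustration.
embeddings = {
    "the": [0.1, -0.3, 0.2],
    "cat": [-0.2, 0.1, 0.3],
    "sat": [0.3, 0.2, -0.1],
}

# Hidden state produced by the decoder, as in the example above.
h = [0.099, -0.302, 0.18]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# With tied weights, coordinate i of the output is the dot product h . e_i.
logits = {token: dot(h, e) for token, e in embeddings.items()}
cosines = {token: cosine(h, e) for token, e in embeddings.items()}

# The token with the largest logit is the predicted one.
predicted = max(logits, key=logits.get)
```

Here the largest logit belongs to “the”, whose embedding is almost parallel to h. An independent, randomly initialised output matrix would have no such relationship to the embeddings until training built one.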

I see, it makes total sense now. Thanks a lot!
