Why is the lm_head layer in GPT2LMHeadModel not a parameter?

itsmejim · August 8, 2020, 12:11pm

Hi all, I recently raised an issue asking why the lm_head layer is not a parameter. The response I got was that “it is tied to the input layer”. After reading the docs as suggested, I found that the lm_head layer is a “linear layer with weights tied to the input embeddings”.

I still don’t understand what that means, as I thought the lm_head layer would output a tensor shaped (*, vocab_size) whereas the embedding is shaped (vocab_size, embedding_size)?

Does that mean if I want to fine-tune the lm_head layer, I would need to fine-tune the embedding layer (wte)?

sgugger · August 10, 2020, 11:46am

The embedding matrix has shape (vocab_size, embedding_size). The lm_head linear layer maps a vector of size embedding_size to a vector of size vocab_size, so mathematically its weight matrix has shape (embedding_size, vocab_size), which is the shape of the transposed embedding matrix. In PyTorch, the weights of a linear layer are stored transposed, i.e. as (out_features, in_features) = (vocab_size, embedding_size), so you can just use the same matrix.
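To make the shapes concrete, here is a minimal sketch of weight tying in plain PyTorch. TinyTiedLM and its toy sizes are hypothetical and only mirror the wte / lm_head naming from this thread; this is not the actual GPT2LMHeadModel code.

```python
import torch
import torch.nn as nn


class TinyTiedLM(nn.Module):
    """Toy model (hypothetical, not GPT-2): an embedding and an output head sharing one weight."""

    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, embedding_size)
        self.lm_head = nn.Linear(embedding_size, vocab_size, bias=False)
        # nn.Linear stores its weight as (out_features, in_features),
        # i.e. (vocab_size, embedding_size): the same shape as wte.weight.
        self.lm_head.weight = self.wte.weight  # tie: both names refer to one Parameter

    def forward(self, input_ids):
        hidden = self.wte(input_ids)  # (batch, seq, embedding_size)
        return self.lm_head(hidden)   # (batch, seq, vocab_size)


model = TinyTiedLM(vocab_size=10, embedding_size=4)

# parameters() / named_parameters() skip duplicates, so the shared tensor
# is reported once, under the first name it was registered with.
param_names = [name for name, _ in model.named_parameters()]
n_params = sum(p.numel() for p in model.parameters())

logits = model(torch.tensor([[1, 2, 3]]))
logits.sum().backward()  # gradients from both uses accumulate in the one shared tensor

print(param_names)
print(n_params)
print(tuple(logits.shape))
```

In this sketch the parameter list contains only wte.weight, which is why a tied lm_head does not show up as a separate parameter. It also means that updating the head and updating the embedding are the same thing: there is only one tensor to train.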

As for the why: if the word "the" is encoded as [0.1, -0.3, 0.2] (for instance) and you predict [0.099, -0.302, 0.18], you probably want to predict something very close to "the", which is why we use the same weights. That way the model only learns one representation of embedding vectors. This trick was first introduced for LSTMs a while ago.

itsmejim · August 13, 2020, 12:07pm

Hi, thanks for your explanation. I understand the first part now, but I'm still a bit uncertain about why this is the case.

If I understood your example correctly, if the word "the" has the embedding [0.1, -0.3, 0.2] in the embedding matrix, and the output of the decoder (before feeding it into the lm head) is the vector [0.099, -0.302, 0.18], we would want to predict something close to "the". What I don’t understand, however, is why using the transposed embedding matrix would be better than initialising another matrix to achieve this. What do you mean by only having to learn one representation of embedding vectors?

Thanks in advance!

sgugger · August 13, 2020, 12:22pm

Well, another matrix would give you random results (at least until it has been trained). By using the same matrix as the embedding matrix, the result when applied to a hidden state h is a vector of size vocab_size whose coordinate i is the highest when h is similar to e_{i}, the i-th vector of the embedding matrix. Why? If you do the math, the coordinates of the lm_head output are the dot products h . e_{i} = |h| |e_{i}| cos(h, e_{i}), and, as long as every vector is bounded (so that no e_{i} wins just by having a much larger norm), this is the highest when the cosine between h and e_{i} is close to 1.
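As a small worked version of the dot-product argument, here is a sketch in plain Python. The vectors for "the" and for the hidden state come from the example earlier in this thread; the entries for "cat" and "runs" are made up for illustration and were chosen to have the same norm as "the".

```python
# Toy embedding table. "the" is the example from this thread; "cat" and "runs"
# are hypothetical vectors with the same norm, added only for comparison.
embeddings = {
    "the": [0.1, -0.3, 0.2],
    "cat": [-0.2, 0.1, 0.3],
    "runs": [0.3, 0.2, -0.1],
}

# Hidden state produced by the model just before the lm_head.
h = [0.099, -0.302, 0.18]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# With tied weights, the logit for token i is simply h . e_i.
logits = {token: dot(h, e) for token, e in embeddings.items()}
best = max(logits, key=logits.get)

print(logits)
print(best)
```

Because h is almost identical to the embedding of "the", that token gets by far the largest logit. With a separate, untrained output matrix there would be no such relationship between the hidden state and the scores.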

itsmejim · August 13, 2020, 2:14pm

I see, it makes total sense now thanks a lot!

siarez · September 29, 2023, 12:29am

itsmejim wrote: “What I don’t understand however, is why using the transposed embedding matrix would be better than initialising another matrix to achieve this.”

You can totally initialize another matrix for the LM head. In fact not all models tie the LM head weights to the embedding weights. For example, GPT-J has separate weights for its LM head and its embedding table.
