I have a question from reading the GPT-2 paper.
Isn't the residual connection just a path that carries the output of an earlier layer forward to a later layer, rather than a block with values to be learned? That value is simply the output of a particular layer, so I don't understand why it needs to be initialized at all.
Also, the paper says these weights are scaled by 1/sqrt(N) at initialization:

p.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.n_layer)))

but I don't see how this differs from keeping the standard initialization and just multiplying the previous layer's output by 1/sqrt(N).
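To make the comparison concrete, here is a minimal NumPy sketch of the two options for a single linear projection. The dimensions, `n_layer`, and `sigma` are assumed values in the style of GPT-2 small (not taken from the paper's code), and the 2*n_layer factor follows the quoted snippet (two residual additions per transformer block):

```python
import numpy as np

rng = np.random.default_rng(0)

n_layer = 12   # assumed depth, as in GPT-2 small
sigma = 0.02   # assumed initializer_range, as in GPT-2 configs
d = 768        # assumed hidden size
scale = 1.0 / np.sqrt(2 * n_layer)  # two residual adds per block

x = rng.standard_normal(d)  # stand-in for a layer input

# Option A: shrink the weights themselves at initialization.
W_scaled = rng.normal(0.0, sigma * scale, size=(d, d))
out_a = W_scaled @ x

# Option B: keep the standard init and multiply the output instead.
W_plain = rng.normal(0.0, sigma, size=(d, d))
out_b = scale * (W_plain @ x)

# At initialization the two outputs have the same statistics,
# since scaling W and scaling W @ x are the same linear operation.
print(out_a.std(), out_b.std())
```

So at initialization the two are statistically identical. The usual argument for scaling the init rather than the forward pass is that the init only sets the starting point: during training, gradient updates are free to grow the weights back if needed, whereas a permanent 1/sqrt(N) multiply in the forward pass would constrain the layer's output (and shrink its gradients) for the entire run.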