Why does GPT-2 initialize the weights of residual layers?

I had a question while reading the GPT-2 paper.


Isn’t a residual connection just a path that carries the output of an earlier layer forward to a later layer, rather than something with weights to be learned?

That value is simply the output of a specific layer, so I don’t understand why it needs to be initialized.

Also, the paper says these weights are scaled by a factor of 1/sqrt(N), which the code implements as

 p.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.n_layer)))

but I don’t see how this differs from simply multiplying the previous layer’s output by 1/sqrt(N), without any special initialization.
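
To make the comparison concrete, here is a minimal sketch in plain Python (standard library only, not the actual Hugging Face code). It draws weights directly with the reduced std from the line quoted above, and also draws them with the default std and multiplies by the scaling factor afterwards. The values `INITIALIZER_RANGE = 0.02` and `N_LAYER = 12` are assumptions chosen to resemble the default GPT-2 config, and the helper names are hypothetical.

```python
import math
import random
import statistics

# Assumed values for illustration (they resemble the default GPT-2 config).
INITIALIZER_RANGE = 0.02
N_LAYER = 12


def residual_init_std(initializer_range: float, n_layer: int) -> float:
    """Std used in the quoted line: initializer_range / sqrt(2 * n_layer).

    The factor 2 presumably reflects the two residual additions per block
    (attention and MLP); that reading is an assumption, not from the post.
    """
    return initializer_range / math.sqrt(2 * n_layer)


def sample_weights(std: float, count: int, seed: int) -> list[float]:
    """Hypothetical helper: draw `count` weights from N(0, std^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(count)]


scaled_std = residual_init_std(INITIALIZER_RANGE, N_LAYER)

# Option A: draw directly with the smaller std (what the quoted line does).
direct = sample_weights(scaled_std, 20_000, seed=0)

# Option B: draw with the default std, then multiply each weight by 1/sqrt(2 * n_layer).
factor = 1.0 / math.sqrt(2 * N_LAYER)
rescaled = [w * factor for w in sample_weights(INITIALIZER_RANGE, 20_000, seed=0)]

print("target std:   ", scaled_std)
print("direct std:   ", statistics.pstdev(direct))
print("rescaled std: ", statistics.pstdev(rescaled))
```

If this sketch is right, the two options give the same starting weights, so the scaling only changes where the learnable weights begin. That would differ from multiplying the layer output by 1/sqrt(N) in the forward pass, which keeps a fixed factor in the network for the whole of training.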
