I have a question from reading the GPT-2 paper.
Isn't the residual connection just a path that carries the output of an earlier layer forward to a later layer, rather than a block with values to be learned? That value is simply the output of a particular layer, so I don't understand why it needs to be initialized at all.
Also, the paper says these weights are scaled by 1/sqrt(N) at initialization:

p.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.n_layer)))

but I don't see how this differs from keeping the standard initialization and just multiplying the previous layer's output by 1/sqrt(N).
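To make the comparison concrete, here is a minimal NumPy sketch of the two options for a single linear projection. The dimensions, `n_layer`, and `sigma` are assumed values in the style of GPT-2 small (not taken from the paper's code), and the 2*n_layer factor follows the quoted snippet (two residual additions per transformer block):

```python
import numpy as np

rng = np.random.default_rng(0)

n_layer = 12   # assumed depth, as in GPT-2 small
sigma = 0.02   # assumed initializer_range, as in GPT-2 configs
d = 768        # assumed hidden size
scale = 1.0 / np.sqrt(2 * n_layer)  # two residual adds per block

x = rng.standard_normal(d)  # stand-in for a layer input

# Option A: shrink the weights themselves at initialization.
W_scaled = rng.normal(0.0, sigma * scale, size=(d, d))
out_a = W_scaled @ x

# Option B: keep the standard init and multiply the output instead.
W_plain = rng.normal(0.0, sigma, size=(d, d))
out_b = scale * (W_plain @ x)

# At initialization the two outputs have the same statistics,
# since scaling W and scaling W @ x are the same linear operation.
print(out_a.std(), out_b.std())
```

So at initialization the two are statistically identical. The usual argument for scaling the init rather than the forward pass is that the init only sets the starting point: during training, gradient updates are free to grow the weights back if needed, whereas a permanent 1/sqrt(N) multiply in the forward pass would constrain the layer's output (and shrink its gradients) for the entire run.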