State of the art technique for initializing Embedding Matrix?

Based on BERT's `init_weights`, BERT initializes the weights of Linear and Embedding layers from a normal distribution with mean 0 and std 0.02 (the `initializer_range` config value).
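A minimal sketch of that BERT-style init in PyTorch (function name and the toy model are mine, not from the BERT source):

```python
import torch.nn as nn

def init_weights(module, std=0.02):
    # BERT-style init: Normal(mean=0, std=0.02) for Linear and Embedding weights
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

# Example usage: apply recursively to every submodule
model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))
model.apply(init_weights)
```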

BTW, I tried Kaiming initialization (PyTorch's default for Linear layers) on both the Linear and Embedding layers of a 2-layer transformer on my toy task, and it gave slightly better performance. I wouldn't claim it's definitively better than Xavier, but it's definitely worth trying.
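For reference, here is a sketch of what "Kaiming on Linear and Embedding" could look like. PyTorch's default Linear init is `kaiming_uniform_` with `a=sqrt(5)`; extending it to the embedding matrix (treating it like a Linear weight) is my assumption about the setup, not something PyTorch does by default (Embedding defaults to `N(0, 1)`):

```python
import math
import torch.nn as nn

def kaiming_init(module):
    # Kaiming-uniform with a=sqrt(5), matching nn.Linear's default reset_parameters;
    # applied to Embedding weights as well (fan_in is taken from the last dim)
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.kaiming_uniform_(module.weight, a=math.sqrt(5))
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)

emb = nn.Embedding(1000, 64)
kaiming_init(emb)
```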
