State-of-the-art technique for initializing an Embedding matrix?

What are your thoughts on the state-of-the-art technique for initializing Embedding weight matrices? Currently, PyTorch uses a normal distribution to initialize these. Does using Kaiming init make more sense?
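For reference, a minimal sketch of the two options being compared (the sizes are arbitrary, and the fan mode / nonlinearity passed to Kaiming are just one possible choice, not a recommendation):

```python
import torch.nn as nn

vocab_size, d_model = 10_000, 512  # arbitrary sizes for illustration

# nn.Embedding's default: weight drawn from N(0, 1)
emb = nn.Embedding(vocab_size, d_model)

# Re-initializing the same weight with Kaiming (He) init instead
nn.init.kaiming_normal_(emb.weight, mode='fan_out', nonlinearity='relu')
```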

From what I remember, Transformer modules should use Xavier init by default. I don’t remember the reason why, though, nor whether Kaiming is a better choice.


Transformer uses Xavier init.
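The Xavier init usually meant for Transformers is the pattern from the original paper's implementations: apply it to every parameter tensor with more than one dimension and leave biases at their defaults. A sketch using torch.nn's built-in Transformer:

```python
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2)

# Xavier-uniform on every weight matrix (dim > 1); biases are left alone
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
```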
So Kaiming init is preferred for the Embedding matrix in an RNN, and Xavier is preferred in the case of a Transformer? Am I correct to say this?

Based on BERT's init_weights, BERT initializes Linear and Embedding weights from a normal distribution with mean 0 and std 0.02.
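Roughly, that init_weights hook does something like the following (a paraphrase, with the 0.02 initializer_range hard-coded instead of read from the config; LayerNorm is left at its own defaults here):

```python
import torch.nn as nn

def bert_style_init(module, std=0.02):
    # Linear and Embedding weights ~ N(0, std); Linear biases zeroed
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# model.apply(bert_style_init)  # recursively visits every submodule
```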

BTW, I tried using Kaiming (the PyTorch default initialization) on the Linear and Embedding layers in a toy task with a 2-layer transformer, and it gave slightly better performance. I won't say it is definitely better than Xavier, but it is worth trying.
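If anyone wants to run the same kind of comparison, here is a hypothetical helper that re-initializes Linear and Embedding weights with Kaiming-normal; the nonlinearity argument is my assumption, not necessarily what was used above:

```python
import torch.nn as nn

def kaiming_init(module):
    # Kaiming-normal on Linear and Embedding weights, zero biases;
    # swap in nn.init.xavier_uniform_ here to compare against Xavier
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# model.apply(kaiming_init)
```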
