State-of-the-art technique for initializing an Embedding matrix?

What are your thoughts on the state-of-the-art technique for initializing Embedding weight matrices? Currently, PyTorch uses a normal distribution to initialize these. Does using Kaiming init make more sense?
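For reference, a minimal sketch of the two options being compared (the sizes are arbitrary, and the fan mode / nonlinearity passed to Kaiming are just one possible choice, not a recommendation):

```python
import torch.nn as nn

vocab_size, d_model = 10_000, 512  # arbitrary sizes for illustration

# nn.Embedding's default: weight drawn from N(0, 1)
emb = nn.Embedding(vocab_size, d_model)

# Re-initializing the same weight with Kaiming (He) init instead
nn.init.kaiming_normal_(emb.weight, mode='fan_out', nonlinearity='relu')
```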

From what I remember, Transformer modules should use Xavier init by default. I don’t remember the reason why, though, nor whether Kaiming is a better choice.


Transformer uses Xavier init.
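The Xavier init usually meant for Transformers is the pattern from the original paper's implementations: apply it to every parameter tensor with more than one dimension and leave biases at their defaults. A sketch using torch.nn's built-in Transformer:

```python
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2)

# Xavier-uniform on every weight matrix (dim > 1); biases are left alone
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
```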
So Kaiming init is preferred for the Embedding matrix in an RNN, and Xavier is preferred in the case of a Transformer? Am I correct to say this?

Based on BERT's init_weights, BERT initializes Linear and Embedding weights from a normal distribution with mean 0 and std 0.02.
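Roughly, that init_weights hook does something like the following (a paraphrase, with the 0.02 initializer_range hard-coded instead of read from the config; LayerNorm is left at its own defaults here):

```python
import torch.nn as nn

def bert_style_init(module, std=0.02):
    # Linear and Embedding weights ~ N(0, std); Linear biases zeroed
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# model.apply(bert_style_init)  # recursively visits every submodule
```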

BTW, I tried using Kaiming (the PyTorch default initialization) on the Linear and Embedding layers in a toy task with a 2-layer transformer, and it gave slightly better performance. I won't say it is definitely better than Xavier, but it is worth trying.
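If anyone wants to run the same kind of comparison, here is a hypothetical helper that re-initializes Linear and Embedding weights with Kaiming-normal; the nonlinearity argument is my assumption, not necessarily what was used above:

```python
import torch.nn as nn

def kaiming_init(module):
    # Kaiming-normal on Linear and Embedding weights, zero biases;
    # swap in nn.init.xavier_uniform_ here to compare against Xavier
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# model.apply(kaiming_init)
```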
