Why are OPT's token embeddings not scaled by sqrt(dim) as in the original OPT implementation?

In the original OPT implementation from metaseq, token embeddings are multiplied by sqrt(embedding_dim):

However, I couldn’t find this scaling in HF’s implementation. Is there a reason for not having it?

Hi,

The self.embed_scale attribute is set to 1.0, hence no scaling is happening (args.no_scale_embedding is set to True). See:
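To make the logic concrete, here is a minimal pure-Python sketch of how such a flag would control the scale factor. It is not the metaseq or HF source; the function names embed_scale and scale_embedding and the toy vector are made up for illustration, and only the flag name no_scale_embedding and the sqrt(embedding_dim) factor come from the discussion above.

```python
import math


def embed_scale(embedding_dim: int, no_scale_embedding: bool) -> float:
    """Return the factor applied to token embeddings (hypothetical helper).

    Assumption: scaling is skipped (factor 1.0) when the flag is True,
    otherwise the factor is sqrt(embedding_dim).
    """
    return 1.0 if no_scale_embedding else math.sqrt(embedding_dim)


def scale_embedding(vector, scale):
    """Multiply every component of a toy embedding vector by the scale."""
    return [scale * x for x in vector]


# Toy 4-dimensional "embedding" so that sqrt(dim) is exactly 2.0.
token_vector = [0.5, -1.0, 2.0, 0.25]
dim = len(token_vector)

# Flag False: embeddings are multiplied by sqrt(dim).
scaled = scale_embedding(token_vector, embed_scale(dim, no_scale_embedding=False))

# Flag True: the factor is 1.0, so the embeddings are left unchanged.
unscaled = scale_embedding(token_vector, embed_scale(dim, no_scale_embedding=True))

print(scaled)
print(unscaled)
```

With the flag set to True the multiplication is a no-op, so an implementation that omits the scaling entirely would behave the same way.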

It seems that no_scale_embedding is set to False here:

and here:

Or is it overwritten by a config file?

Pinging @ybelkada here