In the original OPT implementation from metaseq, token embeddings are multiplied by sqrt(embedding_dim):
However, I couldn’t find this scaling in HF’s implementation. Is there a reason for not having it?
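For context, the pattern I mean looks roughly like this (a minimal sketch with toy sizes, not the actual metaseq code):

```python
import math

import torch
import torch.nn as nn

# toy sizes, just for illustration; OPT's real dimensions differ
vocab_size, embed_dim = 100, 16
embed_tokens = nn.Embedding(vocab_size, embed_dim)

input_ids = torch.tensor([[1, 2, 3]])
# metaseq-style: token embeddings get multiplied by sqrt(embedding_dim)
scaled = math.sqrt(embed_dim) * embed_tokens(input_ids)
```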
Hi,
The self.embed_scale attribute is set to 1.0, hence no scaling happens (args.no_scale_embedding is set to True). See:
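In other words, the logic amounts to something like this (a rough sketch of the flag's effect, not the actual metaseq source):

```python
import math

embed_dim = 768            # illustrative; not necessarily OPT's actual hidden size
no_scale_embedding = True  # what the OPT args reportedly end up with

# fairseq/metaseq-style: the scale is sqrt(embed_dim) unless the flag disables it,
# so with no_scale_embedding=True the embeddings are effectively left unscaled
embed_scale = 1.0 if no_scale_embedding else math.sqrt(embed_dim)
print(embed_scale)  # -> 1.0
```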
It seems that no_scale_embedding is set to False here:
and here:
Or is it overwritten by a config file?
Pinging @ybelkada here