I can’t figure out why the positional embeddings are implemented as just a vanilla Embedding layer in both the PyTorch and TensorFlow versions. Based on my current understanding, positional embeddings should be implemented as non-trainable sin/cos encodings or as axial positional encodings (from Reformer).
Can anyone please enlighten me on this? Thank you so much!
Hi @miguelvictor! Both are valid strategies: IIRC the original Transformer paper used fixed (non-trainable) sinusoidal embeddings, while BERT learns a full vector for each of the 512 expected positions.
Currently, the Transformers library has sinusoidal embeddings in the TransfoXL model; check it out!
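To make the difference concrete, here’s a rough, untested sketch of the fixed sin/cos table from the original paper (the function name is mine, and it assumes an even `d_model`):

```python
import math
import torch


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos table of shape (max_len, d_model); nothing here is trained."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # Frequencies 1 / 10000^(2i / d_model), one per pair of dimensions.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions get sin
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions get cos
    return pe


# In a model you would typically register this as a non-trainable buffer, e.g.
# self.register_buffer("pe", sinusoidal_positional_encoding(512, 768))
```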
Thanks @yjernite - curious, what are the pros and cons of the two approaches? When should one choose trigonometric functions vs. applying an nn.Embedding to an ordinal index?
It’s not always clear what the “expected positions” are for a specific task, so there are no strong guarantees. In general, if your model relies on absolute positions within a fixed range, you should be fine with learned positional embeddings.
In theory, the trigonometric functions can generalize to positions beyond those seen at training time. They also allow the model to rely on relative rather than absolute positions, and as a result their dot products can be computed more efficiently, as shown in the TransformerXL paper.
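For example, here is a small, purely illustrative check of that relative-position property (all names and dims are made up): the sin/cos encoding at position pos + k is a fixed per-frequency rotation of the encoding at pos, independent of pos itself.

```python
import math
import torch

# Illustrative dims: 8-dim encoding, offset of 5 positions.
d_model, k = 8, 5
freqs = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
)

def pe(pos: int) -> torch.Tensor:
    """(sin, cos) pair for each frequency at a given position."""
    angles = pos * freqs
    return torch.stack([torch.sin(angles), torch.cos(angles)], dim=-1)  # (d_model//2, 2)

# 2x2 rotation per frequency; note it depends only on the offset k, not on pos.
rot = torch.stack(
    [
        torch.stack([torch.cos(k * freqs), torch.sin(k * freqs)], dim=-1),
        torch.stack([-torch.sin(k * freqs), torch.cos(k * freqs)], dim=-1),
    ],
    dim=-2,
)  # (d_model//2, 2, 2)

for pos in (0, 3, 100, 1000):
    shifted = torch.einsum("fij,fj->fi", rot, pe(pos))
    assert torch.allclose(shifted, pe(pos + k), atol=1e-3)
```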
On the other hand, the learned index embeddings offer more parameters, which might enable the model to learn faster in some situations.
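For comparison, the learned variant is essentially just an index lookup with a hard ceiling on the position. A rough BERT-style sketch (illustrative sizes, not the library’s actual code):

```python
import torch
import torch.nn as nn

max_position_embeddings = 512  # hard maximum, fixed at training time
hidden_size = 768

# One trainable vector per absolute position.
position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)

input_ids = torch.randint(0, 30_000, (1, 128))               # (batch, seq_len)
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, ..., seq_len - 1
pos_emb = position_embeddings(position_ids)                  # (1, 128, 768), trainable

# Asking for position >= 512 would raise an index error: no free extrapolation here.
```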
As with many other things, it really depends on your use case.
Just curious: did previous implementations of BERT from HuggingFace use any sin/cos in the positional embeddings? It sounds like the vanilla Embedding layer was added later in place of the sin/cos positional embeddings, since it seems to have more pros.