Why are positional embeddings implemented as just simple embeddings?

In theory, the trigonometric (sinusoidal) functions can generalize to positions not seen at training time. They also let the model rely on relative rather than absolute positions, and as such their dot product can be computed more efficiently, as shown in the Transformer-XL paper.
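
For reference, here is a minimal sketch of the fixed sinusoidal encoding from the original Transformer paper, assuming PyTorch and an even `d_model` (the function name is just illustrative):

```python
import math
import torch

def sinusoidal_positions(max_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)        # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                  # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # fixed (no trainable parameters), so it can be computed for any max_len
```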

On the other hand, learned index embeddings add more parameters, which might enable the model to learn faster in some situations.
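
The learned variant is essentially just a lookup table indexed by position, roughly like this (a sketch; the sizes are hypothetical, BERT-base-like):

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 768                            # hypothetical sizes
position_embeddings = nn.Embedding(max_len, d_model)   # one trainable vector per position index

position_ids = torch.arange(max_len)                   # 0, 1, ..., max_len - 1
pos_emb = position_embeddings(position_ids)            # (max_len, d_model), trained with the rest of the model
# Positions >= max_len have no row in the table, so the model cannot extrapolate past them.
```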

As with many other things, it really depends on your use case :wink:
