We use positional encodings (PE) because word-embedding methods like word2vec lose the position of a token in a sequence, so we need a way to include position information in the embeddings. The simplest approach would be to add the normalized positions to the word-embedding vector. But how did the authors of the Transformer paper conclude that a combination of sine and cosine functions works here? What are the design goals of a PE method?
You can read section 3.5 of the original Transformer paper. I don't see a deep motivation there. The authors compare the sinusoidal encoding with learned positional embeddings and report nearly identical results, so the parameter-free choice is natural. However, later models such as BERT typically use learned positional embeddings instead.
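For reference, here is a minimal NumPy sketch of the sinusoidal encoding defined in section 3.5 of the paper (function and variable names are my own):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encoding from section 3.5 of
    "Attention Is All You Need":
        PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]            # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # shape (1, d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)    # broadcast to (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=128)
```

Each position gets a fixed vector of sines and cosines at geometrically spaced frequencies, and this vector is simply added to the token embedding before the first attention layer.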
From Kazemnejad's blog post:
The first idea that might come to mind is to assign a number to each time-step within the [0, 1] range, where 0 corresponds to the first word and 1 to the last time-step. Can you see what issues this would cause? One problem it introduces is that you can't tell how many words are present within a given range. In other words, the time-step delta doesn't have a consistent meaning across sentences of different lengths.
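The inconsistency described above is easy to show numerically; in this sketch (the helper name is my own), the step size between adjacent tokens depends on sentence length:

```python
def normalized_positions(n_tokens):
    # Assign each token a position in [0, 1]:
    # the first word gets 0, the last word gets 1.
    return [t / (n_tokens - 1) for t in range(n_tokens)]

short = normalized_positions(4)   # a 4-word sentence
long = normalized_positions(11)   # an 11-word sentence

# The delta between adjacent tokens differs per sentence:
# short sentence: 1/3 per step; long sentence: 1/10 per step.
# So "a distance of 0.1" means one word in one sentence and
# less than one word in another.
```

This is exactly why a model cannot interpret such a position value consistently across sentences, which motivates the fixed-frequency sinusoidal scheme instead.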