We use positional encodings (PE) because word-embedding methods like word2vec lose the position of a token in a sequence, so we need a way to include position information in the embeddings. The simplest approach would be to add the normalized positions to the word-embedding vector. But how did the authors of the Transformer paper conclude that a combination of sine and cosine functions works here? What are the design goals of a PE method?
You can read section 3.5 of the original Transformer paper. I don't see a deep motivation there. The authors compare the sinusoidal encoding with learned positional embeddings and report nearly identical results, so the parameter-free choice is natural. However, later models such as BERT typically use learned positional embeddings instead.
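For reference, here is a minimal NumPy sketch of the sinusoidal encoding defined in section 3.5 of the paper (function and variable names are my own):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encoding from section 3.5 of
    "Attention Is All You Need":
        PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]            # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # shape (1, d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)    # broadcast to (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=128)
```

Each position gets a fixed vector of sines and cosines at geometrically spaced frequencies, and this vector is simply added to the token embedding before the first attention layer.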
From Kazemnejad's blog post:
The first idea that might come to mind is to assign a number to each time-step within the [0, 1] range, where 0 corresponds to the first word and 1 to the last time-step. Can you see what issues this would cause? One problem it introduces is that you can't tell how many words are present within a given range. In other words, the time-step delta doesn't have a consistent meaning across sentences of different lengths.
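The inconsistency described above is easy to show numerically; in this sketch (the helper name is my own), the step size between adjacent tokens depends on sentence length:

```python
def normalized_positions(n_tokens):
    # Assign each token a position in [0, 1]:
    # the first word gets 0, the last word gets 1.
    return [t / (n_tokens - 1) for t in range(n_tokens)]

short = normalized_positions(4)   # a 4-word sentence
long = normalized_positions(11)   # an 11-word sentence

# The delta between adjacent tokens differs per sentence:
# short sentence: 1/3 per step; long sentence: 1/10 per step.
# So "a distance of 0.1" means one word in one sentence and
# less than one word in another.
```

This is exactly why a model cannot interpret such a position value consistently across sentences, which motivates the fixed-frequency sinusoidal scheme instead.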