Positional encoding

Hi,
My question is not directly related to Hugging Face; it is a general transformer-model question. Regarding the positional encoding: when I plot the sin() function for some even positions, e.g. 2, 10 and 20, I get the following plot:

According to the figure, for lower positions, like c = 2, the curve has a lower frequency compared to c = 10 or c = 20. I understand that at a given position we would like higher frequencies (faster changes) in the lower dimensions (x-axis) and lower frequencies in the higher dimensions, but I don’t understand the correlation between a change of position and the frequency of the curve.

Can someone explain that in a simple manner?


Sure! Let me explain it in a simple way.

The idea behind positional encoding in transformers is to inject some information about the position of each token in the sequence. Since transformers don’t inherently process sequential data in order, we need a way to tell them where each token is positioned relative to the others.

To achieve this, sinusoidal functions are used in the positional encoding because they have useful properties: they are bounded, require no learned parameters, and (as noted in the original Transformer paper) let the encoding of a position shifted by a fixed offset be expressed as a linear function of the original encoding.

How Positional Encoding Works:

  • Frequency of the sinusoidal functions: sine and cosine waves with different frequencies encode the positions; dimension pair i uses the angular frequency 1 / 10000^(2i / d_model).
    • Low frequencies correspond to higher dimensions (higher values of the dimension index i).
    • High frequencies correspond to lower dimensions (lower values of i).
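For concreteness, here is a minimal NumPy sketch of this encoding, following the formula from "Attention Is All You Need" (the sizes max_pos = 50 and d_model = 128 are arbitrary example choices):

```python
import numpy as np

def sinusoidal_pe(max_pos, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pe = np.zeros((max_pos, d_model))
    pos = np.arange(max_pos)[:, None]        # positions, shape (max_pos, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angle = pos / (10000 ** (i / d_model))   # frequency shrinks as i grows
    pe[:, 0::2] = np.sin(angle)              # sine in the even dimensions
    pe[:, 1::2] = np.cos(angle)              # cosine in the odd dimensions
    return pe

pe = sinusoidal_pe(max_pos=50, d_model=128)
# Dimension 0 oscillates once every 2*pi positions; the last dimension
# pair barely moves over the whole sequence.
```

Each row of `pe` is the vector added to the token embedding at that position.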

Why does this happen?

  1. High frequencies for low dimensions (x-axis): The sinusoidal functions with higher frequencies (more oscillations) are assigned to the lower dimensions. This is because we want the lower dimensions to capture fine-grained positional differences. These smaller (faster) oscillations allow the model to distinguish between positions that are close together.
  2. Low frequencies for high dimensions (x-axis): Conversely, the lower frequencies (fewer oscillations) are assigned to the higher dimensions. This reflects the idea that larger positional differences (i.e., tokens that are farther apart) need to be represented with more global, slower changes, which are captured by low-frequency sinusoidal waves.
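These two points can be checked numerically. A small sketch, assuming the standard encoding and an example width d_model = 512 (the positions 5, 6 and 5000 are arbitrary illustrations):

```python
import numpy as np

d_model = 512

def component(pos, i):
    """The sin() value at even dimension index i of the standard encoding."""
    return np.sin(pos / 10000 ** (i / d_model))

# Dimension 0 oscillates fast: neighbouring positions get clearly
# different values.
fine = abs(component(5, 0) - component(6, 0))

# Dimension 510 oscillates very slowly: neighbours look almost identical,
# but positions far apart still end up with different values.
coarse_near = abs(component(5, 510) - component(6, 510))
coarse_far = abs(component(5, 510) - component(5000, 510))

print(fine, coarse_near, coarse_far)
```

The fast dimension separates adjacent tokens; the slow dimension only separates tokens that are far apart.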

Why do the curves for positions 2, 10, and 20 look different?

In your plot the x-axis is the dimension index, so each curve is sin(pos / 10000^(2i / d_model)) evaluated across the dimensions for a fixed position pos. The per-dimension frequencies are the same for every position; what changes with the position is the phase, which is proportional to pos:

  • Position 2: the argument 2 / 10000^(2i / d_model) sweeps only a small range as i varies, so the curve shows few oscillations.
  • Position 10: the argument is five times larger at every dimension, so the curve crosses zero more often.
  • Position 20: the argument doubles again, and the curve oscillates fastest of the three.

So different positions are not assigned different frequencies; a larger position simply scales every dimension’s phase, which makes the curve plotted over the dimension axis look higher-frequency.
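This effect can be checked numerically. A short NumPy sketch (d_model = 512 is an assumed example width) that counts zero crossings of the sin() curve across the dimension axis, for the three positions from your plot:

```python
import numpy as np

d_model = 512

def sin_components(pos):
    """The sin() half of the standard encoding, plotted over the dimension axis."""
    i = np.arange(0, d_model, 2)             # even dimension indices
    return np.sin(pos / 10000 ** (i / d_model))

def sign_changes(x):
    """Rough oscillation count: how often the curve crosses zero."""
    return int(np.sum(np.sign(x[:-1]) != np.sign(x[1:])))

# A larger pos multiplies every dimension's phase pos / 10000**(2i/d),
# so the curve sweeps more multiples of pi and crosses zero more often.
for pos in (2, 10, 20):
    print(pos, sign_changes(sin_components(pos)))
```

Position 2 never leaves the first half-period (2 < π), while positions 10 and 20 cross zero at every multiple of π below them, so the counts grow with the position.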

Simple Analogy:

Think of the sinusoidal functions like the hands of a clock:

  • The minute hand moves slowly, covering large time intervals.
  • The second hand moves very quickly, covering finer, smaller intervals.

In the context of positional encoding:

  • The minute hand is like the low frequencies in the higher dimensions: it moves slowly, so it distinguishes positions that are far apart.
  • The second hand is like the high frequencies in the lower dimensions: it moves quickly, so it distinguishes neighbouring positions.

I hope this clears up the confusion! The main idea is that the lower dimensions capture fine-grained (local) positional differences, while the higher dimensions capture coarse (global) ones.

@Alanturner2 Sorry, but I didn’t expect to receive a ChatGPT-generated answer so quickly. I would prefer a human-written response.


I recommend you read this paper

[2104.08698] A Simple and Effective Positional Encoding for Transformers