Positional encoding

Hi,
My question is not directly related to Hugging Face; it is a general transformer-model question. Regarding the positional encoding: when I plot the sin() function for some even positions, e.g. 2, 10 and 20, I get the following plot:

According to the figure, for lower positions, like c = 2, the curve has a lower frequency compared to c = 10 or c = 20. I understand that at a given position we would like higher frequencies (faster changes) in the lower dimensions (x-axis) and lower frequencies in the higher dimensions, but I don’t understand the correlation between a change of position and the frequency of the curve.

Can someone explain that in a simple manner?


Sure! Let me explain it in a simple way.

The idea behind positional encoding in transformers is to inject some information about the position of each token in the sequence. Since transformers don’t inherently process sequential data in order, we need a way to tell them where each token is positioned relative to the others.

To achieve this, sinusoidal functions are used in the positional encoding because they have useful properties: they are bounded, require no learned parameters, and (as noted in the original Transformer paper) let the encoding of a position shifted by a fixed offset be expressed as a linear function of the original encoding.

How Positional Encoding Works:

  • Frequency of the sinusoidal functions: sine and cosine waves with different frequencies encode the positions; dimension pair i uses the angular frequency 1 / 10000^(2i / d_model).
    • Low frequencies correspond to higher dimensions (higher values of the dimension index i).
    • High frequencies correspond to lower dimensions (lower values of i).
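For concreteness, here is a minimal NumPy sketch of this encoding, following the formula from "Attention Is All You Need" (the sizes max_pos = 50 and d_model = 128 are arbitrary example choices):

```python
import numpy as np

def sinusoidal_pe(max_pos, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pe = np.zeros((max_pos, d_model))
    pos = np.arange(max_pos)[:, None]        # positions, shape (max_pos, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angle = pos / (10000 ** (i / d_model))   # frequency shrinks as i grows
    pe[:, 0::2] = np.sin(angle)              # sine in the even dimensions
    pe[:, 1::2] = np.cos(angle)              # cosine in the odd dimensions
    return pe

pe = sinusoidal_pe(max_pos=50, d_model=128)
# Dimension 0 oscillates once every 2*pi positions; the last dimension
# pair barely moves over the whole sequence.
```

Each row of `pe` is the vector added to the token embedding at that position.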

Why does this happen?

  1. High frequencies for low dimensions (x-axis): The sinusoidal functions with higher frequencies (more oscillations) are assigned to the lower dimensions. This is because we want the lower dimensions to capture fine-grained positional differences. These smaller (faster) oscillations allow the model to distinguish between positions that are close together.
  2. Low frequencies for high dimensions (x-axis): Conversely, the lower frequencies (fewer oscillations) are assigned to the higher dimensions. This reflects the idea that larger positional differences (i.e., tokens that are farther apart) need to be represented with more global, slower changes, which are captured by low-frequency sinusoidal waves.
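These two points can be checked numerically. A small sketch, assuming the standard encoding and an example width d_model = 512 (the positions 5, 6 and 5000 are arbitrary illustrations):

```python
import numpy as np

d_model = 512

def component(pos, i):
    """The sin() value at even dimension index i of the standard encoding."""
    return np.sin(pos / 10000 ** (i / d_model))

# Dimension 0 oscillates fast: neighbouring positions get clearly
# different values.
fine = abs(component(5, 0) - component(6, 0))

# Dimension 510 oscillates very slowly: neighbours look almost identical,
# but positions far apart still end up with different values.
coarse_near = abs(component(5, 510) - component(6, 510))
coarse_far = abs(component(5, 510) - component(5000, 510))

print(fine, coarse_near, coarse_far)
```

The fast dimension separates adjacent tokens; the slow dimension only separates tokens that are far apart.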

Why do the curves for positions 2, 10, and 20 look different?

In your plot the x-axis is the dimension index, so each curve is sin(pos / 10000^(2i / d_model)) evaluated across the dimensions for a fixed position pos. The per-dimension frequencies are the same for every position; what changes with the position is the phase, which is proportional to pos:

  • Position 2: the argument 2 / 10000^(2i / d_model) sweeps only a small range as i varies, so the curve shows few oscillations.
  • Position 10: the argument is five times larger at every dimension, so the curve crosses zero more often.
  • Position 20: the argument doubles again, and the curve oscillates fastest of the three.

So different positions are not assigned different frequencies; a larger position simply scales every dimension’s phase, which makes the curve plotted over the dimension axis look higher-frequency.
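This effect can be checked numerically. A short NumPy sketch (d_model = 512 is an assumed example width) that counts zero crossings of the sin() curve across the dimension axis, for the three positions from your plot:

```python
import numpy as np

d_model = 512

def sin_components(pos):
    """The sin() half of the standard encoding, plotted over the dimension axis."""
    i = np.arange(0, d_model, 2)             # even dimension indices
    return np.sin(pos / 10000 ** (i / d_model))

def sign_changes(x):
    """Rough oscillation count: how often the curve crosses zero."""
    return int(np.sum(np.sign(x[:-1]) != np.sign(x[1:])))

# A larger pos multiplies every dimension's phase pos / 10000**(2i/d),
# so the curve sweeps more multiples of pi and crosses zero more often.
for pos in (2, 10, 20):
    print(pos, sign_changes(sin_components(pos)))
```

Position 2 never leaves the first half-period (2 < π), while positions 10 and 20 cross zero at every multiple of π below them, so the counts grow with the position.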

Simple Analogy:

Think of the sinusoidal functions like the hands of a clock:

  • The minute hand moves slowly, covering large time intervals.
  • The second hand moves very quickly, covering finer, smaller intervals.

In the context of positional encoding:

  • The minute hand is like the low frequencies in the higher dimensions: it moves slowly, so it distinguishes positions that are far apart.
  • The second hand is like the high frequencies in the lower dimensions: it moves quickly, so it distinguishes neighbouring positions.

I hope this clears up the confusion! The main idea is that the lower dimensions capture fine-grained (local) positional differences, while the higher dimensions capture coarse (global) ones.

@Alanturner2 Sorry, but I didn’t expect to receive a ChatGPT-generated answer so quickly. I would prefer a human-written response.


I recommend you read this paper

[2104.08698] A Simple and Effective Positional Encoding for Transformers