Issues with Whisper Encoder: Positional Encoding

  1. The positional embedding in the Hugging Face implementation of Whisper is a learnable nn.Embedding layer, as opposed to OpenAI’s implementation, which uses a standard (fixed) sinusoidal positional embedding.
  2. The input to the model is assumed to be padded to 3000 frames (in both the Hugging Face and OpenAI implementations). Isn’t that suboptimal? Could we instead pad to the longest sequence in a batch?

@sanchit-gandhi can you help here?

Hey @sahuamrit,

  1. The models are equivalent - maybe @ArthurZ can shed light on why we use a learnable embed layer in Transformers?
  2. This is how OpenAI trained the model (pad/truncate audio inputs to 30 s, then compute log-Mel filter-bank features). You can read about why this works here: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. Padding to the longest sequence in the batch would require an attention mask, which is not how the model was trained. Hence, we pad/truncate to a fixed length.
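To make the contrast concrete, here is a minimal pure-Python sketch (not the actual transformers code) of the two padding strategies: fixed-length padding to 3000 frames, as Whisper's feature extraction does, versus dynamic longest-in-batch padding, which would additionally require an attention mask. The function names are illustrative, not from any library.

```python
# Illustrative sketch (hypothetical helpers, not the transformers implementation):
# fixed-length padding vs. dynamic (longest-in-batch) padding.

N_FRAMES = 3000  # 30 s of audio at 100 log-Mel frames per second

def pad_or_truncate(frames, target=N_FRAMES, pad_value=0.0):
    """Whisper-style: pad with pad_value, or truncate, to a fixed length."""
    if len(frames) >= target:
        return frames[:target]
    return frames + [pad_value] * (target - len(frames))

def pad_to_longest(batch, pad_value=0.0):
    """Dynamic padding: pad every example to the longest in the batch and
    return an attention mask marking real (1) vs. padded (0) positions."""
    longest = max(len(f) for f in batch)
    padded = [f + [pad_value] * (longest - len(f)) for f in batch]
    mask = [[1] * len(f) + [0] * (longest - len(f)) for f in batch]
    return padded, mask

batch = [[0.1] * 1200, [0.2] * 2500]

# Whisper-style: every example becomes exactly N_FRAMES long; no mask needed,
# because the model only ever sees one input length.
fixed = [pad_or_truncate(f) for f in batch]
assert all(len(f) == N_FRAMES for f in fixed)

# Dynamic padding: shorter tensors, but the model would need the mask to
# distinguish real frames from padding.
dynamic, mask = pad_to_longest(batch)
assert all(len(f) == 2500 for f in dynamic)
```

The trade-off is that fixed-length padding wastes compute on short inputs, but matches the single input shape the model saw during pre-training.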


  1. I think it was a choice made for simplicity; it is equivalent at inference, but it could indeed be problematic for training, so the weights should be frozen. We can open an issue on this to use register_buffer.
    Nice catch :hugs:
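As a sketch of what the register_buffer fix would look like: a buffer is part of the module's state_dict but is excluded from its trainable parameters, so the sinusoidal table cannot drift during fine-tuning. The EncoderStub class below is a hypothetical minimal module, and sinusoids is re-derived here following the formula in OpenAI's public Whisper code; this is an illustration, not the actual transformers fix.

```python
import math
import torch
import torch.nn as nn

def sinusoids(length, channels, max_timescale=10000):
    """Sinusoidal position table following OpenAI's Whisper code (deterministic)."""
    assert channels % 2 == 0
    log_timescale_increment = math.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, None] * inv_timescales[None, :]
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)

class EncoderStub(nn.Module):
    """Hypothetical minimal module: registering the table as a buffer keeps it
    in the state_dict but out of the trainable parameters."""
    def __init__(self, n_ctx=1500, n_state=384):
        super().__init__()
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))

enc = EncoderStub()
assert "positional_embedding" in enc.state_dict()  # saved/loaded with weights
assert len(list(enc.parameters())) == 0            # but never updated by the optimizer
```

With nn.Embedding, by contrast, the table shows up in parameters() and is updated by the optimizer unless explicitly frozen with requires_grad_(False).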

Hi @ArthurZ ,

I don’t think it will be the same even during inference. When you load the official Whisper weights, self.positional_embedding won’t be populated and will stay randomly initialized.

I think that since the demo involved fine-tuning, the nn.Embedding weights were learned.

I’m not really sure I follow; the weights are not random for the two different positional embeddings:

  • In the audio encoder we have self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)), which uses the sinusoids function and involves no randomness.
  • In the text decoder, the positional_embedding is learned.
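To see that the sinusoids function involves no randomness, here is a pure-Python re-derivation (my sketch, following the formula in OpenAI's code: the first half of each row is sin, the second half is cos). Calling it twice produces identical values, so loading the official weights cannot leave the encoder table "random".

```python
import math

def sinusoids(length, channels, max_timescale=10000):
    """Pure-Python sketch of the deterministic sinusoid table:
    row = [sin(pos * ts) for each timescale] + [cos(pos * ts) for each timescale]."""
    assert channels % 2 == 0
    half = channels // 2
    log_inc = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-log_inc * i) for i in range(half)]
    table = []
    for pos in range(length):
        scaled = [pos * ts for ts in inv_timescales]
        table.append([math.sin(x) for x in scaled] + [math.cos(x) for x in scaled])
    return table

a = sinusoids(4, 6)
b = sinusoids(4, 6)
assert a == b                         # deterministic: identical on every call
assert a[0][:3] == [0.0, 0.0, 0.0]    # sin(0) = 0 at position 0
assert a[0][3:] == [1.0, 1.0, 1.0]    # cos(0) = 1 at position 0
```

So the only learned positional table in Whisper is the decoder's; the encoder's is fully determined by (n_ctx, n_state) and is recomputed identically no matter which checkpoint is loaded.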