The input to the model is assumed to be padded to 3000 frames (in both the Hugging Face and OpenAI implementations). Isn’t that suboptimal? Could we instead pad to the longest sequence in a batch?
The models are equivalent - maybe @ArthurZ can shed light on why we use a learnable embed layer in Transformers?
This is how OpenAI trained the model: pad/truncate the audio inputs to 30s, then compute log-Mel filter bank features. You can read about why this works here: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. Padding only to the longest sequence in the batch would require an attention mask over the audio inputs, which is not how the model was trained. Hence, we pad/truncate to a fixed length.
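For reference, here's a minimal sketch (assuming the current `WhisperFeatureExtractor` API in `transformers`) showing that the feature extractor always emits a fixed number of frames, regardless of how short the audio is:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# A 5-second clip at 16 kHz (80_000 samples) -- well short of 30 s.
audio = np.random.randn(80_000).astype(np.float32)

features = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")

# The clip is padded to 30 s before the log-Mel features are computed,
# so the output should always be (batch, n_mels, 3000), e.g. (1, 80, 3000) here.
print(features.input_features.shape)
```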
I think it was a choice made for simplicity; it is equivalent at inference but might indeed be problematic for training, so the layer should be frozen. We can open an issue on this to use register_buffer.
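Until that lands, freezing the encoder's positional embedding during fine-tuning is a one-liner. A rough sketch (assuming the attribute is still called `embed_positions` on the encoder):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# The encoder's sinusoidal table lives in an nn.Embedding; turning off its
# gradient keeps it fixed during fine-tuning, which matches the effect of
# registering it as a buffer.
model.model.encoder.embed_positions.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing: {trainable:,}")
```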
Nice catch
I don’t think it will be the same even during inference. When you load the official Whisper weights, self.positional_encoding won’t be populated and will stay at its random initialization.
I think since the demo had to do with fine-tuning, the nn.Embedding got learned.
Not really sure I follow; the weights are not random for the two different positional embeddings.
In the audio encoder we have self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)), which uses the sinusoids function and so involves no randomness.
In the text decoder, the positional_embedding is a learned parameter.
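To make the contrast concrete, here's a small sketch of the two flavours side by side (the sinusoids function follows the standard sin/cos formulation used by OpenAI; the module and dimension names below are just illustrative):

```python
import torch


def sinusoids(length: int, channels: int, max_timescale: float = 10_000) -> torch.Tensor:
    """Deterministic sinusoidal positional table: no randomness involved."""
    assert channels % 2 == 0
    log_timescale_increment = torch.log(torch.tensor(float(max_timescale))) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length).float()[:, None] * inv_timescales[None, :]
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)


class TinyAudioEncoder(torch.nn.Module):
    def __init__(self, n_ctx: int = 1500, n_state: int = 384):
        super().__init__()
        # Fixed table: excluded from the optimizer, but saved in the state dict
        # and moved with the module (device/dtype).
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))


class TinyTextDecoder(torch.nn.Module):
    def __init__(self, n_ctx: int = 448, n_state: int = 384):
        super().__init__()
        # Learned table: a regular parameter, updated by the optimizer during training.
        self.positional_embedding = torch.nn.Parameter(torch.zeros(n_ctx, n_state))
```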