The input to the model is assumed to be padded to 3000 frames (in both the Hugging Face and OpenAI implementations). Isn’t that suboptimal? Could we instead pad to the longest sequence in a batch?
The models are equivalent - maybe @ArthurZ can shed light on why we use a learnable embed layer in Transformers?
This is how OpenAI trained the model: pad/truncate the audio inputs to 30s, then compute log-Mel filter bank features. You can read about why this works here: Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers. Padding only to the longest sequence in the batch would require an attention mask over the audio inputs, which is not how the model was trained. Hence, we pad/truncate to a fixed length.
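For reference, here's a minimal sketch (assuming the current `WhisperFeatureExtractor` API in `transformers`) showing that the feature extractor always emits a fixed number of frames, regardless of how short the audio is:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# A 5-second clip at 16 kHz (80_000 samples) -- well short of 30 s.
audio = np.random.randn(80_000).astype(np.float32)

features = feature_extractor(audio, sampling_rate=16_000, return_tensors="pt")

# The clip is padded to 30 s before the log-Mel features are computed,
# so the output should always be (batch, n_mels, 3000), e.g. (1, 80, 3000) here.
print(features.input_features.shape)
```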
I think it was a choice made for simplicity; it is equivalent at inference but might indeed be problematic for training, so the layer should be frozen. We can open an issue on this to use register_buffer.
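Until that lands, freezing the encoder's positional embedding during fine-tuning is a one-liner. A rough sketch (assuming the attribute is still called `embed_positions` on the encoder):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# The encoder's sinusoidal table lives in an nn.Embedding; turning off its
# gradient keeps it fixed during fine-tuning, which matches the effect of
# registering it as a buffer.
model.model.encoder.embed_positions.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing: {trainable:,}")
```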
Nice catch
I don’t think it will be the same even during inference. When you load the official Whisper weights, self.positional_encoding won’t be populated and will stay at its random initialization.
I think since the demo had to do with fine-tuning, the nn.Embedding got learned.
Not really sure I follow; the weights are not random for the two different positional embeddings.
In the audio encoder we have self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)), which uses the sinusoids function and so involves no randomness.
In the text decoder, the positional_embedding is a learned parameter.
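To make the contrast concrete, here's a small sketch of the two flavours side by side (the sinusoids function follows the standard sin/cos formulation used by OpenAI; the module and dimension names below are just illustrative):

```python
import torch


def sinusoids(length: int, channels: int, max_timescale: float = 10_000) -> torch.Tensor:
    """Deterministic sinusoidal positional table: no randomness involved."""
    assert channels % 2 == 0
    log_timescale_increment = torch.log(torch.tensor(float(max_timescale))) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length).float()[:, None] * inv_timescales[None, :]
    return torch.cat([scaled_time.sin(), scaled_time.cos()], dim=1)


class TinyAudioEncoder(torch.nn.Module):
    def __init__(self, n_ctx: int = 1500, n_state: int = 384):
        super().__init__()
        # Fixed table: excluded from the optimizer, but saved in the state dict
        # and moved with the module (device/dtype).
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))


class TinyTextDecoder(torch.nn.Module):
    def __init__(self, n_ctx: int = 448, n_state: int = 384):
        super().__init__()
        # Learned table: a regular parameter, updated by the optimizer during training.
        self.positional_embedding = torch.nn.Parameter(torch.zeros(n_ctx, n_state))
```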