- The positional embedding in the Hugging Face implementation of Whisper is an nn.Embedding layer (learnable), as opposed to OpenAI's implementation, which uses a standard sinusoidal positional embedding.
- The input to the model is assumed to be padded to 3000 frames (in both the Hugging Face and OpenAI implementations). Isn't that suboptimal? Could we instead pad to the longest sequence in the batch? See the sketch below.
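Here is a small sketch of the padding behaviour I mean (assuming the transformers library; openai/whisper-tiny is just an example checkpoint): the feature extractor pads every input to the full 30-second window, i.e. 3000 mel frames, no matter how short the raw audio is.

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

# 5 seconds of (silent) audio at 16 kHz
short_audio = np.zeros(5 * 16000, dtype=np.float32)

features = feature_extractor(short_audio, sampling_rate=16000, return_tensors="np")
print(features.input_features.shape)  # (1, 80, 3000) -> padded to the full 30 s window
```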
@sanchit-gandhi can you help here?
Hi @ArthurZ,
I don't think it will be the same even during inference. When you load the official Whisper weights, self.positional_encoding won't contain anything and will be random.
I think that since the demo had to do with fine-tuning, the nn.Embedding got learned.
Not really sure I follow; the weights are not random for the two different positional embeddings.
- In the audio encoder we have self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state)), which uses the sinusoids function and involves no randomness.
- In the text decoder, the positional_embedding is learned; see the sketch below.
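Here is a trimmed-down sketch of the relevant pieces (following the structure of openai/whisper's model.py, not the complete code) showing why only the decoder side involves learned positional weights:

```python
import numpy as np
import torch
from torch import nn


def sinusoids(length, channels, max_timescale=10000):
    """Deterministic sinusoidal positional embedding -- no randomness involved."""
    assert channels % 2 == 0
    log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
    inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
    scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
    return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)


class AudioEncoder(nn.Module):
    def __init__(self, n_ctx: int, n_state: int):
        super().__init__()
        # fixed buffer: identical every time the model is built (deterministic)
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))


class TextDecoder(nn.Module):
    def __init__(self, n_ctx: int, n_state: int):
        super().__init__()
        # learnable parameter: initialised randomly, then overwritten by the checkpoint
        self.positional_embedding = nn.Parameter(torch.empty(n_ctx, n_state))
```

So neither embedding ends up random after loading the official weights: the decoder's positional_embedding is read from the checkpoint, and the encoder's is fully determined by sinusoids.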