Whisper: Forward Hook on final_layer_norm vs out.encoder_hidden_states

Hi everyone,
I'm working with WhisperModel.from_pretrained("openai/whisper-base") and I have a question about the encoder hidden states.

I have extracted the hidden layer outputs via a forward hook on the final_layer_norm of the encoder blocks, and also from the built-in encoder_hidden_states (out.encoder_hidden_states) of the Hugging Face model. Comparing the two, I get different values. Could anyone explain what makes them different? Are the residual connection and layer normalization not applied to one of them?

Also, am I right that the first vector of out.encoder_hidden_states is the output of the embedding step (conv stem plus positional embeddings), i.e. the input to the first encoder block? There's a small check for this after the sample code below.

Sample code:

import torch
from transformers import WhisperModel

model = WhisperModel.from_pretrained("openai/whisper-base")

# One slot per encoder block for the hook captures
hidden_states_hf = [None] * len(model.encoder.layers)

for i, block in enumerate(model.encoder.layers):
    # Capture the output tensor of each block's final_layer_norm
    block.final_layer_norm.register_forward_hook(
        lambda _module, _inputs, outputs, index=i: hidden_states_hf.__setitem__(index, outputs)
    )

# Two decoder start tokens, just enough to run a forward pass
tokens = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id

model.eval()
with torch.no_grad():
    # mel: log-mel input features, shape (1, 80, 3000)
    out = model(mel, decoder_input_ids=tokens, output_attentions=True,
                output_hidden_states=True)
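
This is roughly how I compared the two sets of tensors after the forward pass above (a small sketch; I'm assuming encoder_hidden_states[0] is the embedding output, so block i lines up with index i + 1):

# Run after the sample code above: hidden_states_hf holds the hook captures,
# out.encoder_hidden_states holds the built-in states (index 0 = embedding output).
for i in range(len(model.encoder.layers)):
    hooked = hidden_states_hf[i]                     # output of block i's final_layer_norm
    built_in = out.encoder_hidden_states[i + 1]      # hidden state after block i
    max_diff = (hooked - built_in).abs().max().item()
    print(f"block {i}: max abs diff = {max_diff:.6f}")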
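
And this is the sanity check I had in mind for the first entry (a rough sketch, assuming the encoder exposes conv1, conv2, and embed_positions, and that dropout does nothing in eval mode):

import torch.nn.functional as F

# Reconstruct the embedding output manually and compare with encoder_hidden_states[0].
with torch.no_grad():
    x = F.gelu(model.encoder.conv1(mel))             # conv stem, first layer
    x = F.gelu(model.encoder.conv2(x))               # conv stem, second (strided) layer
    x = x.permute(0, 2, 1)                           # (batch, frames, d_model)
    x = x + model.encoder.embed_positions.weight     # add learned positional embeddings
    print(torch.allclose(x, out.encoder_hidden_states[0], atol=1e-5))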