I'm working with WhisperModel.from_pretrained("openai/whisper-base") and I have a question about its encoder hidden states.
I extracted the hidden-layer outputs in two ways: via a forward hook on the final_layer_norm of each encoder block, and via the model's built-in out.encoder_hidden_states. Comparing the two, I get different values. Could anyone explain what makes them differ? Is the residual connection or the layer normalization not applied in one of them?
Also, am I correct that the first entry of out.encoder_hidden_states is the output of the embedding stage (after the positional embeddings are added), before any encoder block?
Here is how I register the hooks and run the model (`mel` is my preprocessed log-mel spectrogram input):

```python
import torch
from transformers import WhisperModel

model = WhisperModel.from_pretrained("openai/whisper-base")
model.eval()

num_layers = len(model.encoder.layers)
hidden_states_hf = [None] * num_layers

# Capture the output of final_layer_norm in every encoder block.
for i, block in enumerate(model.encoder.layers):
    block.final_layer_norm.register_forward_hook(
        lambda _module, _inputs, outputs, index=i: hidden_states_hf.__setitem__(index, outputs)
    )

tokens = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id
with torch.no_grad():
    out = model(
        mel,
        decoder_input_ids=tokens,
        output_attentions=True,
        output_hidden_states=True,
    )
```
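My working hypothesis (I may be wrong about Whisper's internals) is that the encoder layers are pre-norm, so final_layer_norm normalizes the *input* to the MLP sub-block rather than producing the block's output; the residual add and the MLP come after it. A minimal sketch with plain torch and made-up dimensions, mimicking the structure I believe each encoder layer has:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d = 8  # toy hidden size, arbitrary for illustration

# Toy pre-norm block: x -> x + attn(norm1(x)) -> x + mlp(norm2(x)).
# `final_norm` plays the role I assume final_layer_norm plays in Whisper.
attn_norm = nn.LayerNorm(d)
attn = nn.Linear(d, d)        # stand-in for self-attention
final_norm = nn.LayerNorm(d)
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

x = torch.randn(2, 5, d)
x = x + attn(attn_norm(x))   # attention sub-block with residual
normed = final_norm(x)       # what a hook on final_layer_norm would see
block_out = x + mlp(normed)  # what the per-layer hidden state would record

# The two tensors are genuinely different under this structure.
print(torch.allclose(normed, block_out))
```

If that structure is right, the hooked tensor and the corresponding entry of encoder_hidden_states would never be expected to match, which would explain the discrepancy I'm seeing.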