Is it possible to use WhisperModel for an audio classification task?

I am wondering whether it is possible to use WhisperModel as a backbone for transfer learning on a speech classification task.

If possible, I would like to know how to connect a classification head to the outputs of the model.

For example, the output below has five keys:

import torch
from transformers import AutoModel, AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("openai/whisper-small")
model = AutoModel.from_pretrained("openai/whisper-small")

# `audio` is a 16 kHz mono waveform (e.g. loaded with librosa or torchaudio)
features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

output = model(
    features.input_features,
    decoder_input_ids=torch.tensor([[model.config.decoder_start_token_id]]),
    output_hidden_states=True,
)

# odict_keys(['last_hidden_state', 'past_key_values', 'decoder_hidden_states', 'encoder_last_hidden_state', 'encoder_hidden_states'])
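For reference, here is how I have been inspecting the tensor outputs; the shape comments reflect my understanding rather than anything documented:

# Encoder output: one hidden vector per audio frame
print(output.encoder_last_hidden_state.shape)  # (batch, n_frames, d_model)?
# Decoder output: one hidden vector per decoder input token
print(output.last_hidden_state.shape)          # (batch, decoder_len, d_model)?
# Per-layer hidden states: embeddings plus one entry per layer
print(len(output.encoder_hidden_states), len(output.decoder_hidden_states))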

If I can attach a classification head, which of these outputs should it be connected to?
Also, what do these outputs correspond to, and how do they map onto the blocks in the architecture diagram in the Whisper paper?
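For concreteness, here is a rough sketch of what I have in mind: keep only the encoder, mean-pool its last hidden state over time, and feed that into a linear head. The class name, the pooling choice, and the head are my own guesses, not anything from the paper or the library:

import torch.nn as nn
from transformers import AutoModel

class WhisperEncoderClassifier(nn.Module):
    """Hypothetical wrapper: Whisper encoder + mean pooling + linear head."""

    def __init__(self, num_labels):
        super().__init__()
        # Keep only the encoder; the decoder is for text generation.
        self.encoder = AutoModel.from_pretrained("openai/whisper-small").encoder
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features):
        # (batch, n_frames, d_model) -> mean-pool time axis -> logits
        hidden = self.encoder(input_features).last_hidden_state
        return self.classifier(hidden.mean(dim=1))

Training would then be something like logits = WhisperEncoderClassifier(num_labels=5)(features.input_features) followed by a standard cross-entropy loss. Does that seem like a reasonable place to attach the head, or should the decoder outputs be used instead?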

Thanks!