Is it possible to use WhisperModel for transfer learning on a speech classification task?
If so, I would like to know how to connect a classification head to the model's outputs.
For example, the output of the following call has five keys:
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("openai/whisper-small")
# `features` is assumed to be the feature extractor's output for one audio clip
output = model(
    torch.tensor(features['input_features']),
    decoder_input_ids=torch.tensor([[1, 1]]) * model.config.decoder_start_token_id,
    output_hidden_states=True,
)
print(output.keys())
# odict_keys(['last_hidden_state', 'past_key_values', 'decoder_hidden_states', 'encoder_last_hidden_state', 'encoder_hidden_states'])
If a classification head can be attached, which of these outputs should it be connected to?
Also, what are these outputs and how do they relate to the blocks in the diagram in the paper?
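To make the question concrete, here is a rough sketch of the kind of head I have in mind. It assumes the head goes on the encoder output, which is exactly the part I am unsure about. `MeanPoolClassifier` and `num_labels=4` are names and values I made up for illustration, and the random tensor is only a stand-in for `output.encoder_last_hidden_state`, which I believe has shape (batch, 1500, 768) for whisper-small on 30-second inputs.

```python
import torch
import torch.nn as nn


class MeanPoolClassifier(nn.Module):
    """Hypothetical head: average over time frames, then one linear layer."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size)
        pooled = hidden_states.mean(dim=1)  # (batch, hidden_size)
        return self.proj(pooled)            # (batch, num_labels)


# Stand-in for output.encoder_last_hidden_state (assumed shape, see above).
encoder_states = torch.randn(2, 1500, 768)

head = MeanPoolClassifier(hidden_size=768, num_labels=4)
logits = head(encoder_states)
print(logits.shape)
```

With the real model, I would pass `output.encoder_last_hidden_state` in place of `encoder_states`, but I do not know whether that is the right tensor to use.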
Thanks!