Choose number of labels in WhisperForAudioClassification

I would like to fine-tune Whisper for a binary audio classification task, so I'm using the WhisperForAudioClassification class and loading the pre-trained model like this:

from transformers import WhisperForAudioClassification

model = WhisperForAudioClassification.from_pretrained("openai/whisper-tiny")

The resulting architecture is as follows:

WhisperForAudioClassification(
  (encoder): WhisperEncoder(
    (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
    (embed_positions): Embedding(1500, 384)
    (layers): ModuleList(
      (0-3): 4 x WhisperEncoderLayer(
        (self_attn): WhisperAttention(
          (k_proj): Linear(in_features=384, out_features=384, bias=False)
          (v_proj): Linear(in_features=384, out_features=384, bias=True)
          (q_proj): Linear(in_features=384, out_features=384, bias=True)
          (out_proj): Linear(in_features=384, out_features=384, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=384, out_features=1536, bias=True)
        (fc2): Linear(in_features=1536, out_features=384, bias=True)
        (final_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
  )
  (projector): Linear(in_features=384, out_features=256, bias=True)
  (classifier): Linear(in_features=256, out_features=2, bias=True)
)

The issue here is that the last layer outputs two labels. How do I change the last layer so that it outputs just a single one? I have already tried passing the keyword argument num_labels=1, but it seems to be ignored.
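Concretely, the call I tried looked roughly like this; the classifier head still seemed to come out with two outputs:

model = WhisperForAudioClassification.from_pretrained("openai/whisper-tiny", num_labels=1)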

The documentation mentions a config argument that can be passed to from_pretrained, which might solve my problem, but I'm not sure exactly what to put there.

After fiddling around, I found a possible solution:

1 - Copy the config from the pre-trained model:

config = WhisperForAudioClassification.from_pretrained('openai/whisper-tiny').config

2 - Edit the id2label parameter:

>>> config.id2label
{0: 'LABEL_0', 1: 'LABEL_1'}

config.id2label = {0: 'prob'}

3 - Reload the model with the new config:

model = WhisperForAudioClassification.from_pretrained('openai/whisper-tiny', config=config)

This is the result; the classifier now has a single output, as expected.

WhisperForAudioClassification(
  (encoder): WhisperEncoder(
    (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
    (embed_positions): Embedding(1500, 384)
    (layers): ModuleList(
      (0-3): 4 x WhisperEncoderLayer(
        (self_attn): WhisperAttention(
          (k_proj): Linear(in_features=384, out_features=384, bias=False)
          (v_proj): Linear(in_features=384, out_features=384, bias=True)
          (q_proj): Linear(in_features=384, out_features=384, bias=True)
          (out_proj): Linear(in_features=384, out_features=384, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=384, out_features=1536, bias=True)
        (fc2): Linear(in_features=1536, out_features=384, bias=True)
        (final_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
  )
  (projector): Linear(in_features=384, out_features=256, bias=True)
  (classifier): Linear(in_features=256, out_features=1, bias=True)
)
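
For completeness, here is the whole workaround in one runnable snippet. As far as I can tell the trick works because num_labels is derived from len(config.id2label); I also reset label2id here to keep the mapping consistent, although that does not affect the head size.

from transformers import WhisperForAudioClassification

# Grab the default config, then shrink the label map to a single entry
config = WhisperForAudioClassification.from_pretrained("openai/whisper-tiny").config
config.id2label = {0: "prob"}    # num_labels is len(id2label), so this makes it 1
config.label2id = {"prob": 0}    # optional: keep the reverse mapping in sync

# Reload the model with the modified config; the classification head is
# freshly initialised with a single output unit
model = WhisperForAudioClassification.from_pretrained("openai/whisper-tiny", config=config)
print(model.classifier)  # should show Linear(in_features=256, out_features=1, bias=True)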

Just use num_labels when defining the model, e.g.:

model = WhisperForAudioClassification.from_pretrained(model_name, num_labels=2)
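
To confirm the keyword is picked up, a quick sanity check (assuming a reasonably recent transformers release, and using "openai/whisper-tiny" in place of model_name) is to inspect the head after loading:

from transformers import WhisperForAudioClassification

model = WhisperForAudioClassification.from_pretrained("openai/whisper-tiny", num_labels=2)
print(model.config.num_labels)  # 2
print(model.classifier)         # Linear(in_features=256, out_features=2, bias=True)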