I would like to fine-tune Whisper for a binary audio classification task, so I'm using the WhisperForAudioClassification
class and loading the pre-trained model like this:
WhisperForAudioClassification.from_pretrained("openai/whisper-tiny")
The resulting architecture is as follows:
WhisperForAudioClassification(
  (encoder): WhisperEncoder(
    (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
    (embed_positions): Embedding(1500, 384)
    (layers): ModuleList(
      (0-3): 4 x WhisperEncoderLayer(
        (self_attn): WhisperAttention(
          (k_proj): Linear(in_features=384, out_features=384, bias=False)
          (v_proj): Linear(in_features=384, out_features=384, bias=True)
          (q_proj): Linear(in_features=384, out_features=384, bias=True)
          (out_proj): Linear(in_features=384, out_features=384, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=384, out_features=1536, bias=True)
        (fc2): Linear(in_features=1536, out_features=384, bias=True)
        (final_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
  )
  (projector): Linear(in_features=384, out_features=256, bias=True)
  (classifier): Linear(in_features=256, out_features=2, bias=True)
)
The issue here is that the last layer (classifier) outputs two labels. How do I change the last layer so that it outputs just a single one? I have already tried passing the keyword argument num_labels=1 to from_pretrained,
but it seems to be ignored.
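For reference, this is the call I tried (a minimal sketch of my attempt; the num_labels keyword is what seems to have no effect):

```python
from transformers import WhisperForAudioClassification

# Attempt: pass num_labels directly to from_pretrained.
# In my case the classifier head still comes out with out_features=2.
model = WhisperForAudioClassification.from_pretrained(
    "openai/whisper-tiny",
    num_labels=1,
)
```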
The documentation mentions a config
argument that can be passed to the from_pretrained
method, which sounds like it could solve my problem, but I'm not sure what to put there exactly.
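Concretely, is something along these lines the intended use of the config argument? This is just my guess; overriding num_labels on the loaded config is an assumption on my part, not something I've confirmed works:

```python
from transformers import WhisperConfig, WhisperForAudioClassification

# Guess: load the pretrained config, override the label count,
# then pass the modified config when loading the weights.
config = WhisperConfig.from_pretrained("openai/whisper-tiny")
config.num_labels = 1  # assumption: this should shrink the classifier head

model = WhisperForAudioClassification.from_pretrained(
    "openai/whisper-tiny",
    config=config,
)
```

Is this the right pattern, or does the classifier head need to be replaced manually after loading?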