Collapsing Wav2Vec2 pretraining loss

I’m trying to pretrain a Wav2Vec2 model based on the example given here.

I was initially getting a contrastive loss like the graph on the left, which seemed very slow, so I upped the learning rate and got the graph on the right after only a few steps.

I’m not familiar with the nuts and bolts of contrastive loss, but this came as a bit of a surprise and I was wondering if anyone could help me understand it.

For both attempts, the batch size (with accumulation) is 32, the number of epochs is 20, and the number of warmup steps is 1200.


In the end, the solution was to set return_attention_mask to True in the feature extractor, or to use a pretrained feature extractor and model that prefer attention masks (i.e. not wav2vec2-base).
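For anyone hitting the same thing, here is a minimal sketch of the first option, assuming the feature extractor is built from scratch with the transformers library. The two random clips and their lengths are made up purely for illustration; the other constructor values are the usual 16 kHz settings, not necessarily the ones from my run.

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# Build a feature extractor that returns an attention mask alongside the
# padded inputs. return_attention_mask=True is the setting from the fix;
# the remaining values are assumed 16 kHz defaults.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# Two dummy clips of different lengths (hypothetical data, 0.5 s and 1 s).
rng = np.random.default_rng(0)
short_clip = rng.standard_normal(8000).astype(np.float32)
long_clip = rng.standard_normal(16000).astype(np.float32)

# Pad to the longest clip in the batch; the mask marks real samples with 1
# and padded samples with 0.
batch = feature_extractor(
    [short_clip, long_clip],
    sampling_rate=16000,
    padding=True,
    return_tensors="np",
)

print(batch["input_values"].shape)
print(batch["attention_mask"].sum(axis=1))  # real samples per clip
```

With the mask present in the batch, the padded part of the shorter clip can be told apart from real audio further down the pipeline, which is what was missing when the mask was switched off.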
