Collapsing Wav2Vec2 pretraining loss

I’m trying to pretrain a Wav2Vec2 model based on the example given here.

I was initially getting a contrastive loss like the graph on the left, which seemed very slow, so I increased the learning rate and, after only a few steps, got the graph on the right.

I’m not familiar with the nuts and bolts of contrastive loss, so this came as a bit of a surprise; I was wondering if anyone could help me understand what’s happening.

The batch size (with accumulation) is 32, the number of epochs is 20, and the warmup steps are 1200 for both attempts.


The solution in the end was to set return_attention_mask to True in the feature extractor, or use a pretrained feature extractor and model that prefers attention masks (i.e. not wav2vec2-base).
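For anyone hitting the same issue, here is a minimal sketch of the fix, assuming the Hugging Face transformers API (the constructor arguments below are the library defaults except for `return_attention_mask`):

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# Enable attention masks so that padded positions can be excluded
# from normalization and from the contrastive loss.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,  # the fix: defaults to False, as in wav2vec2-base
)

# Two clips of different lengths, so padding actually kicks in.
clips = [np.random.randn(16000), np.random.randn(8000)]
batch = feature_extractor(
    clips,
    sampling_rate=16000,
    padding=True,
    return_tensors="np",
)

print(batch["input_values"].shape)    # (2, 16000) after padding
print(batch["attention_mask"].shape)  # (2, 16000): one entry per audio sample
```

The mask is 1 over real samples and 0 over padding, so the second clip's mask is zero from index 8000 onward.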


Hi, thank you for sharing your experience and the solution. Concerning the attention mask: do you pass the attention mask returned by the feature extractor directly to the model (at the sample level, not the frame level, right)? Were you able to check the shapes?

Thanks
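For what it's worth, the sample-level vs. frame-level distinction can be sketched numerically: the feature extractor's mask has one entry per raw audio sample, and the model downsamples it internally through its convolutional feature encoder. A small sketch, assuming the default wav2vec2-base conv stack (kernel sizes and strides from the default Wav2Vec2Config):

```python
def conv_output_length(length: int, kernel: int, stride: int) -> int:
    # Length formula for a 1-D convolution with no padding:
    # floor((length - kernel) / stride) + 1
    return (length - kernel) // stride + 1

# Defaults from Wav2Vec2Config (wav2vec2-base): seven conv layers.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

def frame_level_length(num_samples: int) -> int:
    """Map a sample-level length to the frame-level length the encoder sees."""
    length = num_samples
    for k, s in zip(KERNELS, STRIDES):
        length = conv_output_length(length, k, s)
    return length

# One second of 16 kHz audio -> 49 encoder frames, so a (batch, 16000)
# sample-level mask corresponds to a (batch, 49) frame-level mask.
print(frame_level_length(16000))  # 49
```

So you only ever hand the model the sample-level mask from the feature extractor; the frame-level one is derived inside the model.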