How much compute are we expected to need in order to fine-tune the Wav2Vec2 XLSR model?

I just tried running the fine-tuning code, plus some minor modifications, on an EC2 instance with a V100, and it just wasn’t enough, even after reducing the batch size.

What are your experiences with the big Wav2Vec2 models, especially the multilingual XLSR model?

I am also trying to fine-tune the multilingual XLSR model. Could you share how big your dataset is and your training details (elapsed time, number of epochs, etc.)? I’m guessing your problem is computational cost when you say the V100 wasn’t enough.

I forgot to update this thread. My main problem was the improper segmentation of some audio files. Once I segmented them correctly, training ran just fine.
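The thread doesn’t show the actual segmentation code, but the fix amounts to capping clip length before feeding audio to the model. A minimal pure-Python sketch (function name, 15-second cap, and the stand-in waveform are all illustrative, not the poster’s actual values):

```python
def segment_samples(samples, sample_rate=16_000, max_seconds=15.0):
    """Split a mono waveform (a sequence of samples) into chunks no longer
    than max_seconds, so individual training examples stay memory-friendly."""
    max_len = int(sample_rate * max_seconds)
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]

clip = [0.0] * (16_000 * 40)  # 40 s of silence as a stand-in waveform
chunks = segment_samples(clip)
print([len(c) / 16_000 for c in chunks])  # → [15.0, 15.0, 10.0]
```

Overly long clips are a common cause of out-of-memory errors here, since a single multi-minute recording can blow past GPU memory on its own regardless of batch size.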

I’m fine-tuning the model on a very small amount of data, so these numbers might not mean much to you (on a V100):

  • Batch size of 32
  • 16kHz sampling rate
  • Mixed precision
  • ~2mins per epoch
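For anyone hitting the same memory wall, the settings above roughly correspond to a `transformers` configuration like the following. This is a sketch, not the poster’s actual script: the checkpoint name is the public XLSR-53 one, `output_dir` and `num_train_epochs` are made up, and `gradient_checkpointing_enable()` / `group_by_length` are common memory-saving tricks rather than something confirmed in the thread.

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model.freeze_feature_extractor()       # don't backprop through the CNN front-end
model.gradient_checkpointing_enable()  # trade compute for a large memory saving

training_args = TrainingArguments(
    output_dir="./xlsr-finetuned",    # illustrative path
    per_device_train_batch_size=32,   # batch size of 32, as above
    fp16=True,                        # mixed precision, as above
    num_train_epochs=30,              # illustrative
    group_by_length=True,             # batch similar-length clips to cut padding
)
```

If 32 still doesn’t fit, lowering `per_device_train_batch_size` and compensating with `gradient_accumulation_steps` keeps the effective batch size the same.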

We are organizing a “fine-tuning XLSR-53” event. Check this announcement: [Open-to-the-community] XLSR-Wav2Vec2 Fine-Tuning Week for Low-Resource Languages. It would be awesome if you want to participate 🙂


How did you calculate the time per epoch? For how many epochs did you train, and how large was your training set? Also, how much audio (total duration) was in your validation set?
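On the first question, time per epoch follows mechanically from dataset size, batch size, and per-step time. A small sketch with purely illustrative numbers (2,048 clips and 1.875 s per optimizer step are assumptions, not figures from this thread):

```python
def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    # Ceiling division: the last, possibly partial, batch still counts as a step.
    return -(-num_examples // batch_size)

def epoch_minutes(num_examples: int, batch_size: int, seconds_per_step: float) -> float:
    return steps_per_epoch(num_examples, batch_size) * seconds_per_step / 60

# e.g. 2,048 clips at batch size 32 and ~1.875 s per step:
print(epoch_minutes(2_048, 32, 1.875))  # → 2.0 (minutes per epoch)
```

In practice the Trainer’s progress bar reports seconds per step directly, so multiplying by the number of steps per epoch gives the same estimate.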