Bemba ASR: Fine-Tuning Wav2Vec2

Hello everybody, I’m planning on fine-tuning XLSR-Wav2Vec2 on Bemba. I’m happy to collaborate with anybody willing to join me, so I’ve created this thread where we can share progress and discuss issues. :slight_smile:

Summary dataset details:

  • Language: Bemba (or Icibemba), a language of Zambia
  • Dataset: BembaSpeech (if interested, you can check out the paper for more details)
  • Duration: 24 hours of read speech in total, already preprocessed and partitioned into train, dev, and test sets
  • Size: 2.8 GB
  • Subset [optional]: there is also a 17-hour subset of BembaSpeech here, consisting of audio files shorter than 10 seconds

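To make the subset cutoff concrete: whether a clip belongs in the 17-hour subset comes down to its duration, which is just frame count divided by sampling rate. Below is a minimal sketch with hypothetical helper names (`clip_duration_seconds`, `in_short_subset` are mine, not from the dataset’s tooling), assuming 16 kHz mono audio, which is the rate wav2vec2 models expect.

```python
# Hypothetical helpers: decide whether a clip qualifies for the <10 s subset.
# Duration of a mono clip = frame count / sampling rate.

def clip_duration_seconds(num_frames: int, sampling_rate: int = 16_000) -> float:
    """Duration of a mono audio clip in seconds."""
    return num_frames / sampling_rate

def in_short_subset(num_frames: int, sampling_rate: int = 16_000,
                    max_seconds: float = 10.0) -> bool:
    """True if the clip is shorter than max_seconds (the subset's cutoff)."""
    return clip_duration_seconds(num_frames, sampling_rate) < max_seconds

# Example: a 16 kHz clip with 96,000 frames lasts 6 seconds.
print(in_short_subset(96_000))   # 6.0 s  -> True
print(in_short_subset(200_000))  # 12.5 s -> False
```

The same predicate can be handed to a dataset-filtering step if you want to rebuild the subset yourself.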

So far, I quickly tried fine-tuning on the 17-hour subset using the parameters that came with @patrickvonplaten’s notebook, but ran into a vanishing/exploding gradient problem. So yeah, I need to tweak a few parameters. Get in touch if you are willing to join in… I’m happy to collaborate with anyone in the community.



Update on my training progress:

So I have been training using the 17-hour [optional] subset of BembaSpeech: train, dev, and test.

To test the waters, I first trained using the dev and test sets only. The training went on without a problem. However, when I included the full training set and evaluated on the dev set… I started getting NaN results.

Hey Claytone, I would suggest playing around a bit with learning_rate and dropout. I’d try both reducing and increasing the learning rate, and reducing dropout if you keep getting NaN for the training loss.


Thank you @patrickvonplaten, I will try that too. Is there a restriction on the maximum and minimum duration (length) of audio files the model accepts? Just in case…

There is no real restriction, but for input samples longer than ~2 minutes you might get out-of-memory errors. See this related issue: can't allocate memory error with wav2vec2 · Issue #10366 · huggingface/transformers · GitHub
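A practical way to sidestep those OOM errors is to drop overlong clips before training. This is a minimal sketch with a made-up predicate name (`short_enough`) and toy records; with the `datasets` library the same predicate could be passed to a `.filter()` call over the real audio columns.

```python
# Sketch: drop clips longer than a cutoff before training, to avoid OOM.
MAX_SECONDS = 120.0  # ~2 minutes, per the OOM report linked above

def short_enough(num_frames: int, sampling_rate: int) -> bool:
    """True if the clip's duration is at most MAX_SECONDS."""
    return num_frames / sampling_rate <= MAX_SECONDS

# Toy records: (frame_count, sampling_rate) pairs standing in for real clips.
clips = [(160_000, 16_000),    # 10 s  -> keep
         (1_600_000, 16_000),  # 100 s -> keep
         (2_400_000, 16_000)]  # 150 s -> drop

kept = [c for c in clips if short_enough(*c)]
print(len(kept))  # 2
```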