Improving performance of Wav2Vec2 fine tuning with word piece vocabulary

SamuelAzran · May 21, 2021, 1:31pm

Hello,

I’m fine-tuning XLSR-Wav2Vec2 on 200+ hours of speech in a language that was not in the original pretraining data.

The training progresses nicely, but when it reaches a WER of about 40 it starts to overfit: WER stops improving, and train loss keeps decreasing while eval loss goes up.

I’ve tried increasing some of the SpecAugment parameters, but it only helped a bit.

I’ve noticed that with the Speechbrain implementation I’m getting slightly better results (at the expense of training stability), and I was wondering whether that is due to the larger vocabulary they use there. Has anyone tried using a tokenizer whose vocabulary contains subwords and words in addition to characters? I couldn’t find any experiment that uses one with the Huggingface Transformers W2V2.

I see in the Wav2Vec 2 paper they say that:

We expect performance gains by switching to a seq2seq architecture and a
word piece vocabulary.
https://arxiv.org/pdf/2006.11477.pdf

Any suggestions on how to do that with Huggingface Transformers?

P.S. my dataset is noisy and not super clean.

Any help or suggestions would be much appreciated.

Samuel

tadf · May 26, 2021, 7:07am

Not sure how I’d switch to a seq2seq architecture, but for word piece, I think you just need to change the vocab passed to the Wav2Vec2CTCTokenizer. Instead of the individual alphabet characters used for the vocab in the XLSR example, you’d need to use the wordpiece/BPE algorithm on your language text data and pass that through.
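
As a rough illustration of the vocabulary-building step, here is a minimal pure-Python sketch of BPE merge learning that produces a token-to-id mapping in the same shape as the character `vocab.json` from the XLSR example. The function name `learn_bpe_vocab`, the toy corpus, and the choice of `[PAD]`, `[UNK]` and `|` as special tokens are assumptions made for illustration; a real setup would more likely train the vocabulary with a dedicated library. Note also that, as far as I can tell, the stock Wav2Vec2CTCTokenizer splits text into single characters, so multi-character entries would only be used if its tokenization step is overridden.

```python
import json
from collections import Counter


def learn_bpe_vocab(corpus, num_merges):
    """Learn a toy BPE vocabulary from an iterable of transcripts."""
    # Represent every word as a tuple of symbols, starting from characters.
    word_counts = Counter()
    for line in corpus:
        for word in line.lower().split():
            word_counts[tuple(word)] += 1

    symbols = {ch for word in word_counts for ch in word}
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, count in word_counts.items():
            for left, right in zip(word, word[1:]):
                pair_counts[(left, right)] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        (left, right), _ = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        merged = left + right
        symbols.add(merged)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_counts = Counter()
        for word, count in word_counts.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == left and word[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_counts[tuple(out)] += count
        word_counts = new_counts

    # Assumed special tokens: [PAD] doubles as the CTC blank, "|" marks word boundaries.
    tokens = ["[PAD]", "[UNK]", "|"] + sorted(symbols, key=lambda s: (len(s), s))
    return {token: idx for idx, token in enumerate(tokens)}


# Hypothetical toy corpus; replace with the transcripts of the target language.
corpus = ["low lower lowest", "new newer newest"]
vocab = learn_bpe_vocab(corpus, num_merges=3)
vocab_json = json.dumps(vocab, ensure_ascii=False, indent=2)
```

The resulting `vocab_json` string could be written to a `vocab.json` file and used wherever the character vocabulary was used before; the size of the CTC output layer then has to match `len(vocab)`.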

SamuelAzran · May 28, 2021, 1:49pm

Thanks for the answer!
Any code examples or ideas on how to use a word piece tokenizer easily? As I understand it, I’ll basically need to override most of the functions in transformers/models/wav2vec2/tokenization_wav2vec2.py

tadf · June 3, 2021, 7:00am

You can look into sentencepiece.
Hope that helps!

wrice · July 30, 2021, 5:19pm

This can be accomplished by using the BertTokenizer and setting vocab_size to 30522. Keep in mind that you don’t want to reuse the existing lm_head weights in the Wav2Vec2ForCTC checkpoint, though, since they were trained for a different vocabulary. I did this with the TensorFlow version, but I don’t think there is a vocab limit on the PyTorch CTC loss either.
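
With a WordPiece vocabulary the CTC head predicts one piece id per audio frame, so decoding needs two steps: the usual CTC collapse, then gluing `##` continuation pieces back onto the preceding piece. The sketch below shows that idea in plain Python. The five-entry `id_to_piece` table and the `frame_ids` sequence are hypothetical stand-ins for the real 30522-entry BERT vocabulary and for the per-frame argmax of the model's logits, and using id 0 (`[PAD]`) as the CTC blank is an assumption that should be checked against the model config.

```python
def ctc_collapse(frame_ids, blank_id):
    """Greedy CTC post-processing: merge repeated ids, then drop blanks."""
    collapsed = []
    previous = None
    for idx in frame_ids:
        # Keep an id only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            collapsed.append(idx)
        previous = idx
    return collapsed


def join_wordpieces(pieces):
    """Attach '##' continuation pieces to the word that precedes them."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


# Hypothetical toy vocabulary; id 0 is assumed to be the CTC blank.
id_to_piece = {0: "[PAD]", 1: "fine", 2: "##tun", 3: "##ing", 4: "speech"}

# Hypothetical per-frame predictions, e.g. the argmax over the logits.
frame_ids = [0, 1, 1, 0, 2, 2, 3, 0, 0, 4, 4, 0]

pieces = [id_to_piece[i] for i in ctc_collapse(frame_ids, blank_id=0)]
text = join_wordpieces(pieces)
```

One consequence of this design is that the same piece can only appear twice in a row if the model emits a blank between the two occurrences, exactly as with repeated characters in a character vocabulary.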

noskid · October 27, 2021, 3:02am

Thanks for the answer!
I am also trying to implement this. Can I get any code examples for this? Thank you.
