Train phoneme recognizer using Wav2Vec2 intermediate features

Figure 2 of the Unsupervised Speech Recognition paper suggests that features taken from intermediate blocks of wav2vec 2.0 are better for training a phoneme recognizer than the final layer's output. I would like to test this hypothesis using the HuggingFace API, but I don't know how to do it.

I tried to do something similar to what is shown in the ASR fine-tuning tutorial, but I couldn't figure out how to tell the training algorithm which intermediate hidden state it should use. I thought that maybe the Trainer class (from the transformers module) could be parameterized to use a specific layer's output during fine-tuning, but it doesn't seem possible.
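For context, the farthest I got was reading the intermediate hidden states out of the model at inference time. Here is a minimal sketch of that (I'm building a tiny random-weight config just to keep the example self-contained; in practice I would load pretrained weights with something like `Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")`):

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Tiny illustrative config; real experiments would use pretrained weights instead
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=64,
    conv_dim=(32,) * 7,
)
model = Wav2Vec2Model(config)
model.eval()

wave = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz

with torch.no_grad():
    out = model(wave, output_hidden_states=True)

# out.hidden_states has num_hidden_layers + 1 entries: the feature-projection
# output followed by the output of each transformer block
layer_index = 2  # pick an intermediate block
features = out.hidden_states[layer_index]  # shape: (batch, frames, hidden_size)
```

So I can get the features I want with `output_hidden_states=True`, but I don't see how to make the fine-tuning loop train on `hidden_states[layer_index]` instead of the last layer.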

Could anyone please give me some hints on how to fine-tune a wav2vec 2.0 model using an intermediate layer's output?
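One workaround I considered is simply dropping the top transformer blocks before fine-tuning, so that the CTC head sits directly on top of the intermediate layer I care about. A sketch of the idea (again with a tiny random-weight config for illustration; the slicing trick is my own assumption, not something I've seen in the tutorial):

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Tiny illustrative config; in practice: Wav2Vec2ForCTC.from_pretrained(...)
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=64,
    conv_dim=(32,) * 7,
    vocab_size=40,
)
model = Wav2Vec2ForCTC(config)

# Keep only the first k transformer blocks, so the lm_head now consumes
# the k-th block's output rather than the final layer's
k = 2
model.wav2vec2.encoder.layers = model.wav2vec2.encoder.layers[:k]
model.config.num_hidden_layers = k

model.eval()
logits = model(torch.randn(1, 16000)).logits  # (batch, frames, vocab_size)
```

After this surgery the model could presumably be handed to Trainer exactly as in the tutorial, but I'm not sure whether truncating the encoder like this is equivalent to what the paper does, or whether there is a cleaner supported way.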

Thank you