Extracting HuBERT hidden units

fauxneticien · July 10, 2022, 6:28pm

I’d like to run analyses similar to the one in the supplementary material of the original wav2vec 2.0 paper (Appendix D), which looks at what kinds of phonetic/phonemic information self-supervised pre-trained models learn. To do that, I’m hoping to extract information related to the quantization process of the pre-trained wav2vec 2.0 and HuBERT models.

For wav2vec 2.0, I’ve been able to modify the Wav2Vec2GumbelVectorQuantizer class so that the forward function returns codebook-related information (see implementation here). Basically, for an audio file with N frames, I can get an N x 2 matrix in which each row is a speech frame and each column holds that frame’s index into one of the two codebooks:

get_codebook_indices("test.wav")
# [[ 54 269]
#  [146 284]
#  [ 28  18]
#   ... ...
#  [118 111]
#  [146 252]]
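
For readers who want to see the shape of that computation, here is a minimal sketch of a hard per-group argmax over quantizer logits. It is not the linked implementation: the function name `codebook_indices_from_logits`, the toy logits, and the choice of 320 entries per codebook are assumptions for illustration, and the real logits would have to come from the model's quantizer projection.

```python
import numpy as np


def codebook_indices_from_logits(logits: np.ndarray, num_groups: int = 2) -> np.ndarray:
    """Pick one codebook entry per group for every frame.

    logits: (num_frames, num_groups * entries_per_group) scores, one block
            of columns per codebook group (hypothetical input layout).
    Returns an integer array of shape (num_frames, num_groups).
    """
    num_frames, width = logits.shape
    if width % num_groups != 0:
        raise ValueError("logit width must be divisible by num_groups")
    # Split the columns into one block per codebook, then take the
    # highest-scoring entry within each block.
    per_group = logits.reshape(num_frames, num_groups, width // num_groups)
    return per_group.argmax(axis=-1)


# Toy input: 5 frames, 2 codebooks, 320 entries each (assumed sizes).
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(5, 2 * 320))
toy_indices = codebook_indices_from_logits(toy_logits, num_groups=2)
print(toy_indices.shape)  # (5, 2)
```

Each row of the result corresponds to one frame and each column to one codebook, matching the N x 2 layout shown above.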

Looking at the modeling_hubert.py code, it’s less clear where I might intercept information about which of the ‘hidden units’ each speech frame is assigned to. Any pointers would be appreciated! (@patrickvonplaten?)

Thanks!

patrickvonplaten · July 26, 2022, 1:49pm

Hey @fauxneticien,

This is indeed difficult at the moment because we haven’t implemented and verified HuBERT’s pretraining algorithm. Note that HuBERT doesn’t use a codebook like Wav2Vec2 does, but instead clusters hidden states for pretraining. Those hidden states don’t necessarily have to correspond to acoustic units, however.
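
To make the clustering idea concrete, here is a minimal sketch of the assignment step only: given frame-level features and a set of k-means centroids, each frame is mapped to its nearest centroid. This is an illustration under assumptions, not a verified reproduction of HuBERT's pretraining targets. The function name `assign_units` and the toy arrays are hypothetical; in practice the features might be hidden states from one layer of the model (e.g. obtained with `output_hidden_states=True`), and the centroids would have to be fit separately.

```python
import numpy as np


def assign_units(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest centroid.

    features:  (num_frames, dim) frame-level feature vectors.
    centroids: (num_units, dim) cluster centres, e.g. from k-means.
    Returns an integer array of shape (num_frames,).
    """
    # Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2,
    # giving a (num_frames, num_units) distance matrix.
    sq_dist = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return sq_dist.argmin(axis=1)


# Toy data: three 2-D "frames" and two centroids (made-up values).
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_features = np.array([[1.0, 0.0], [9.0, 11.0], [0.5, -0.5]])
print(assign_units(toy_features, toy_centroids))  # [0 1 0]
```

The resulting per-frame unit indices play a role loosely analogous to the codebook indices above, with the caveat already mentioned: clusters found this way need not line up with acoustic units.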
