I'd like to run analyses similar to the one in Appendix D of the original wav2vec 2.0 paper (screenshot below), which examines what kinds of phonetic/phonemic information self-supervised pre-trained models learn. To that end, I'm hoping to extract information about the quantization process of the pre-trained wav2vec 2.0 and HuBERT models.
For wav2vec 2.0, I've been able to modify the `Wav2Vec2GumbelVectorQuantizer` class so that its `forward` method returns codebook-related information (see implementation here). For a given audio file with N frames, I can get an N x 2 matrix in which each row corresponds to a speech frame and each column holds that frame's index into one of the two codebooks:
```python
get_codebook_indices("test.wav")
# [[ 54 269]
#  [146 284]
#  [ 28  18]
#  ...
#  [118 111]
#  [146 252]]
```
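For reference, here is a minimal sketch (not my linked implementation) of how the hard codebook assignment can be recovered from the quantizer itself: at inference time the Gumbel quantizer picks, per group, the codeword with the highest projected logit, so an argmax over `weight_proj`'s output reproduces the indices. A randomly initialized quantizer is used here just to keep the snippet self-contained; in practice you would take `model.quantizer` from a pretrained `Wav2Vec2ForPreTraining` checkpoint and feed it the CNN feature-encoder output.

```python
import torch
from transformers import Wav2Vec2Config
from transformers.models.wav2vec2.modeling_wav2vec2 import Wav2Vec2GumbelVectorQuantizer

# Default config: 2 codebook groups x 320 codewords per group.
config = Wav2Vec2Config()
quantizer = Wav2Vec2GumbelVectorQuantizer(config).eval()


def get_codebook_indices(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, frames, conv_dim[-1]) output of the CNN feature encoder."""
    logits = quantizer.weight_proj(features)              # (B, T, groups * vars)
    logits = logits.view(features.shape[0], features.shape[1],
                         quantizer.num_groups, -1)        # (B, T, 2, 320)
    # Hard assignment: index of the winning codeword in each group.
    return logits.argmax(dim=-1)                          # (B, T, 2)


frames = torch.randn(1, 5, config.conv_dim[-1])           # 5 fake frames
print(get_codebook_indices(frames).shape)                 # torch.Size([1, 5, 2])
```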
Looking at the `modeling_hubert.py` code, it's less clear where I might intercept information about which of the 'hidden units' each speech frame is assigned to. Any pointers would be appreciated! (@patrickvonplaten?)
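One complication worth noting: unlike wav2vec 2.0, HuBERT's hidden units are not produced inside the model at all. During pretraining they come from an offline k-means model fit on intermediate features (MFCCs in the first iteration, transformer-layer features in later ones), and `transformers` does not ship that k-means model. As a rough approximation, one could fit one's own k-means on hidden states from an intermediate layer; the sketch below uses a randomly initialized `HubertModel` purely to stay self-contained, and the choice of layer 6 and 50 clusters is an illustrative assumption, not the setup from the paper.

```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertConfig, HubertModel

# Randomly initialized model just for the sketch; use a pretrained
# checkpoint (e.g. via HubertModel.from_pretrained) in practice.
model = HubertModel(HubertConfig()).eval()

waveform = torch.randn(1, 32000)  # 2 s of fake 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# Hidden states from transformer layer 6 (an arbitrary illustrative choice);
# shape is (frames, hidden_size) after dropping the batch dimension.
feats = out.hidden_states[6][0]

# Fit k-means on the frame features and read off one pseudo-unit per frame.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(feats.numpy())
unit_ids = kmeans.labels_  # one cluster id ("hidden unit") per speech frame
```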