I'd like to run analyses similar to the one in Appendix D of the original wav2vec 2.0 paper (screenshot below), which examines what kinds of phonetic/phonemic information self-supervised pre-trained models learn. To that end, I'm hoping to extract information about the quantization process of the pre-trained wav2vec 2.0 and HuBERT models.
For wav2vec 2.0, I've been able to modify the `Wav2Vec2GumbelVectorQuantizer` class so that its `forward` method returns codebook-related information (see implementation here). For a given audio file with N frames, I can get an N x 2 matrix in which each row corresponds to a speech frame and each column holds that frame's index into one of the two codebooks:
```python
get_codebook_indices("test.wav")
# [[ 54 269]
#  [146 284]
#  [ 28  18]
#  ...
#  [118 111]
#  [146 252]]
```
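For reference, here is a minimal sketch (not my linked implementation) of how the hard codebook assignment can be recovered from the quantizer itself: at inference time the Gumbel quantizer picks, per group, the codeword with the highest projected logit, so an argmax over `weight_proj`'s output reproduces the indices. A randomly initialized quantizer is used here just to keep the snippet self-contained; in practice you would take `model.quantizer` from a pretrained `Wav2Vec2ForPreTraining` checkpoint and feed it the CNN feature-encoder output.

```python
import torch
from transformers import Wav2Vec2Config
from transformers.models.wav2vec2.modeling_wav2vec2 import Wav2Vec2GumbelVectorQuantizer

# Default config: 2 codebook groups x 320 codewords per group.
config = Wav2Vec2Config()
quantizer = Wav2Vec2GumbelVectorQuantizer(config).eval()


def get_codebook_indices(features: torch.Tensor) -> torch.Tensor:
    """features: (batch, frames, conv_dim[-1]) output of the CNN feature encoder."""
    logits = quantizer.weight_proj(features)              # (B, T, groups * vars)
    logits = logits.view(features.shape[0], features.shape[1],
                         quantizer.num_groups, -1)        # (B, T, 2, 320)
    # Hard assignment: index of the winning codeword in each group.
    return logits.argmax(dim=-1)                          # (B, T, 2)


frames = torch.randn(1, 5, config.conv_dim[-1])           # 5 fake frames
print(get_codebook_indices(frames).shape)                 # torch.Size([1, 5, 2])
```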
Looking at the `modeling_hubert.py` code, it's less clear where I might intercept information about which of the 'hidden units' each speech frame is assigned to. Any pointers would be appreciated! (@patrickvonplaten?)
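One complication worth noting: unlike wav2vec 2.0, HuBERT's hidden units are not produced inside the model at all. During pretraining they come from an offline k-means model fit on intermediate features (MFCCs in the first iteration, transformer-layer features in later ones), and `transformers` does not ship that k-means model. As a rough approximation, one could fit one's own k-means on hidden states from an intermediate layer; the sketch below uses a randomly initialized `HubertModel` purely to stay self-contained, and the choice of layer 6 and 50 clusters is an illustrative assumption, not the setup from the paper.

```python
import torch
from sklearn.cluster import KMeans
from transformers import HubertConfig, HubertModel

# Randomly initialized model just for the sketch; use a pretrained
# checkpoint (e.g. via HubertModel.from_pretrained) in practice.
model = HubertModel(HubertConfig()).eval()

waveform = torch.randn(1, 32000)  # 2 s of fake 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# Hidden states from transformer layer 6 (an arbitrary illustrative choice);
# shape is (frames, hidden_size) after dropping the batch dimension.
feats = out.hidden_states[6][0]

# Fit k-means on the frame features and read off one pseudo-unit per frame.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(feats.numpy())
unit_ids = kmeans.labels_  # one cluster id ("hidden unit") per speech frame
```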