Extracting HuBERT hidden units

For analyses similar to the one in the supplemental proceedings of the original wav2vec 2.0 paper about what kinds of phonetic/phonemic information self-supervised pre-trained models learn (screenshot below from Appendix D), I'm hoping to extract information related to the quantization process of the pre-trained wav2vec 2.0 and HuBERT models.

For wav2vec 2.0, I've been able to modify the Wav2Vec2GumbelVectorQuantizer class so that the forward function returns codebook-related information (see implementation here). Basically, for a given audio file with N frames, I can get an N x 2 matrix where the value in each column is the index of the entry that frame was assigned to in one of the two codebooks:

# [[ 54 269]
#  [146 284]
#  [ 28  18]
#   ... ...
#  [118 111]
#  [146 252]]
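For anyone wanting to reproduce this kind of matrix without patching the class, the inference-time path of the quantizer is just a linear projection followed by a per-group argmax. Below is a minimal sketch of that logic with random stand-in weights (the real values would come from the checkpoint's `weight_proj` layer; the dimensions here mirror the BASE config's two groups of 320 codewords, but are assumptions for illustration):

```python
import numpy as np

# Assumed dimensions: 2 codebooks (groups) of 320 entries each,
# projecting from a 512-dim feature frame. Real models load these
# from the checkpoint; random weights are used here as a stand-in.
num_groups, num_vars, hidden_dim = 2, 320, 512
rng = np.random.default_rng(0)
W = rng.standard_normal((hidden_dim, num_groups * num_vars))

def codebook_indices(frames):
    """Return an (N, num_groups) matrix of hard codebook assignments.

    Mirrors the quantizer's non-training path: project each frame to
    logits, split the logits into groups, and take the argmax per group.
    """
    logits = frames @ W                    # (N, num_groups * num_vars)
    logits = logits.reshape(-1, num_vars)  # (N * num_groups, num_vars)
    idx = logits.argmax(axis=-1)           # hard assignment per group
    return idx.reshape(-1, num_groups)     # (N, num_groups)

frames = rng.standard_normal((5, hidden_dim))  # 5 dummy speech frames
print(codebook_indices(frames).shape)          # (5, 2)
```

With real checkpoint weights substituted for `W`, each row would match the two-column index matrix shown above.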

Looking at the modeling_hubert.py code, it's less clear where I might intercept information about which of the 'hidden units' each of the speech frames is assigned to. Any pointers would be appreciated! (@patrickvonplaten?)


Hey @fauxneticien,

This is currently indeed difficult because we haven't implemented and verified HuBERT's pretraining algorithm. Note that HuBERT doesn't use a codebook like wav2vec 2.0 does, but instead clusters hidden states for pretraining. Those hidden states don't necessarily have to correspond to acoustic units, however.
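Since HuBERT's targets are offline cluster IDs rather than codebook entries, one way to approximate "hidden units" is to run k-means over extracted features yourself (the fairseq recipe clusters MFCCs for the first iteration and intermediate hidden states for later ones). Here is a minimal Lloyd's-algorithm sketch over dummy features; the cluster count and dimensions are placeholders, not values from any released checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_units(feats, k, iters=10):
    """Assign each frame a pseudo-unit ID via plain k-means (Lloyd's)."""
    # Initialize centers from k randomly chosen frames.
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        # Squared distance of every frame to every center.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)          # hard assignment per frame
        for j in range(k):                 # recompute each center
            mask = labels == j
            if mask.any():
                centers[j] = feats[mask].mean(axis=0)
    return labels

# Dummy "hidden states": 400 frames of 768-dim features.
hidden = rng.standard_normal((400, 768))
units = kmeans_units(hidden, k=20, iters=5)
print(units.shape)  # one pseudo-unit ID per frame: (400,)
```

Swapping the dummy array for features extracted from a HuBERT checkpoint (e.g. an intermediate transformer layer's output) would give frame-level unit assignments analogous to the wav2vec 2.0 index matrix above, with the caveat Patrick notes: these clusters need not correspond to acoustic units.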