Resources on interpretability of wav2vec-style speech models

Hello everyone

Big thanks to Hugging Face for creating this amazing framework, and to the active community as well! I've been using the library for a while now and have been reading this forum too.

I am working on multilingual speech models and am interested in understanding how pre-trained wav2vec-style models represent input utterances (from a phonetics perspective, if possible). For example, I would like to know how a language-identification model like the "VoxLingua107 Wav2Vec Spoken Language Identification Model" goes about representing a collection of short utterances in English versus, say, Thai.

The most straightforward method I know is to take the final-layer output embeddings (in inference mode) and cluster them with t-SNE, but this hasn't helped much so far.
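For concreteness, here is a minimal sketch of that baseline: mean-pool the final-layer hidden states into one vector per utterance, then project with t-SNE. The model-loading lines are shown only as comments; random vectors stand in for real pooled embeddings, and the two-group offset is an assumption purely for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# In practice the embeddings would come from a pre-trained model, e.g.:
#   from transformers import Wav2Vec2Model
#   model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
#   hidden = model(input_values).last_hidden_state   # (batch, frames, 768)
#   emb = hidden.mean(dim=1).detach().numpy()        # mean-pool over time
# Here random vectors stand in for pooled utterance embeddings,
# with a small mean shift simulating two languages (e.g. English vs. Thai).
rng = np.random.default_rng(0)
english = rng.normal(0.0, 1.0, size=(50, 768))
thai = rng.normal(0.5, 1.0, size=(50, 768))
emb = np.vstack([english, thai])

# Project the 768-d embeddings to 2-D; perplexity must be < n_samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
print(coords.shape)  # (100, 2)
```

One caveat with this approach: mean-pooling over time discards exactly the frame-level (phonetic) structure you may care about, which could explain why the resulting clusters are uninformative; probing individual layers or frame-level representations is a common alternative.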

I am looking for literature, code, frameworks (like Captum), and tutorials that use wav2vec-style models and focus on interpretability. Any pointers would be much appreciated. Thank you!