A hypothetical question on multi-headed Wav2Vec2 / HuBERT models

There are a number of models in the registry that use Wav2Vec2 or HuBERT as the base model for tasks such as classification. Since the original motivation for Wav2Vec2 was ASR, I'm wondering whether there has been any work on combining the two in a single model, e.g. audio in and [class_label, transcription] out?
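
To make the question concrete, here is a rough sketch of what I have in mind: one shared encoder feeding two heads, a per-frame CTC head for the transcription and a mean-pooled head for the class label. Everything here is an assumption for illustration. `TwoHeadedSpeechModel`, `ToyEncoder` and `joint_loss` are hypothetical names, the toy convolution only stands in for a pretrained Wav2Vec2 / HuBERT encoder, and the 0.5 loss weight is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a pretrained Wav2Vec2 / HuBERT encoder (NOT the real architecture)."""

    def __init__(self, hidden_size=32):
        super().__init__()
        # One strided convolution turns raw samples into a sequence of frames.
        self.conv = nn.Conv1d(1, hidden_size, kernel_size=400, stride=320)

    def forward(self, waveform):                  # (batch, samples)
        x = self.conv(waveform.unsqueeze(1))      # (batch, hidden, frames)
        return x.transpose(1, 2)                  # (batch, frames, hidden)


class TwoHeadedSpeechModel(nn.Module):
    """Shared encoder with a classification head and a CTC (transcription) head."""

    def __init__(self, encoder, hidden_size, vocab_size, num_classes):
        super().__init__()
        self.encoder = encoder
        self.ctc_head = nn.Linear(hidden_size, vocab_size)   # applied to every frame
        self.cls_head = nn.Linear(hidden_size, num_classes)  # applied to the pooled utterance

    def forward(self, waveform):
        hidden = self.encoder(waveform)
        ctc_logits = self.ctc_head(hidden)               # (batch, frames, vocab)
        cls_logits = self.cls_head(hidden.mean(dim=1))   # (batch, num_classes)
        return cls_logits, ctc_logits


def joint_loss(cls_logits, ctc_logits, class_labels, transcripts,
               transcript_lengths, ctc_weight=0.5):
    """Weighted sum of CTC loss and cross-entropy; the weighting is an assumption."""
    # F.ctc_loss expects log-probabilities shaped (frames, batch, vocab).
    log_probs = ctc_logits.log_softmax(dim=-1).transpose(0, 1)
    # Assume no padding on the audio side, so every item uses all frames.
    input_lengths = torch.full((ctc_logits.size(0),), ctc_logits.size(1), dtype=torch.long)
    ctc = F.ctc_loss(log_probs, transcripts, input_lengths, transcript_lengths, blank=0)
    ce = F.cross_entropy(cls_logits, class_labels)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce


model = TwoHeadedSpeechModel(ToyEncoder(32), hidden_size=32, vocab_size=30, num_classes=4)
cls_logits, ctc_logits = model(torch.randn(2, 16000))  # two one-second clips at 16 kHz
```

With a real pretrained encoder the two heads would share all encoder weights, so the open question is mostly whether the two objectives help or hurt each other during fine-tuning.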

Is it a viable idea?

