Audio event embeddings from existing pretrained transformer models

I want to use existing pretrained audio transformer models as audio event embedding extractors. This way I can generate latent feature representations of my input audio events (from time-domain audio or spectrograms).
All the examples on Hugging Face either run inference on a given audio clip or fine-tune a transformer-based classifier.

Any links to examples where we get the embeddings (encoder outputs), i.e. the latent-space representations of the input before it is fed to the classifier?
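For reference, here is a minimal sketch of what I have in mind, assuming a wav2vec2-style checkpoint (`facebook/wav2vec2-base` is just a placeholder; any audio encoder checkpoint should work similarly): load the bare encoder with `AutoModel` instead of the classification head, and take `last_hidden_state` as the embedding.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Assumption: using facebook/wav2vec2-base as an example encoder checkpoint
model_name = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # bare encoder, no classifier head
model.eval()

# 1 second of dummy audio at 16 kHz; replace with your audio event waveform
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings: (batch, time_frames, hidden_size)
frame_embeddings = outputs.last_hidden_state
# One clip-level embedding per audio event via mean-pooling over time
clip_embedding = frame_embeddings.mean(dim=1)
print(frame_embeddings.shape, clip_embedding.shape)
```

Is this the intended way, or is there a more standard API for pulling out these embeddings?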

@reach-vb @osanseviero any leads would be helpful.