I want to use an existing pretrained audio transformer model as an audio-event embedding extractor, so I can generate latent feature representations of my input audio events (time-domain audio or spectrograms).
All the examples on Hugging Face either run inference on a given audio clip or fine-tune a transformer-based classifier.
Any links to examples where we get the embeddings (encoder outputs), i.e. the latent-space representations of the input before they are passed to the classifier?
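For reference, here is a minimal sketch of what I'm after, using `AutoModel` (which loads the base encoder without a classification head) and mean-pooling the frame-level hidden states into one clip-level embedding. The checkpoint name is just an example; any Wav2Vec2-style audio model should work, and the random waveform stands in for real audio:

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Example checkpoint; swap in any pretrained audio encoder
model_name = "facebook/wav2vec2-base-960h"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # base model, no classifier head
model.eval()

# Dummy 1-second mono waveform at 16 kHz standing in for a real audio event
waveform = torch.randn(16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings: (batch, time_frames, hidden_size)
embeddings = outputs.last_hidden_state
# One clip-level embedding via mean pooling over the time axis
clip_embedding = embeddings.mean(dim=1)
print(embeddings.shape, clip_embedding.shape)
```

Is this the intended way to do it, or is there a recommended pipeline/API for pulling encoder outputs directly?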
@reach-vb @osanseviero any leads would be helpful.