Hi! I’d like to know why the AST (Audio Spectrogram Transformer) model prepends a distillation token to the flattened audio patch embeddings, in addition to the standard [CLS] token. What is the role of this token, given that the original AST model does not use it?
I include the script I’m referring to:
At line 54 there’s the definition of the ASTEmbeddings class, where the [CLS] and distillation tokens are created and then used in the forward pass.
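
To make the question concrete, here is a minimal sketch of what I understand that class to be doing. It is not the actual library code: the class name `TokenPrependSketch`, the tensor sizes, and the zero initialization are my own assumptions for illustration, and the patch projection itself is left out.

```python
import torch
from torch import nn


class TokenPrependSketch(nn.Module):
    """Hypothetical, simplified stand-in for the embedding step (not the library code)."""

    def __init__(self, hidden_size: int = 8, num_patches: int = 4):
        super().__init__()
        # One learnable vector each for [CLS] and the distillation token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.distillation_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        # Two extra positions: one for [CLS], one for the distillation token.
        self.position_embeddings = nn.Parameter(
            torch.zeros(1, num_patches + 2, hidden_size)
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        batch_size = patch_embeddings.shape[0]
        # Broadcast the two special tokens across the batch.
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        distillation_tokens = self.distillation_token.expand(batch_size, -1, -1)
        # Sequence layout: [CLS], distillation, patch_1, ..., patch_N.
        embeddings = torch.cat(
            (cls_tokens, distillation_tokens, patch_embeddings), dim=1
        )
        return embeddings + self.position_embeddings


# Assumed toy input: batch of 2, 4 flattened patches, hidden size 8.
patches = torch.randn(2, 4, 8)
output = TokenPrependSketch()(patches)
print(output.shape)  # torch.Size([2, 6, 8])
```

So the sequence grows by two positions rather than one, and my question is what the second extra position is for.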