I'm trying to understand why SDPA and Flash Attention are incompatible with `output_attentions`.
I'm trying to improve the performance of my Whisper setup and want to try one of these attention implementations instead of eager. For my application, though, I need word-level timestamps, and that seems to work only with `eager` attention.
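For reference, this is roughly my setup (the model id and audio are placeholders for my real inputs):

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-small"  # placeholder; my real checkpoint differs

# Explicitly request SDPA instead of eager attention
model = WhisperForConditionalGeneration.from_pretrained(model_id, attn_implementation="sdpa")
processor = WhisperProcessor.from_pretrained(model_id)

audio = np.random.randn(16000 * 5).astype(np.float32)  # stand-in for 5 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# return_token_timestamps is what I need for word-level timing, and it's the
# flag that seems to force output_attentions=True under the hood
out = model.generate(inputs.input_features, return_token_timestamps=True)
```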
It seems like in the code, `return_token_timestamps` sets `output_attentions` to `True`. Is that necessary for the model to output token timestamps? I couldn't fully trace what that flag does in this case; maybe someone can help.
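The code I mean is in `WhisperForConditionalGeneration.generate`; paraphrasing from memory (not a verbatim quote), it does something like:

```python
# Paraphrase of what I found in generation_whisper.py (not verbatim):
# when token timestamps are requested, generate() forces the attention
# weights to be included in the generation output.
if return_token_timestamps:
    kwargs["output_attentions"] = True
    kwargs["return_dict_in_generate"] = True
```

Further down, the cross-attention weights look like they get fed into `_extract_token_timestamps`, which appears to run dynamic time warping over them to place the tokens in time, so maybe the weights really are required?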
In `WhisperSdpaAttention` and `WhisperFlashAttention2`, the forward pass short-circuits, saying that `output_attentions` isn't compatible with those implementations. But as far as I can tell, the `eager` implementation doesn't do anything special for `output_attentions`, so what makes these two incompatible with it?
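Concretely, the short circuits I mean look roughly like this (paraphrased from `modeling_whisper.py`, not verbatim):

```python
# WhisperFlashAttention2.forward refuses outright, roughly:
if output_attentions:
    raise ValueError("WhisperFlashAttention2 attention does not support output_attentions")

# WhisperSdpaAttention.forward warns and falls back instead, roughly:
if output_attentions or layer_head_mask is not None:
    # logs that scaled_dot_product_attention doesn't support output_attentions=True
    return super().forward(...)  # i.e. the eager WhisperAttention path
```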
I thought `output_attentions` just collected the attention weights and returned them. Why would that interfere with these other attention implementations?
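My best guess: eager attention materializes the softmax weights as an actual tensor (which is what `output_attentions` collects), whereas SDPA runs the whole computation inside a fused kernel and only ever returns the output. A minimal sketch of what I mean, with made-up shapes:

```python
import math
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim), arbitrary sizes for illustration
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))

# "Eager" attention: the weight matrix exists as a real tensor,
# so it can be collected and returned via output_attentions
attn_weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
eager_out = attn_weights @ v

# SDPA: same math, fused kernel, only the output comes back;
# there is no weights tensor to hand out
sdpa_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(eager_out, sdpa_out, atol=1e-5))  # True, up to numerics
```

If that's the right picture, is falling back to the eager path the only way to get the weights out?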
If anyone can help me understand this better, I’d appreciate it!