[Whisper] Help me understand output_attentions & Whisper's attention mechanism options

I’m trying to understand why SDPA and Flash Attention are incompatible with `output_attentions`.

I’m trying to improve the performance of my Whisper setup and want to try one of these attention implementations instead of eager. However, my application needs word-level timestamps, which only seem to work with ‘eager’ attention. Is that right?

It seems that in the code, `return_token_timestamps` sets `output_attentions` to `True`. Is that actually necessary for the model to produce token timestamps? I couldn’t trace what the attentions are used for in this case; maybe someone can help.
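For context, my (possibly wrong) understanding is that the word-level timestamps come from aligning the decoder’s cross-attention weights against the audio frames with a DTW-style pass, which would explain why the attentions need to be returned at all. Here’s a toy sketch of that idea — the function names, shapes, and the 0.02 s frame duration are my assumptions for illustration, not the actual transformers code:

```python
import numpy as np

def dtw_path(cost):
    """Monotonic alignment path through a cost matrix.
    Toy DTW for illustration, not the transformers implementation."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the corner to recover the alignment path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy cross-attention matrix: 3 decoded tokens attending over 6 audio frames.
rng = np.random.default_rng(0)
attn = rng.random((3, 6))
attn /= attn.sum(axis=1, keepdims=True)  # rows behave like softmax weights

path = dtw_path(-attn)  # high attention = low alignment cost
# First audio frame each token aligns to, scaled by an assumed frame
# duration of 0.02 s (illustrative, not pulled from the real code).
timestamps = {tok: 0.02 * min(f for t, f in path if t == tok) for tok in range(3)}
print(timestamps)
```

If something like this is what actually happens internally, it would at least make it clear why the attention weights have to be collected.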

In `WhisperSdpaAttention` and `WhisperFlashAttention2`, the forward pass short-circuits with a warning that `output_attentions` isn’t compatible with the implementation, but as far as I can tell, the ‘eager’ implementation doesn’t do anything special for `output_attentions`. So what makes these two incompatible with it?

I thought `output_attentions` just collected the attention weights and returned them; why would that interfere with these other attention implementations?
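To make my mental model concrete (please correct me if it’s wrong): the eager path explicitly materializes the full `softmax(QKᵀ/√d)` matrix as an intermediate, so returning it for `output_attentions` is free, whereas my understanding is that the fused SDPA/Flash kernels compute the output in tiles and never build that matrix, leaving nothing to return. A numpy sketch of the eager computation, purely illustrative and not the transformers code:

```python
import numpy as np

def eager_attention(q, k, v):
    """Eager attention: explicitly builds the (seq_q, seq_k) weight
    matrix, so it can hand the weights back for output_attentions."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    # Fused kernels (SDPA / FlashAttention) produce `out` block-by-block
    # without ever materializing `weights`; that, as I understand it, is
    # why they can't satisfy output_attentions.
    return out, weights

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 4, 8))  # toy: seq_len 4, head_dim 8
out, weights = eager_attention(q, k, v)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

If that’s accurate, then the incompatibility is about the weights never existing inside the fused kernel, not about the collection step itself interfering with anything.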

If anyone can help me understand this better, I’d appreciate it!