Is the MultiQuery Attention implemented in GPTBigCodeModel actually Flash Attention? It looks like plain MQA to me, but the paper says they used Flash Attention, so I am a bit confused.
You are right. It is a plain MQA implementation in the transformers version.
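Roughly, "plain MQA" means a single key/value head is shared by all query heads, but the full seq x seq attention matrix is still materialized, exactly as in standard attention. A minimal sketch of that idea (not the actual GPTBigCode code; tensor shapes and layout are assumptions for illustration):

```python
import torch

def plain_mqa(q, k, v):
    """Multi-query attention with the attention matrix materialized.

    q: (batch, n_heads, seq, head_dim)
    k, v: (batch, 1, seq, head_dim)  -- one shared key/value head
    """
    scale = q.shape[-1] ** -0.5
    # Broadcasting over the head dimension: every query head attends to the same K/V.
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale   # (batch, n_heads, seq, seq)
    seq_len = q.shape[-2]
    causal_mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(causal_mask, float("-inf"))
    attn = torch.softmax(scores, dim=-1)                     # full seq x seq matrix in memory
    return torch.matmul(attn, v)                             # (batch, n_heads, seq, head_dim)
```

Flash Attention computes the same result but in a fused kernel that never materializes that seq x seq matrix, which is where the memory and speed difference comes from.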
Flash Attention is implemented in the Megatron-LM version:
(Megatron-LM/megatron/model/transformer.py at multi-query-attention · bigcode-project/Megatron-LM · GitHub)
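If you just want a fused, memory-efficient kernel without switching to the Megatron-LM code, one workaround (assuming PyTorch 2.x; this is not what the paper's training code did) is to call torch.nn.functional.scaled_dot_product_attention yourself, which can dispatch to a FlashAttention kernel on supported GPUs. A rough sketch with made-up shapes:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 16 query heads, 1 shared K/V head, head_dim 64.
q  = torch.randn(1, 16, 2048, 64, device="cuda", dtype=torch.float16)
kv = torch.randn(1, 1, 2048, 64, device="cuda", dtype=torch.float16)

# Expand the single K/V head across the query heads so SDPA sees matching shapes;
# with fp16 CUDA tensors and is_causal=True, PyTorch can use a Flash-style kernel.
out = F.scaled_dot_product_attention(
    q, kv.expand(-1, 16, -1, -1), kv.expand(-1, 16, -1, -1), is_causal=True
)
```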