Is Flash Attention implemented in GPTBigCodeModel?

Is the multi-query attention (MQA) implemented in GPTBigCodeModel actually Flash Attention? It looks like plain MQA to me, but the paper says they used Flash Attention, so I am a bit confused.

Code Link: transformers/src/transformers/models/gpt_bigcode/ at v4.30.0 · huggingface/transformers · GitHub

You are right. The transformers version linked above (v4.30.0) is a plain MQA implementation and does not use Flash Attention.

Flash Attention is implemented in the Megatron-LM version of the model code.
(Megatron-LM/megatron/model/ at multi-query-attention · bigcode-project/Megatron-LM · GitHub)
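
In case it helps with the confusion: MQA and Flash Attention are independent of each other. MQA is an architecture choice (all query heads share a single key/value head), while Flash Attention is an exact, memory-efficient way of computing the same softmax attention block by block. Here is a minimal pure-Python sketch of that idea. It is not the transformers or Megatron-LM code: the function names, the toy data, and the block size are made up for illustration, it handles a single query position with no causal mask, and it only shows the running-softmax arithmetic, not the fused GPU kernel.

```python
import math

def naive_attention(q, keys, values):
    """Plain softmax attention for one query vector; builds all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Same result, computed one key/value block at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

def multi_query_attention(query_heads, keys, values, attend):
    """MQA: every query head attends over the same shared keys and values."""
    return [attend(q, keys, values) for q in query_heads]

# Toy data: 3 query heads, 5 key/value positions, head dimension 2.
query_heads = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
keys = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

plain = multi_query_attention(query_heads, keys, values, naive_attention)
tiled = multi_query_attention(query_heads, keys, values, tiled_attention)
```

Both calls use the same MQA layout and give the same outputs up to floating-point error, so which kernel is used is an implementation detail of each codebase rather than a difference in the model itself.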