Is Flash Attention implemented in GPTBigCodeModel?

dawn17 · June 27, 2023, 8:23am

Is the multi-query attention (MQA) implemented in GPTBigCodeModel actually Flash Attention? It looks like plain MQA to me, but the paper says they used Flash Attention, so I am a bit confused.

Code Link: transformers/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py at v4.30.0 · huggingface/transformers · GitHub

kernelpanic · July 6, 2023, 11:23pm

You are right. The transformers version is a plain MQA implementation.

Flash Attention is implemented in the Megatron-LM version.
(Megatron-LM/megatron/model/transformer.py at multi-query-attention · bigcode-project/Megatron-LM · GitHub)
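
For context, the two ideas are independent. MQA is an architectural choice (all query heads share a single key/value head), while Flash Attention is a way of computing exact attention in blocks with a running softmax, so the full score matrix is never materialized. Below is a minimal pure-Python sketch of that relationship. It is not taken from either codebase: the function names, the `block` parameter, and the toy inputs are made up for illustration, and the causal mask is left out for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plain_mqa(queries, keys, values):
    """Plain multi-query attention (no causal mask).

    queries: [n_heads][seq][d]; keys and values: [seq][d], shared by every head.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for head_q in queries:
        head_out = []
        for q in head_q:
            # Full row of scores against every key, then an ordinary softmax.
            scores = [dot(q, k) * scale for k in keys]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            head_out.append(
                [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
            )
        out.append(head_out)
    return out


def tiled_mqa(queries, keys, values, block=2):
    """Same result, but keys/values are consumed block by block with a
    running (online) softmax, which is the core idea behind Flash Attention."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for head_q in queries:
        head_out = []
        for q in head_q:
            running_max = -math.inf
            running_sum = 0.0
            acc = [0.0] * len(values[0])
            for start in range(0, len(keys), block):
                k_blk = keys[start:start + block]
                v_blk = values[start:start + block]
                scores = [dot(q, k) * scale for k in k_blk]
                new_max = max(running_max, max(scores))
                # Rescale what was accumulated under the old maximum.
                correction = math.exp(running_max - new_max)
                running_sum *= correction
                acc = [a * correction for a in acc]
                for s, v in zip(scores, v_blk):
                    p = math.exp(s - new_max)
                    running_sum += p
                    acc = [a + p * vj for a, vj in zip(acc, v)]
                running_max = new_max
            head_out.append([a / running_sum for a in acc])
        out.append(head_out)
    return out


# Toy inputs: 2 query heads, sequence length 3, head dimension 2.
Q = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, -0.5], [2.0, 1.0], [-1.0, 0.0]],
]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

reference = plain_mqa(Q, K, V)
tiled = tiled_mqa(Q, K, V, block=2)
```

Both functions return the same values up to floating-point rounding, so a model trained with a Flash Attention kernel can still be run with a plain MQA implementation. The difference is in memory use and speed, not in the output.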
