Flash attention has no effect on inference

Changing the transformers version doesn't seem to affect anything.

I found out that I had some NVTX calls which added overhead, but even after removing them, Flash Attention is still slower on Mistral. I've slightly modified the script you linked in your comment. These are the results:

(screenshot of benchmark results)
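
For reference, this is roughly the kind of timing comparison I mean. It's a minimal sketch, not the exact modified script: the checkpoint name, prompt, and token count below are placeholders I picked for illustration.

```python
# Minimal sketch: time Mistral generation with and without Flash Attention 2.
# The checkpoint, prompt, and NEW_TOKENS are assumed values, not the ones
# from the script linked above.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
PROMPT = "The quick brown fox"            # placeholder prompt
NEW_TOKENS = 256

def benchmark(attn_implementation: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        attn_implementation=attn_implementation,  # "flash_attention_2" or "eager"
    ).to("cuda")
    inputs = tokenizer(PROMPT, return_tensors="pt").to("cuda")

    # Warm-up run so kernel compilation/caching is not included in the timing.
    model.generate(**inputs, max_new_tokens=NEW_TOKENS, do_sample=False)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=NEW_TOKENS, do_sample=False)
    torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
    return time.perf_counter() - start

for impl in ("eager", "flash_attention_2"):
    print(f"{impl}: {benchmark(impl):.2f}s for {NEW_TOKENS} new tokens")
```

Note the explicit `torch.cuda.synchronize()` calls and the warm-up pass; without them (or with NVTX ranges left in) the timings can be misleading.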