Hi,
I would like to ask whether it is possible to load and fine-tune Phi-3 Small 8k. When I load the model, I get an error about FlashAttention being missing, and when I try to install the flash-attn package, the installation fails with this error:
RuntimeError: FlashAttention is only supported on CUDA 11.6 and above. Note: make sure nvcc has a supported version by running nvcc -V.
torch.__version__ = 2.3.1+cu121
But I have the required versions of PyTorch and CUDA (torch 2.3.1 and CUDA 12.1).
Is it because I am using a Tesla V100 GPU? Is there any way to load the model on this GPU anyway?
I found this in the documentation for Phi-3 Mini on Hugging Face:
If you want to run the model on:
NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with attn_implementation="eager"
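For reference, this is how I read that recommendation for the Mini model (the model ID is only my assumption for illustration, not something I copied from my setup):

from transformers import AutoModelForCausalLM

# Load with the "eager" attention implementation instead of FlashAttention,
# as the model card suggests for V100 or earlier GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",  # assumed model ID
    attn_implementation="eager",
)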
Does this also apply to Phi-3 Small 8k? Because when I try to load it that way, I get the error below:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "path", num_labels=num_labels, attn_implementation="eager"
)
AssertionError: Flash Attention is not available, but is needed for dense attention
Or should I try the ONNX version, or is that intended only for inference?
Thank you.