Custom 4D attention masks cause OOM during training

I am trying to do sequence packing with 4D attention masks on Phi3-mini-4k-instruct, so that attention is restricted to each individual sequence inside a packed sequence, but I always get OOM… Any advice on this? Could we get an example of usage?
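For reference, this is roughly what I mean by restricting attention, as a minimal sketch of my own (the helper name and example lengths are placeholders; the 1 = attend / 0 = masked convention and the (batch, 1, q_len, kv_len) shape follow the 4D-mask support added in recent Transformers releases, so double-check against your version):

```python
import torch

def build_packed_4d_mask(seq_lengths, dtype=torch.float32):
    """Block-diagonal causal mask for several sequences packed into one row."""
    total_len = sum(seq_lengths)
    mask = torch.zeros(total_len, total_len, dtype=dtype)
    start = 0
    for length in seq_lengths:
        end = start + length
        # each packed sub-sequence attends only to itself, causally
        mask[start:end, start:end] = torch.tril(torch.ones(length, length, dtype=dtype))
        start = end
    # shape expected for a custom 4D mask: (batch_size, 1, query_len, key_value_len)
    return mask.unsqueeze(0).unsqueeze(0)

# e.g. three sequences of lengths 5, 3 and 4 packed into one row of length 12
mask_4d = build_packed_4d_mask([5, 3, 4])
print(mask_4d.shape)  # torch.Size([1, 1, 12, 12])
```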

It is pretty much similar to the packing described here: 4D masks support in Transformers (huggingface.co)

What can I say…
I can hardly find any examples of it actually in use. I think the library support is still generally underdeveloped for that use case.

Indeed, thanks for sharing these issues. Since my version is quite recent (4.40.2), I don’t think the second one applies… For more context, I am seeing OOMs of hundreds of gigabytes, which does not make sense. I tried reducing the batch size and other memory strategies without success. If I change my attention masks to regular ones, it works fine… which I don’t understand, since behind the scenes they should be converted to 4D as well, based on what I have seen in the codebase.
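To show what I mean by that conversion, here is a rough, simplified sketch of my understanding (my own illustration, not the actual library code):

```python
import torch

seq_len = 4
padding_mask_2d = torch.tensor([[1, 1, 1, 0]])  # regular 2D mask, 0 = padding

# simplified view of the internal expansion: broadcast the padding mask
# against a causal (lower-triangular) mask to get (batch, 1, q_len, kv_len)
causal = torch.tril(torch.ones(seq_len, seq_len))
expanded_4d = padding_mask_2d[:, None, None, :] * causal
print(expanded_4d.shape)  # torch.Size([1, 1, 4, 4])
```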

OOMs of hundreds of gigabytes, which does not make sense

I don’t know if it’s your code, your model, or the HF library, but something is definitely wrong. :sweat_smile:

I’d say the HF library is the most likely culprit. The next most likely cause is the options passed to the library, such as the data type or device specification. Torch itself is a rare case, since there shouldn’t be any major changes there right now.
A wrong model structure would be enough to cause it, but it’s hard to find a case where a mistake in your own code would do that, unless it was intentional.
Articles are usually written assuming the GitHub version, so I think you should install Transformers from source with pip install git+. It’s so-so, with frequent bugs, but there’s no real harm unless it’s a commercial project. Also, in a GPU environment, the presence or absence of the accelerate library, its version, and the state of its bugs can have a big impact. I would not install the development version of accelerate, though; that one I’m afraid to touch.

Looks like it might be related to the flash attention 2 implementation, especially how it is implemented for Phi3… When I switch to eager attention, training starts, but it is much less efficient. Is anyone familiar with how to adapt flash attention 2 to custom 4D masks?
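For reference, this is roughly how I switch to eager attention (a minimal sketch assuming transformers 4.40.x; trust_remote_code and the exact 1/0 mask convention that eager attention accepts may differ by version, so check the modeling code of your install):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # flash_attention_2 expects the usual 2D padding mask
    trust_remote_code=True,       # may be needed on versions without native Phi-3 support
)

inputs = tokenizer("a short packed example", return_tensors="pt")
seq_len = inputs.input_ids.shape[1]
# plain causal 4D mask here; in practice this would be the block-diagonal
# packed mask from the sketch earlier in the thread
mask_4d = torch.tril(torch.ones(1, 1, seq_len, seq_len))
outputs = model(input_ids=inputs.input_ids, attention_mask=mask_4d)
```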

It might be faster to find the HF account of someone who seems to know more about this and send them a mention (@ plus their username). Or we can look for other ways to speed things up.
