I am trying to do packing with 4D attention masks with Phi-3-mini-4k-instruct, to restrict attention to each individual sequence within one packed sequence, but I always get OOM… Any advice on this? Could we get an example of usage?
What can I say…
I can hardly find any examples of this actually being used. I think library support for this kind of packing is still underdeveloped in general.
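About the closest thing I can offer is a rough sketch (my own, untested on Phi-3, and the exact convention may differ between transformers versions) of how a block-diagonal 4D mask for a packed sequence is usually built, following the additive convention from PR #27539: 0.0 where attention is allowed, the dtype minimum where it is blocked. The shape `(1, 1, seq_len, seq_len)` and the helper name `make_packed_4d_mask` are my own assumptions:

```
import torch

def make_packed_4d_mask(seq_lengths, dtype=torch.float32):
    """Block-diagonal causal mask for sub-sequences packed into one sequence.

    seq_lengths: lengths of the packed sub-sequences, e.g. [5, 3, 8].
    Returns a (1, 1, total_len, total_len) additive float mask:
    0.0 where attention is allowed, dtype-min where it is blocked.
    """
    total_len = sum(seq_lengths)
    min_value = torch.finfo(dtype).min
    mask = torch.full((1, 1, total_len, total_len), min_value, dtype=dtype)
    start = 0
    for length in seq_lengths:
        end = start + length
        # causal (lower-triangular) block restricted to this sub-sequence
        block = mask[0, 0, start:end, start:end]
        block[torch.tril(torch.ones(length, length, dtype=torch.bool))] = 0.0
        start = end
    return mask

# three documents of length 5, 3 and 8 packed into one 16-token sequence
mask_4d = make_packed_4d_mask([5, 3, 8])
# outputs = model(input_ids=packed_ids, attention_mask=mask_4d.to(model.dtype))
```

With eager or SDPA attention something like this is, as far as I can tell, enough; Flash Attention 2 is a different story, as comes up further down the thread. The two GitHub issues quoted below are the closest related discussions I could find.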
GitHub issue, opened 19 Jul 2024 (labels: bug, Good Difficult Issue):
### System Info
transformers 4.41.0
### Who can help?
… @ArthurZucker
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
```
from transformers import LlamaForCausalLM, LlamaConfig, TrainingArguments, Trainer, AutoTokenizer
from datasets import IterableDataset
import numpy as np

model_config = LlamaConfig(
    vocab_size=10,
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1024,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(model_config)
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')

# "slow" dataset: per-example 4D attention masks of shape (1, 1024, 1024)
def get_data1():
    for i in range(10000):
        yield {'input_ids': np.zeros(1024, dtype=int), 'labels': np.zeros(1024, dtype=int), 'attention_mask': np.zeros((1, 1024, 1024), dtype=float)}

# "fast" dataset: regular 1D attention masks of shape (1024,)
def get_data2():
    for i in range(10000):
        yield {'input_ids': np.zeros(1024, dtype=int), 'labels': np.zeros(1024, dtype=int), 'attention_mask': np.zeros((1024,), dtype=int)}

ds_slow = IterableDataset.from_generator(get_data1).with_format('torch')
ds_fast = IterableDataset.from_generator(get_data2).with_format('torch')

training_args = TrainingArguments(max_steps=1, output_dir='./out', report_to=None, per_device_train_batch_size=32, gradient_accumulation_steps=32)
trainer1 = Trainer(model, training_args, train_dataset=ds_slow, tokenizer=tokenizer)
trainer2 = Trainer(model, training_args, train_dataset=ds_fast, tokenizer=tokenizer)

import cProfile
cProfile.run('trainer1.train()', './test_slow.profile')
cProfile.run('trainer2.train()', './test_fast.profile')
```
```
import pstats
# compare the two profiles
p1 = pstats.Stats('./test_slow.profile')
p2 = pstats.Stats('./test_fast.profile')
p1.sort_stats('cumtime').print_stats()
```
```
1582200 function calls (1401111 primitive calls) in 340.112 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 340.112 340.112 {built-in method builtins.exec}
1 0.000 0.000 340.112 340.112 <string>:1(<module>)
1 0.000 0.000 340.112 340.112 trainer.py:1784(train)
1 0.017 0.017 340.112 340.112 trainer.py:1892(_inner_training_loop)
33 0.001 0.000 326.171 9.884 data_loader.py:663(__iter__)
33 0.001 0.000 325.473 9.863 data_loader.py:618(_fetch_batches)
2486/265 0.001 0.000 325.428 1.228 {built-in method builtins.next}
33 0.001 0.000 325.088 9.851 dataloader.py:625(__next__)
33 0.725 0.022 325.083 9.851 dataloader.py:672(_next_data)
33 0.002 0.000 323.988 9.818 fetch.py:24(fetch)
33 0.000 0.000 320.979 9.727 trainer_utils.py:807(__call__)
33 0.000 0.000 320.971 9.726 data_collator.py:270(__call__)
33 16.982 0.515 320.971 9.726 data_collator.py:52(pad_without_fast_tokenizer_warning)
33 0.005 0.000 303.989 9.212 tokenization_utils_base.py:3209(pad)
6493 235.747 0.036 235.747 0.036 {built-in method torch.tensor}
197 0.001 0.000 234.735 1.192 tokenization_utils_base.py:204(__init__)
197 0.001 0.000 234.732 1.192 tokenization_utils_base.py:681(convert_to_tensors)
99 0.000 0.000 234.730 2.371 tokenization_utils_base.py:718(as_tensor)
```
```
p2.sort_stats('cumtime').print_stats()
```
```
1567440 function calls (1386340 primitive calls) in 16.431 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 16.431 16.431 {built-in method builtins.exec}
1 0.000 0.000 16.431 16.431 <string>:1(<module>)
1 0.000 0.000 16.431 16.431 trainer.py:1784(train)
1 0.018 0.018 16.431 16.431 trainer.py:1892(_inner_training_loop)
32 0.003 0.000 14.327 0.448 trainer.py:3212(training_step)
32 0.001 0.000 8.830 0.276 accelerator.py:2093(backward)
32 0.000 0.000 8.829 0.276 _tensor.py:433(backward)
32 0.000 0.000 8.829 0.276 __init__.py:149(backward)
32 8.827 0.276 8.827 0.276 {method 'run_backward' of 'torch._C._EngineBase' objects}
33 0.000 0.000 4.546 0.138 memory.py:147(empty_cache)
33 4.546 0.138 4.546 0.138 {built-in method torch._C._cuda_emptyCache}
2486/265 0.001 0.000 1.469 0.006 {built-in method builtins.next}
33 0.001 0.000 1.160 0.035 data_loader.py:663(__iter__)
33 0.000 0.000 1.145 0.035 data_loader.py:618(_fetch_batches)
33 0.000 0.000 1.136 0.034 dataloader.py:625(__next__)
33 0.003 0.000 1.134 0.034 dataloader.py:672(_next_data)
33 0.002 0.000 1.124 0.034 fetch.py:24(fetch)
32 0.000 0.000 0.955 0.030 trainer.py:3254(compute_loss)
...
1 0.000 0.000 0.000 0.000 modeling_utils.py:903(_
...
```
### Expected behavior
Since the profiler trace is really long, I only included the first few lines.
I am running a small Llama model on some dummy data; the only difference between the two datasets is that the slow one yields 4D attention masks, a feature recently added in #27539. I am running both trainers for 1 iteration.
As you can see, the slow run takes 340 s while the fast one finishes in 16 s.
The slow version of the trainer is many times slower than the fast version. The problem probably lies in the default collator `DataCollatorWithPadding` (used when there is a pretrained tokenizer), which calls `tokenizer.pad` on the 4D attention masks. When you take away either 1) the pretrained tokenizer or 2) the 4D attention mask, the trainer runs much faster.
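A possible workaround that follows from this analysis (a sketch, assuming each example is already padded/packed to a fixed length with its 4D mask precomputed, and with the name `collate_packed` made up for illustration): replace the default collator with one that simply stacks the pre-built tensors, so `tokenizer.pad` never touches the 4D masks.

```
import torch

def collate_packed(features):
    # Stack pre-built, equal-length tensors directly instead of routing them
    # through tokenizer.pad (which is what makes the 4D-mask case slow).
    return {
        key: torch.stack([torch.as_tensor(f[key]) for f in features])
        for key in features[0]
    }

# trainer = Trainer(model, training_args, train_dataset=ds_slow,
#                   data_collator=collate_packed)
```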
GitHub issue, opened 7 Mar 2024, closed 19 Mar 2024:
### System Info
Version 4.38.2 breaks code that uses custom 4D attention masks (introduced in #27539). Apparently, the custom mask gets replaced here: https://github.com/huggingface/transformers/blob/4ed9ae623d16876ad84ea89dfdf1c9378e36961b/src/transformers/models/llama/modeling_llama.py#L660-L662
The issue was introduced with #28937. It is unclear whether the relevant slow tests for 4d masks were run then, but they fail now:
```
RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestFP32
FAILED tests/test_modeling_utils.py::Mask4DTestFP32::test_attention - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/test_modeling_utils.py::Mask4DTestFP32::test_causal_model_logits - AssertionError: Tensor-likes are not close!
FAILED tests/test_modeling_utils.py::Mask4DTestFP32::test_inner_model - AssertionError: Tensor-likes are not close!
RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestFP16
FAILED tests/test_modeling_utils.py::Mask4DTestFP16::test_attention - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/test_modeling_utils.py::Mask4DTestFP16::test_causal_model_logits - AssertionError: Tensor-likes are not close!
```
Please fix or suggest a workaround.
summoning @ArthurZucker
cc @gante @younesbelkada
Indeed, thanks for sharing these issues. Since my version is quite recent (4.40.2), I don’t think the second one applies… For more context, I am seeing OOM errors of hundreds of gigabytes, which does not seem to make sense. I tried reducing the batch size and other memory-saving strategies without success. If I change my attention masks to regular ones, it works fine… which I don’t understand, since behind the scenes they should be converted to 4D anyway, based on what I have seen in the codebase.
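For what it’s worth, a back-of-the-envelope estimate (my own rough numbers assuming fp16, a 4k context and eager attention, not measurements) suggests that hundreds of gigabytes is not absurd once the per-head attention score matrices get materialized; the 4D mask itself is comparatively tiny:

```
# Rough memory estimate (assumed numbers: Phi-3-mini-4k ~32 layers, 32 heads,
# 4096-token context, fp16 = 2 bytes per element, batch size 8).
seq_len, n_heads, n_layers, bytes_per_el, batch = 4096, 32, 32, 2, 8

mask_gib = batch * 1 * seq_len * seq_len * bytes_per_el / 2**30
scores_gib_per_layer = batch * n_heads * seq_len * seq_len * bytes_per_el / 2**30

print(f"4D masks for the batch:      {mask_gib:.2f} GiB")                       # ~0.25 GiB
print(f"attention scores, one layer: {scores_gib_per_layer:.1f} GiB")           # ~8 GiB
print(f"scores across all layers:    {scores_gib_per_layer * n_layers:.0f} GiB") # ~256 GiB
```

Whether all of that is resident at once depends on the attention implementation and on gradient checkpointing, so treat this only as an order-of-magnitude sanity check.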
1 Like
OOM errors of hundreds of gigabytes, which does not seem to make sense
I don’t know if it’s your code, your model, or the HF library, but something is definitely wrong.
I’d say the HF library is the most likely culprit. The next most likely cause is the options you pass to it, such as the data type or device specification. Torch itself is the rarest case, since there shouldn’t be any major changes there at this point.
A wrong model structure alone could be enough to cause this, whereas it’s hard to think of an ordinary mistake in user code that would, unless it was intentional.
Articles are usually written against the GitHub version, so I think you should try `pip install git+https://github.com/huggingface/transformers`. It’s hit-or-miss, with frequent bugs, but there is no real harm unless it’s for commercial use. Also, in a GPU environment, the presence or absence of the accelerate library, its version, and its current bugs can have a big impact; I wouldn’t install the development version of that one, since I’d be wary of it.
Looks like it might be related to the Flash Attention 2 implementation, especially how it is implemented for Phi-3… When I switch to eager attention, training starts, but it is much less efficient. Is anyone familiar with how to adapt Flash Attention 2 to custom 4D masks?
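As far as I understand, the Flash Attention 2 path in transformers does not consume arbitrary 4D masks at all, so the usual alternative for packing is to drop the 4D mask, concatenate documents with no padding, and restart `position_ids` at each document boundary. I believe recent transformers versions ship a `DataCollatorWithFlattening` for this and are supposed to detect restarting `position_ids` and route through the varlen FA2 kernels, but please verify that against your installed version. A hand-rolled sketch of the idea, with the helper name `pack_with_position_ids` made up:

```
import torch

def pack_with_position_ids(docs):
    """docs: list of token-id lists. Concatenate them with no padding and
    restart position_ids at each document boundary."""
    input_ids, position_ids = [], []
    for doc in docs:
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))
    return torch.tensor([input_ids]), torch.tensor([position_ids])

# e.g. three documents packed into one sequence, no 4D attention_mask needed
input_ids, position_ids = pack_with_position_ids([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
# outputs = model(input_ids=input_ids, position_ids=position_ids)
```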
1 Like
Hi, I'm following the tutorial and installed all required libraries (including flash_attn), but I cannot load the model:
It would be faster to find the HF account of someone who seems to know more about this and send them a mention (@ + username). Or we can look for other ways to speed things up.
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.