I am trying to do packing with 4D attention masks with Phi-3-mini-4k-instruct, to restrict attention to each individual sequence within one packed sequence, but I always get OOM… Any advice on this? Could we get an example of usage?
What can I say…
I can hardly find any examples of this actually being used. I think library support for this kind of packing is still underdeveloped in general.
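About the closest thing I can offer is a rough sketch (my own, untested on Phi-3, and the exact convention may differ between transformers versions) of how a block-diagonal 4D mask for a packed sequence is usually built, following the additive convention from PR #27539: 0.0 where attention is allowed, the dtype minimum where it is blocked. The shape `(1, 1, seq_len, seq_len)` and the helper name `make_packed_4d_mask` are my own assumptions:

```
import torch

def make_packed_4d_mask(seq_lengths, dtype=torch.float32):
    """Block-diagonal causal mask for sub-sequences packed into one sequence.

    seq_lengths: lengths of the packed sub-sequences, e.g. [5, 3, 8].
    Returns a (1, 1, total_len, total_len) additive float mask:
    0.0 where attention is allowed, dtype-min where it is blocked.
    """
    total_len = sum(seq_lengths)
    min_value = torch.finfo(dtype).min
    mask = torch.full((1, 1, total_len, total_len), min_value, dtype=dtype)
    start = 0
    for length in seq_lengths:
        end = start + length
        # causal (lower-triangular) block restricted to this sub-sequence
        block = mask[0, 0, start:end, start:end]
        block[torch.tril(torch.ones(length, length, dtype=torch.bool))] = 0.0
        start = end
    return mask

# three documents of length 5, 3 and 8 packed into one 16-token sequence
mask_4d = make_packed_4d_mask([5, 3, 8])
# outputs = model(input_ids=packed_ids, attention_mask=mask_4d.to(model.dtype))
```

With eager or SDPA attention something like this is, as far as I can tell, enough; Flash Attention 2 is a different story, as comes up further down the thread. The two GitHub issues quoted below are the closest related discussions I could find.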
GitHub issue, opened 19 Jul 2024 (labels: bug, Good Difficult Issue):
### System Info
transformers 4.41.0
### Who can help?
… @ArthurZucker
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Reproduction
```
from transformers import LlamaForCausalLM, LlamaConfig, TrainingArguments, Trainer, AutoTokenizer
from datasets import IterableDataset
import numpy as np

model_config = LlamaConfig(
    vocab_size=10,
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1024,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(model_config)
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')

# "slow" dataset: per-example 4D attention masks of shape (1, 1024, 1024)
def get_data1():
    for i in range(10000):
        yield {'input_ids': np.zeros(1024, dtype=int), 'labels': np.zeros(1024, dtype=int), 'attention_mask': np.zeros((1, 1024, 1024), dtype=float)}

# "fast" dataset: regular 1D attention masks of shape (1024,)
def get_data2():
    for i in range(10000):
        yield {'input_ids': np.zeros(1024, dtype=int), 'labels': np.zeros(1024, dtype=int), 'attention_mask': np.zeros((1024,), dtype=int)}

ds_slow = IterableDataset.from_generator(get_data1).with_format('torch')
ds_fast = IterableDataset.from_generator(get_data2).with_format('torch')

training_args = TrainingArguments(max_steps=1, output_dir='./out', report_to=None, per_device_train_batch_size=32, gradient_accumulation_steps=32)
trainer1 = Trainer(model, training_args, train_dataset=ds_slow, tokenizer=tokenizer)
trainer2 = Trainer(model, training_args, train_dataset=ds_fast, tokenizer=tokenizer)

import cProfile
cProfile.run('trainer1.train()', './test_slow.profile')
cProfile.run('trainer2.train()', './test_fast.profile')
```
```
import pstats
# compare the two profiles
p1 = pstats.Stats('./test_slow.profile')
p2 = pstats.Stats('./test_fast.profile')
p1.sort_stats('cumtime').print_stats()
```
```
1582200 function calls (1401111 primitive calls) in 340.112 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 340.112 340.112 {built-in method builtins.exec}
1 0.000 0.000 340.112 340.112 <string>:1(<module>)
1 0.000 0.000 340.112 340.112 trainer.py:1784(train)
1 0.017 0.017 340.112 340.112 trainer.py:1892(_inner_training_loop)
33 0.001 0.000 326.171 9.884 data_loader.py:663(__iter__)
33 0.001 0.000 325.473 9.863 data_loader.py:618(_fetch_batches)
2486/265 0.001 0.000 325.428 1.228 {built-in method builtins.next}
33 0.001 0.000 325.088 9.851 dataloader.py:625(__next__)
33 0.725 0.022 325.083 9.851 dataloader.py:672(_next_data)
33 0.002 0.000 323.988 9.818 fetch.py:24(fetch)
33 0.000 0.000 320.979 9.727 trainer_utils.py:807(__call__)
33 0.000 0.000 320.971 9.726 data_collator.py:270(__call__)
33 16.982 0.515 320.971 9.726 data_collator.py:52(pad_without_fast_tokenizer_warning)
33 0.005 0.000 303.989 9.212 tokenization_utils_base.py:3209(pad)
6493 235.747 0.036 235.747 0.036 {built-in method torch.tensor}
197 0.001 0.000 234.735 1.192 tokenization_utils_base.py:204(__init__)
197 0.001 0.000 234.732 1.192 tokenization_utils_base.py:681(convert_to_tensors)
99 0.000 0.000 234.730 2.371 tokenization_utils_base.py:718(as_tensor)
```
```
p2.sort_stats('cumtime').print_stats()
```
```
1567440 function calls (1386340 primitive calls) in 16.431 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 16.431 16.431 {built-in method builtins.exec}
1 0.000 0.000 16.431 16.431 <string>:1(<module>)
1 0.000 0.000 16.431 16.431 trainer.py:1784(train)
1 0.018 0.018 16.431 16.431 trainer.py:1892(_inner_training_loop)
32 0.003 0.000 14.327 0.448 trainer.py:3212(training_step)
32 0.001 0.000 8.830 0.276 accelerator.py:2093(backward)
32 0.000 0.000 8.829 0.276 _tensor.py:433(backward)
32 0.000 0.000 8.829 0.276 __init__.py:149(backward)
32 8.827 0.276 8.827 0.276 {method 'run_backward' of 'torch._C._EngineBase' objects}
33 0.000 0.000 4.546 0.138 memory.py:147(empty_cache)
33 4.546 0.138 4.546 0.138 {built-in method torch._C._cuda_emptyCache}
2486/265 0.001 0.000 1.469 0.006 {built-in method builtins.next}
33 0.001 0.000 1.160 0.035 data_loader.py:663(__iter__)
33 0.000 0.000 1.145 0.035 data_loader.py:618(_fetch_batches)
33 0.000 0.000 1.136 0.034 dataloader.py:625(__next__)
33 0.003 0.000 1.134 0.034 dataloader.py:672(_next_data)
33 0.002 0.000 1.124 0.034 fetch.py:24(fetch)
32 0.000 0.000 0.955 0.030 trainer.py:3254(compute_loss)
...
1 0.000 0.000 0.000 0.000 modeling_utils.py:903(_
...
```
### Expected behavior
Since the profiler trace is really long, I only included the first few lines.
I am running a small Llama model on some dummy data; the only difference between the two datasets is that the slow one yields 4D attention masks, a feature recently added in #27539. I am running both trainers for 1 iteration.
As you can see, the slow run takes 340 s while the fast one finishes in 16 s.
The slow version of the trainer is many times slower than the fast version. The problem probably lies in the default collator `DataCollatorWithPadding` (used when there is a pretrained tokenizer), which calls `tokenizer.pad` on the 4D attention masks. When you take away either 1) the pretrained tokenizer or 2) the 4D attention mask, the trainer runs much faster.
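A possible workaround that follows from this analysis (a sketch, assuming each example is already padded/packed to a fixed length with its 4D mask precomputed, and with the name `collate_packed` made up for illustration): replace the default collator with one that simply stacks the pre-built tensors, so `tokenizer.pad` never touches the 4D masks.

```
import torch

def collate_packed(features):
    # Stack pre-built, equal-length tensors directly instead of routing them
    # through tokenizer.pad (which is what makes the 4D-mask case slow).
    return {
        key: torch.stack([torch.as_tensor(f[key]) for f in features])
        for key in features[0]
    }

# trainer = Trainer(model, training_args, train_dataset=ds_slow,
#                   data_collator=collate_packed)
```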
GitHub issue, opened 7 Mar 2024, closed 19 Mar 2024:
### System Info
Version 4.38.2 breaks code that uses custom 4D attention masks (introduced in #27539). Apparently, the custom mask gets replaced here: https://github.com/huggingface/transformers/blob/4ed9ae623d16876ad84ea89dfdf1c9378e36961b/src/transformers/models/llama/modeling_llama.py#L660-L662
The issue was introduced with #28937. It is unclear whether the relevant slow tests for 4d masks were run then, but they fail now:
```
RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestFP32
FAILED tests/test_modeling_utils.py::Mask4DTestFP32::test_attention - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/test_modeling_utils.py::Mask4DTestFP32::test_causal_model_logits - AssertionError: Tensor-likes are not close!
FAILED tests/test_modeling_utils.py::Mask4DTestFP32::test_inner_model - AssertionError: Tensor-likes are not close!
RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestFP16
FAILED tests/test_modeling_utils.py::Mask4DTestFP16::test_attention - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/test_modeling_utils.py::Mask4DTestFP16::test_causal_model_logits - AssertionError: Tensor-likes are not close!
```
Please fix or suggest a workaround.
summoning @ArthurZucker
cc @gante @younesbelkada
Indeed, thanks for sharing these issues. Since my version is quite recent (4.40.2), I don’t think the second one applies… For more context, I am seeing OOM errors of hundreds of gigabytes, which does not seem to make sense. I tried reducing the batch size and other memory-saving strategies without success. If I change my attention masks to regular ones, it works fine… which I don’t understand, since behind the scenes they should be converted to 4D anyway, based on what I have seen in the codebase.
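For what it’s worth, a back-of-the-envelope estimate (my own rough numbers assuming fp16, a 4k context and eager attention, not measurements) suggests that hundreds of gigabytes is not absurd once the per-head attention score matrices get materialized; the 4D mask itself is comparatively tiny:

```
# Rough memory estimate (assumed numbers: Phi-3-mini-4k ~32 layers, 32 heads,
# 4096-token context, fp16 = 2 bytes per element, batch size 8).
seq_len, n_heads, n_layers, bytes_per_el, batch = 4096, 32, 32, 2, 8

mask_gib = batch * 1 * seq_len * seq_len * bytes_per_el / 2**30
scores_gib_per_layer = batch * n_heads * seq_len * seq_len * bytes_per_el / 2**30

print(f"4D masks for the batch:      {mask_gib:.2f} GiB")                       # ~0.25 GiB
print(f"attention scores, one layer: {scores_gib_per_layer:.1f} GiB")           # ~8 GiB
print(f"scores across all layers:    {scores_gib_per_layer * n_layers:.0f} GiB") # ~256 GiB
```

Whether all of that is resident at once depends on the attention implementation and on gradient checkpointing, so treat this only as an order-of-magnitude sanity check.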
1 Like
OOM errors of hundreds of gigabytes, which does not seem to make sense
I don’t know if it’s your code, your model, or the HF library, but something is definitely wrong.
I’d say the HF library is the most likely culprit. The next most likely cause is the options you pass to it, such as the data type or device specification. Torch itself is the rarest case, since there shouldn’t be any major changes there at this point.
A wrong model structure alone could be enough to cause this, whereas it’s hard to think of an ordinary mistake in user code that would, unless it was intentional.
Articles are usually written against the GitHub version, so I think you should try `pip install git+https://github.com/huggingface/transformers`. It’s hit-or-miss, with frequent bugs, but there is no real harm unless it’s for commercial use. Also, in a GPU environment, the presence or absence of the accelerate library, its version, and its current bugs can have a big impact; I wouldn’t install the development version of that one, since I’d be wary of it.
Looks like it might be related to the Flash Attention 2 implementation, especially how it is implemented for Phi-3… When I switch to eager attention, training starts, but it is much less efficient. Is anyone familiar with how to adapt Flash Attention 2 to custom 4D masks?
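As far as I understand, the Flash Attention 2 path in transformers does not consume arbitrary 4D masks at all, so the usual alternative for packing is to drop the 4D mask, concatenate documents with no padding, and restart `position_ids` at each document boundary. I believe recent transformers versions ship a `DataCollatorWithFlattening` for this and are supposed to detect restarting `position_ids` and route through the varlen FA2 kernels, but please verify that against your installed version. A hand-rolled sketch of the idea, with the helper name `pack_with_position_ids` made up:

```
import torch

def pack_with_position_ids(docs):
    """docs: list of token-id lists. Concatenate them with no padding and
    restart position_ids at each document boundary."""
    input_ids, position_ids = [], []
    for doc in docs:
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))
    return torch.tensor([input_ids]), torch.tensor([position_ids])

# e.g. three documents packed into one sequence, no 4D attention_mask needed
input_ids, position_ids = pack_with_position_ids([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
# outputs = model(input_ids=input_ids, position_ids=position_ids)
```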
1 Like
Hi, I'm following the tutorial and installed all required libraries (including flash_attn), but I cannot load the model:
It would be faster to find the HF account of someone who seems to know more about this and send them a mention (@ + username). Or we can look for other ways to speed things up.
This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.