I’m running on a g5.24xlarge instance on EC2.
I’m using the transformers library and trying to fine-tune mpt-7b-instruct.
The training script starts up, downloads the model, loads my dataset, and then errors out in “building trainer” with:
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef4f1a35606f162a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Building trainer...
ERROR:composer.cli.launcher:Rank 1 crashed with exit code -7.
There’s no stack trace or anything; everything looks kosher until this error code. My google-fu hasn’t managed to turn up any indication of what exit code -7 might mean.
The stderr for that rank shows nothing that I would think indicates the error:
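The closest lead I have is the subprocess convention that a negative return code means the child was killed by that signal number, so -7 would be signal 7, which is SIGBUS on Linux/x86-64. A quick sanity check of that convention (this only decodes the code, it doesn’t explain why the rank got the signal):

```python
import os
import signal
import subprocess
import sys

# Spawn a child that kills itself with signal 7 (SIGBUS on Linux) and
# observe that subprocess reports the negative signal number, matching
# the -7 the composer launcher printed for the crashed rank.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"]
)
print(proc.returncode)                         # -7 on Linux
print(signal.Signals(-proc.returncode).name)   # SIGBUS
```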
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/a858cfabdc6bf69c03ce63236a5e877517bb957c/attention.py:153: UserWarning: While `attn_impl: triton` can be faster than `attn_impl: flash` it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.
warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:06<00:06, 6.96s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.35s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.74s/it]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef4f1a35606f162a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Map: 0%| | 0/2610 [00:00<?, ? examples/s]
Map: 4%|▍ | 112/2610 [00:00<00:02, 1101.55 examples/s]
Map: 9%|▊ | 227/2610 [00:00<00:02, 1126.25 examples/s]
Map: 15%|█▍ | 382/2610 [00:00<00:02, 1065.24 examples/s]
Map: 20%|██ | 533/2610 [00:00<00:02, 1035.03 examples/s]
Map: 25%|██▍ | 641/2610 [00:00<00:01, 1045.58 examples/s]
Map: 29%|██▉ | 751/2610 [00:00<00:01, 1057.60 examples/s]
Map: 33%|███▎ | 871/2610 [00:00<00:01, 1098.15 examples/s]
Map: 38%|███▊ | 1000/2610 [00:01<00:01, 888.27 examples/s]
Map: 44%|████▍ | 1144/2610 [00:01<00:01, 909.71 examples/s]
Map: 48%|████▊ | 1244/2610 [00:01<00:01, 928.00 examples/s]
Map: 53%|█████▎ | 1379/2610 [00:01<00:01, 910.39 examples/s]
Map: 57%|█████▋ | 1479/2610 [00:01<00:01, 928.15 examples/s]
Map: 60%|██████ | 1576/2610 [00:01<00:01, 938.17 examples/s]
Map: 64%|██████▍ | 1676/2610 [00:01<00:00, 953.67 examples/s]
Map: 68%|██████▊ | 1773/2610 [00:01<00:00, 954.62 examples/s]
Map: 73%|███████▎ | 1915/2610 [00:01<00:00, 950.31 examples/s]
Map: 78%|███████▊ | 2045/2610 [00:02<00:00, 827.57 examples/s]
Map: 82%|████████▏ | 2135/2610 [00:02<00:00, 839.42 examples/s]
Map: 87%|████████▋ | 2264/2610 [00:02<00:00, 843.83 examples/s]
Map: 90%|█████████ | 2352/2610 [00:02<00:00, 849.49 examples/s]
Map: 93%|█████████▎| 2440/2610 [00:02<00:00, 855.17 examples/s]
Map: 97%|█████████▋| 2544/2610 [00:02<00:00, 899.43 examples/s]
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-ef4f1a35606f162a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Map: 0%| | 0/315 [00:00<?, ? examples/s]
Map: 35%|███▌ | 111/315 [00:00<00:00, 1088.88 examples/s]
Map: 86%|████████▋ | 272/315 [00:00<00:00, 1069.73 examples/s]
----------End global rank 1 STDERR----------
I’m running the training script through the MosaicML llm-foundry composer wrapper, in case that matters.
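For what it’s worth, since signal 7 would be SIGBUS, the usual suspect I’ve seen mentioned for multi-worker training is an exhausted /dev/shm (Docker caps it at 64 MB by default). This is the check I’m planning next; the docker flag and size below are an assumption on my part, not a confirmed fix:

```shell
# SIGBUS in containerized multi-GPU training is often a too-small
# shared-memory mount. See how large /dev/shm actually is:
df -h /dev/shm

# If it is tiny and the training runs inside Docker, relaunching the
# container with a bigger shared-memory segment may help (the 16g value
# is a guess, not a verified fix):
#   docker run --shm-size=16g ...
```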