OOM error with multi-GPU training of Llama2-70B using QLora

amang1802 · October 15, 2023, 4:56am

I am trying to train the Llama2-70B model using 4-bit QLora on an 8xA100 80G instance. On a single GPU it runs comfortably, using < 50G of VRAM with a batch size of 2. I am also setting gradient_accumulation_steps = 4.
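
For context on the single-GPU figure, here is a rough back-of-the-envelope estimate of the weight storage alone. It is only a sketch: it assumes about 70e9 parameters and ignores quantization constants, any layers kept in higher precision, the LoRA adapters, optimizer state, and activations. The helper name is made up for illustration.

```python
def quantized_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts only the raw parameter bits; real usage is higher.
    """
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 70e9 parameters for Llama2-70B.
weights_70b_4bit = quantized_weight_gb(70e9, 4)
weights_70b_fp16 = quantized_weight_gb(70e9, 16)

print(f"4-bit weights: ~{weights_70b_4bit:.0f} GB")
print(f"fp16 weights:  ~{weights_70b_fp16:.0f} GB")
```

The 4-bit estimate is in line with the reported < 50G on one 80G card, which leaves the remainder for adapters, optimizer state, and activations.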

But when I run it on 8 GPUs, it consistently OOMs without completing a single step, even with per device batch size = 1. How do I debug this?
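
One possible starting point for debugging is to log memory per process at a few fixed points (after model load, after the first forward pass, after backward). The sketch below is an assumption about how that could be done, not something taken from the linked script: the function names are hypothetical, and it relies on the launcher setting the RANK environment variable, as torchrun-style launchers do.

```python
import os


def format_gib(num_bytes: int) -> str:
    """Render a byte count as GiB with two decimals."""
    return f"{num_bytes / 2**30:.2f} GiB"


def log_gpu_memory(tag: str) -> None:
    """Print current, peak, and reserved CUDA memory for this process."""
    import torch  # imported lazily; only needed on the training node

    rank = int(os.environ.get("RANK", "0"))
    device = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(device)
    peak = torch.cuda.max_memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(
        f"[rank {rank}] {tag}: device={device} "
        f"allocated={format_gib(allocated)} "
        f"peak={format_gib(peak)} "
        f"reserved={format_gib(reserved)}",
        flush=True,
    )
```

Calling log_gpu_memory("after model load") in every process shows whether the ranks start from the same baseline as the single-GPU run or whether some device is already carrying more than its share before the first step.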

Accelerate config: https://gist.github.com/amangup/3a5f80a541d8226ca2101389e8bf1805

Training script: https://gist.github.com/amangup/e49ca9fc042caa062eeb6c1355fcd6c9

I launch the script with: accelerate launch llama70_qlora_multigpu.py

amang1802 · October 17, 2023, 3:54am

So I did a few more things:

Ran the script on a 7B model, and the training completed. However, memory use was 24-28GB per GPU, compared to < 20GB for single-GPU training (with the same batch size).
Changed the precision from bf16 to fp16 (fp16 is the dtype defined in the config.json for the llama2 models). Surprisingly, it completed one step and then ran OOM in step 2. Not sure why that would affect memory usage.
I read this explanation of DDP. It looks like the extra memory usage (compared to single-GPU execution) is limited to the gradients of a single tensor from all GPUs, which shouldn’t result in as much extra memory usage as I’m observing!
I assumed that the memory overhead due to DDP is proportional to the degree of parallelism, so I ran it on only 4 GPUs. This time it failed on step 5 instead of step 2. So it sounds like the DDP overhead is pretty huge.

I wonder if the size of the reducer bucket is tunable.
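
On tunability: torch.nn.parallel.DistributedDataParallel accepts a bucket_cap_mb argument (default 25), and a gradient_as_bucket_view flag that lets gradients share storage with the communication buckets instead of being copied into them. DDP only buckets gradients of parameters that require grad, so with LoRA the extra copy should scale with the adapter size rather than the frozen base model. The sketch below puts a rough number on that; the adapter parameter count is a made-up placeholder, not a value from this thread.

```python
def ddp_gradient_overhead_mib(trainable_params: int, bytes_per_grad: int = 4) -> float:
    """Rough size (MiB) of one extra copy of the trainable gradients."""
    return trainable_params * bytes_per_grad / 2**20


# Hypothetical adapter size, for illustration only.
HYPOTHETICAL_LORA_PARAMS = 100_000_000
overhead_mib = ddp_gradient_overhead_mib(HYPOTHETICAL_LORA_PARAMS)
print(f"extra gradient copy: ~{overhead_mib:.0f} MiB")

# Keyword arguments that could be passed as
# torch.nn.parallel.DistributedDataParallel(model, **DDP_KWARGS).
DDP_KWARGS = {
    "bucket_cap_mb": 25,              # PyTorch default bucket size
    "gradient_as_bucket_view": True,  # avoid the separate gradient copy
}
```

Under these assumptions the bucket overhead is a few hundred MiB at most, which suggests the bucket size alone is unlikely to account for a gap of many GB per GPU.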

amang1802 · October 17, 2023, 6:30pm

So, I finally got this to work. I had turned gradient checkpointing off to deal with an error I was seeing. It turns out that tweaking a DDP param allows gradient checkpointing to work. Turning it back on brought down the memory requirement enough for this to succeed on all 8 GPUs.
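
The post does not say which DDP parameter was changed. A setting often adjusted when gradient checkpointing fails under DDP is find_unused_parameters, which the transformers Trainer exposes as ddp_find_unused_parameters. The sketch below assumes that is the relevant knob and that the script uses TrainingArguments; both are guesses, and the helper name is hypothetical. The batch size and accumulation steps are the ones mentioned earlier in the thread.

```python
# Assumed overrides; ddp_find_unused_parameters is a guess at the
# "ddp param" mentioned above, not confirmed by the post.
TRAINING_OVERRIDES = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,
    "ddp_find_unused_parameters": False,
}


def build_training_arguments(output_dir: str):
    """Create TrainingArguments with the overrides above applied."""
    from transformers import TrainingArguments  # imported lazily

    return TrainingArguments(output_dir=output_dir, **TRAINING_OVERRIDES)
```

With gradient checkpointing enabled, activations are recomputed during the backward pass instead of being stored, which is where the memory saving comes from.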
