DeepSpeed ZeRO causes intermittent GPU usage

Hi everyone,

I am training LLMs on an 8×A100-80GB DGX, and due to the context length I am bottlenecked by memory, i.e. I use a per-device batch size of 1 and gradient accumulation up to the target batch size. At this stage PEFT isn’t an option; it is full-parameter training.

So far I have been using gradient checkpointing to push the context length, at the expense of a decrease in training speed, as is to be expected.
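For reference, that baseline looks roughly like this through the Hugging Face Trainer (a minimal sketch; output_dir and the accumulation value are illustrative placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                 # illustrative placeholder
    per_device_train_batch_size=1,    # memory-bound, so micro batch size of 1
    gradient_accumulation_steps=32,   # illustrative; accumulate up to the target batch size
    gradient_checkpointing=True,      # trade recompute time for activation memory
    bf16=True,
)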

Next, I have been experimenting with DeepSpeed ZeRO-2/3 in the hope of reducing memory consumption with a smaller hit to training speed. In practice this has not been successful, so I am asking for advice on how I might refine my DeepSpeed settings, or what issue I may have overlooked, please…

I am observing that GPU utilization is intermittent when using DeepSpeed, whereas with gradient checkpointing alone it is consistently at 100%. I assume that CPU offloading is a key contributor to the GPU memory savings, by moving data back and forth to RAM, but the ZeRO implementation should optimize that by interleaving computation and communication in a way that keeps the GPUs busy… right?
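In case anyone wants to quantify the utilization pattern rather than eyeballing nvidia-smi, here is a minimal logging sketch using pynvml (the sampling interval and duration are arbitrary):

import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Print per-GPU utilization once per second for a minute.
for _ in range(60):
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(" | ".join(f"gpu{i}: {u:3d}%" for i, u in enumerate(utils)))
    time.sleep(1)

pynvml.nvmlShutdown()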

As a result, I can currently train faster and with a longer context length using gradient checkpointing than using DeepSpeed ZeRO… and ZeRO-3 is slower than ZeRO-2, presumably because it adds more communication overhead (?)

I would have expected the opposite, i.e. gradient checkpointing < ZeRO-2 < ZeRO-3 in terms of speed.

Here is my configuration, passed to the trainer argument deepspeed=dsc, with zero_stage in {2, 3}:

dsc = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": True,
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": zero_stage,
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        }
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True
    }
}
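For completeness, this is roughly how the dict is wired in (a minimal sketch; the model name, output_dir, and accumulation value are illustrative placeholders, and the "auto" entries in dsc are filled in from the TrainingArguments):

from transformers import AutoModelForCausalLM, TrainingArguments

# zero_stage (2 or 3) must be defined before the dsc dict above is built
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model for illustration

args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,    # placeholder; set to reach the target batch size
    bf16=True,
    deepspeed=dsc,                     # the "auto" entries in dsc are resolved from these arguments
)

The resulting args object is then passed to the Trainer together with the model and dataset.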

Do you spot anything I shall refine or other things to look into please?


I am not sure how to explain it, but I am finally getting much better results with DeepSpeed.

The recipe that worked for me:

- updating the config to the one I linked did not bring any improvement on its own
- adding fused AdamW brought some improvement
- adding gradient checkpointing on top made the overall training fantastic: no more intermittent GPU usage, greatly reduced memory consumption, and fast training given the context length
- this combo brought a significant improvement over my previous best setup, which was fused AdamW and gradient checkpointing alone (see the sketch after this list)
- doing the same with ZeRO-2 did not yield the same improvement
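For reference, the combo translates to roughly the following through the Trainer (a minimal sketch; output_dir and the accumulation value are placeholders, and dsc is the ZeRO-3 dict from my first post):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,    # placeholder; set to the target batch size
    bf16=True,
    gradient_checkpointing=True,       # the ingredient that removed the intermittent GPU usage for me
    optim="adamw_torch_fused",         # fused AdamW
    deepspeed=dsc,                     # the ZeRO-3 config dict
)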

I am not sure how transferable this is to, e.g., another model, but through trial and error I finally got something that works well for my training… good luck!
