I am not sure how to explain it, but I am finally getting much better results with DeepSpeed.
The recipe that worked for me:
- use the example configs provided in `tests/deepspeed` in the huggingface/transformers repository on GitHub (a sketch of what they look like is below)
- use fused AdamW (I am using the PyTorch implementation)
- use gradient checkpointing together with DeepSpeed ZeRO-3
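For reference, here is a minimal ZeRO-3 sketch in the spirit of those example configs, not my exact file, so adapt it to your setup. The `"auto"` values are filled in by the HF Trainer from `TrainingArguments`:

```python
# Minimal ZeRO-3 config sketch, modelled on the examples in tests/deepspeed.
# "auto" values are resolved by the HF Trainer from TrainingArguments.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```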
In more detail:
- updating the config to the one linked above did not bring any improvement on its own
- adding fused AdamW brought some improvement
- adding gradient checkpointing made the overall training fantastic: no more intermittent GPU usage, greatly reduced memory consumption, and fast training given the context length
- this combo brought a significant improvement over my best previous setup, which was only fused AdamW and gradient checkpointing (Trainer-side settings sketched below)
- doing the same with ZeRO-2 did not yield the same improvement
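For concreteness, the Trainer side of that combo looks roughly like this; the model name and dataset are placeholders, so treat it as a sketch of the relevant arguments rather than my exact script. Switching between ZeRO-3 and ZeRO-2 is just the `"stage"` field in the config above (ZeRO-2 does not use the `stage3_*` keys).

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder checkpoint; substitute your own model.
model = AutoModelForCausalLM.from_pretrained("your-model-checkpoint")

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    optim="adamw_torch_fused",     # fused PyTorch AdamW
    gradient_checkpointing=True,   # big memory saving for a bit more compute
    deepspeed=ds_config,           # the ZeRO-3 dict above, or a path to a JSON file
)

# train_dataset is a placeholder for your tokenized dataset.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```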
I am not sure how transferable this is to other models, but through trial and error I finally got something that works well for my training… good luck!