Questions about DeepSpeed multi-node training: sharding parameters only within a single 8-GPU machine

The HF documentation provides a detailed guide to ZeRO usage. ZeRO-2 partitions optimizer states and gradients across all GPUs (the full world size), which greatly slows down multi-node training: the more nodes we use, the slower it gets, because gradient reduce-scatter traffic has to cross the inter-node network. Is there a way to control the sharding degree so that parameters and gradients are partitioned only within a single 8-GPU machine, while plain data parallelism is used across machines?
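For context, this is roughly the ZeRO-2 config I am running now (a minimal sketch following the HF guide; the batch size and bucket values are illustrative, not my exact settings):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8
  }
}
```

I am aware that ZeRO++ has a `zero_hpz_partition_size` option, but as I understand it that restricts only the secondary weight partition under ZeRO-3; I have not found an equivalent knob that limits the stage-2 gradient/optimizer partitioning to the 8 ranks of one node.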