Questions about DeepSpeed multi-node training: sharding parameters only within a single 8-GPU machine

The HF documentation provides a detailed guide to ZeRO usage. ZeRO-2 partitions optimizer states and gradients across all GPUs (the full world size), which greatly slows down multi-node training: the more nodes we use, the slower it gets, because gradient reduce-scatter traffic has to cross the inter-node network. Is there a way to control the sharding degree so that parameters and gradients are partitioned only within a single 8-GPU machine, while plain data parallelism is used across machines?
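For context, this is roughly the ZeRO-2 config I am running now (a minimal sketch following the HF guide; the batch size and bucket values are illustrative, not my exact settings):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8
  }
}
```

I am aware that ZeRO++ has a `zero_hpz_partition_size` option, but as I understand it that restricts only the secondary weight partition under ZeRO-3; I have not found an equivalent knob that limits the stage-2 gradient/optimizer partitioning to the 8 ranks of one node.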