using --sharded_ddp "zero_dp_3 offload" gives AssertionError

I’m trying to run the following on a single, multi-gpu machine that has 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 \
--model_name_or_path roberta-base \
--use_fast_tokenizer \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --do_eval \
--num_train_epochs 5 \
--output_dir ./experiments/wikitext \
--fp16 \
--sharded_ddp "zero_dp_3 offload"

This fails with the following AssertionError:

Traceback (most recent call last):
  File "", line 492, in <module>
    main()
  File "", line 458, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/", line 1522, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/", line 1556, in compute_loss
    outputs = model(**inputs)
  File "/home/me/ve/lib/python3.6/site-packages/torch/nn/modules/", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/", line 902, in forward
    self._lazy_init()
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/", line 739, in _lazy_init
    self._init_param_attributes(p)
  File "/home/me/ve/lib/python3.6/site-packages/torch/autograd/", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/", line 796, in _init_param_attributes
    assert p._fp32_shard.device == torch.device("cpu")
AssertionError

If I omit the “offload” option from --sharded_ddp, it runs with no problems.

CUDA 11.0
PyTorch 1.7.1+cu110
Hugging Face Transformers 4.5.1

Has anyone successfully gotten this to work? Any help much appreciated!

Last time I checked, it was blocked by a bug on the fairscale side, but that yielded a different error message than this one. I will take a look this morning.

In any case, solving this first bug will only land you in the second one, so you should use deepspeed for ZeRO DP3 with offload.
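For reference, here is a rough sketch of what a ZeRO stage 3 config with CPU offload could look like, written as a small Python script that dumps the JSON file. The helper name build_zero3_offload_config and the file name ds_config_zero3.json are made up for illustration. The key names (offload_optimizer, offload_param) follow the current DeepSpeed config schema; older DeepSpeed releases used a cpu_offload flag instead, so check the docs for the version you have installed. Batch size and optimizer settings are left out on the assumption that the Trainer fills them in from the command-line arguments.

```python
import json


def build_zero3_offload_config(offload_to_cpu: bool = True) -> dict:
    """Build a minimal DeepSpeed config dict for ZeRO stage 3.

    Hypothetical helper for illustration; not part of transformers or deepspeed.
    """
    zero = {
        "stage": 3,                    # shard parameters, gradients and optimizer state
        "overlap_comm": True,          # overlap communication with computation
        "contiguous_gradients": True,  # reduce memory fragmentation
    }
    if offload_to_cpu:
        # Assumed key names from the current DeepSpeed schema;
        # older releases used "cpu_offload": true instead.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "fp16": {"enabled": True},     # matches the --fp16 flag used above
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # Write the config to disk so it can be passed to the training script.
    with open("ds_config_zero3.json", "w") as f:
        json.dump(build_zero3_offload_config(), f, indent=2)
```

The resulting file would then be passed to the training script with --deepspeed ds_config_zero3.json, launching with the deepspeed launcher instead of torch.distributed.launch.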

Thank you for the pointer (as well as in the bug report)!

Hello again.

After taking your advice, I tried on roberta-base under deepspeed using ZeRO-3 + cpu_offload. This is on an AWS p4d.24xlarge instance, so 8 x A100 GPUs.

Do you have any tips on tuning the deepspeed parameters so as to maximize GPU utilization? Right now, no matter how I tune the parameters, I cannot get the volatile GPU utilization (as reported by nvidia-smi) above ~50% on average. At first I thought it was due to cpu_offload causing communication stalls, but then I turned off cpu_offload and the GPUs were still very much underutilized. I’ve tuned the per-device batch size to just below OOM, so I thought maybe the batch size is still too small for the A100s to break a sweat! But…

… using fairscale and --sharded_ddp='zero_dp_3', I am able to max out the GPU utilization (and train almost 2x faster), even though I have a slightly smaller per-device batch size.
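To make the utilization numbers above comparable between the two setups, the per-GPU average can be sampled the same way for both runs. This is only a minimal sketch: it assumes nvidia-smi is on the PATH and prints one integer per GPU for the utilization.gpu query, and the function names are made up for illustration.

```python
import shutil
import subprocess
import time

# nvidia-smi query that prints one utilization percentage per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_utilization(output):
    """Turn nvidia-smi output (one integer per line) into a list of ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_mean_utilization(num_samples=30, interval_s=1.0):
    """Poll nvidia-smi num_samples times and return the mean utilization per GPU."""
    totals = None
    for _ in range(num_samples):
        result = subprocess.run(
            QUERY, stdout=subprocess.PIPE, universal_newlines=True, check=True
        )
        values = parse_utilization(result.stdout)
        if totals is None:
            totals = [0] * len(values)
        totals = [t + v for t, v in zip(totals, values)]
        time.sleep(interval_s)
    return [t / num_samples for t in totals]


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on the training machine")
    else:
        for gpu, mean in enumerate(sample_mean_utilization(num_samples=5)):
            print("GPU %d: %.1f%% average utilization" % (gpu, mean))
```

Run it in a second terminal while training is in progress, once under deepspeed and once under fairscale, with the same sampling window.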

I should note that I’m using deepspeed not so much for training a big model (roberta-base is not that big) but rather to try to jam large batch sizes onto the GPUs to accelerate training.

Any tips would be greatly appreciated!