run_mlm.py with --sharded_ddp "zero_dp_3 offload" gives AssertionError

I’m trying to run the following on a single multi-GPU machine with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node=8 \
  run_mlm.py \
  --model_name_or_path roberta-base \
  --use_fast_tokenizer \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --do_eval \
  --num_train_epochs 5 \
  --output_dir ./experiments/wikitext \
  --fp16 \
  --sharded_ddp "zero_dp_3 offload"

This fails with the following AssertionError:

Traceback (most recent call last):
  File "run_mlm.py", line 492, in <module>
    main()
  File "run_mlm.py", line 458, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1522, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1556, in compute_loss
    outputs = model(**inputs)
  File "/home/me/ve/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 902, in forward
    self._lazy_init()
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 739, in _lazy_init
    self._init_param_attributes(p)
  File "/home/me/ve/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 796, in _init_param_attributes
    assert p._fp32_shard.device == torch.device("cpu")
AssertionError

If I omit the “offload” option to --sharded_ddp, it runs with no problems.

CUDA 11.0
PyTorch 1.7.1+cu110
transformers 4.5.1

Has anyone successfully gotten this to work? Any help much appreciated!

Last time I checked, it was blocked by a bug on the fairscale side, but that yielded a different error message than this one. I’ll take a look this morning.

In any case, solving this first bug will only get you to a second one, so for ZeRO DP-3 with offload you should use DeepSpeed instead.
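With the Trainer integration, that means launching through the deepspeed launcher and passing a ZeRO-3 config file via --deepspeed. A minimal sketch, assuming a recent DeepSpeed config schema (the offload key names have changed across DeepSpeed versions, and ds_config_zero3.json is just a file name I picked), would be:

# write a minimal ZeRO-3 + CPU offload config (file name is arbitrary)
cat > ds_config_zero3.json <<'EOF'
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF

# launch with the deepspeed launcher instead of torch.distributed.launch
deepspeed --num_gpus=8 run_mlm.py \
  --model_name_or_path roberta-base \
  --use_fast_tokenizer \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --do_eval \
  --num_train_epochs 5 \
  --output_dir ./experiments/wikitext \
  --fp16 \
  --deepspeed ds_config_zero3.json

If I remember correctly, the integration derives the batch-size-related fields from the Trainer arguments when they are absent from the config, so it can stay this small; the DeepSpeed section of the Trainer docs lists the full set of options.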

Thank you for the pointer (here as well as in the bug report)!

Hello again.

After taking your advice, I tried run_mlm.py on roberta-base under deepspeed using ZeRO-3 + cpu_offload. This is on an AWS p4d.24xlarge instance, so 8 x A100 GPUs.

Do you have any tips on tuning the deepspeed parameters so as to maximize GPU utilization? Right now, no matter how I tune the parameters, I cannot get the volatile GPU utilization (as reported by nvidia-smi) above ~50% on average. At first I thought it was due to cpu_offload causing communication stalls, but then I turned off cpu_offload and the GPUs were still very much underutilized. I’ve tuned the per-device batch size to just below OOM, so I thought maybe the batch size was still too small to make the A100s break a sweat! But…

… using fairscale and --sharded_ddp=‘zero_dp_3’, I am able to max out the GPU utilization (and train almost 2x faster), even though I have a slightly smaller per-device batch size.
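For reference, that fairscale run is just my original launch command with “offload” dropped from --sharded_ddp (batch-size flags omitted, since those are what I’ve been varying):

python -m torch.distributed.launch --nproc_per_node=8 \
  run_mlm.py \
  --model_name_or_path roberta-base \
  --use_fast_tokenizer \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --do_eval \
  --num_train_epochs 5 \
  --output_dir ./experiments/wikitext \
  --fp16 \
  --sharded_ddp zero_dp_3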

I should note that I’m using deepspeed not so much for training a big model (roberta-base is not that big) but rather to try to jam large batch sizes onto the GPUs to accelerate training.
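In case it helps anyone suggest concrete settings: the ZeRO-3 throughput knobs I’m aware of are the bucket/prefetch sizes and the live-parameter limits, i.e. a zero_optimization block along these lines (the numbers are placeholders, not tuned recommendations):

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  }
}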

Any tips would be greatly appreciated!