I’m trying to run the following on a single multi-GPU machine with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 \
    run_mlm.py \
    --model_name_or_path roberta-base \
    --use_fast_tokenizer \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train --do_eval \
    --num_train_epochs 5 \
    --output_dir ./experiments/wikitext \
    --fp16 \
    --sharded_ddp "zero_dp_3 offload"
This fails with the following AssertionError:
Traceback (most recent call last):
  File "run_mlm.py", line 492, in <module>
    main()
  File "run_mlm.py", line 458, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1522, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/me/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1556, in compute_loss
    outputs = model(**inputs)
  File "/home/me/ve/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 902, in forward
    self._lazy_init()
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 739, in _lazy_init
    self._init_param_attributes(p)
  File "/home/me/ve/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/me/ve/lib/python3.6/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 796, in _init_param_attributes
    assert p._fp32_shard.device == torch.device("cpu")
AssertionError
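As far as I can tell from reading fairscale's fully_sharded_data_parallel.py, the assert checks that, with CPU offload enabled, the fp32 parameter shards are still on the CPU when FSDP lazily initializes them on the first forward pass. Below is a minimal sketch of that constraint, launched the same way; it is not the Trainer's actual code path, and the mixed_precision/cpu_offload keyword names are taken from the fairscale release I have installed (newer releases seem to rename cpu_offload to move_params_to_cpu):

# fsdp_offload_sketch.py -- minimal sketch of the constraint behind the assert above.
# Launch with: python -m torch.distributed.launch --nproc_per_node=8 fsdp_offload_sketch.py
# Assumes fairscale's FullyShardedDataParallel accepts mixed_precision/cpu_offload
# keyword arguments (newer fairscale releases rename cpu_offload to move_params_to_cpu).
import argparse

import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # env:// rendezvous set up by the launcher

# The module stays on the CPU: with offload enabled, the fp32 shards are expected to
# live on the CPU, and only fp16 params are materialized on the GPU for each forward.
model = torch.nn.Linear(1024, 1024)

fsdp_model = FSDP(model, mixed_precision=True, cpu_offload=True)
# fsdp_model = fsdp_model.cuda()  # moving the wrapped model to the GPU afterwards
#                                 # appears to put the fp32 shards on the GPU and
#                                 # reproduce the AssertionError above

x = torch.randn(4, 1024, device=torch.device("cuda", args.local_rank), dtype=torch.float16)
out = fsdp_model(x)  # the first forward triggers _lazy_init / _init_param_attributes
print(args.local_rank, out.shape)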
If I omit the “offload” option to --sharded_ddp, it runs without problems.
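For reference, this is the exact variant that runs cleanly for me, identical except for dropping offload from --sharded_ddp:

python -m torch.distributed.launch --nproc_per_node=8 \
    run_mlm.py \
    --model_name_or_path roberta-base \
    --use_fast_tokenizer \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train --do_eval \
    --num_train_epochs 5 \
    --output_dir ./experiments/wikitext \
    --fp16 \
    --sharded_ddp zero_dp_3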
Environment:
CUDA 11.0
PyTorch 1.7.1+cu110
transformers 4.5.1
Has anyone successfully gotten this to work? Any help much appreciated!