How to run single-node, multi-GPU training with HF Trainer?

Hi,

I want to run a Trainer-based script in a single-node, multi-GPU setting.
Do I need to launch the script with a distributed launcher (torch.distributed, torchX, torchrun, Ray Train, PTL, etc.), or can the HF Trainer alone use multiple GPUs without being launched by a third-party launcher?


See the documentation on running scripts. 🙂
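In short (my own summary; double-check against the docs for your version): if you just run python my_script.py on a multi-GPU machine, the Trainer wraps the model in torch.nn.DataParallel across all visible GPUs; for DistributedDataParallel, one process per GPU, you launch it yourself:

# No launcher needed – Trainer falls back to DataParallel across visible GPUs:
python my_script.py

# DistributedDataParallel – one process per GPU (4 here, as an example):
torchrun --nproc_per_node=4 my_script.py
# or, equivalently:
accelerate launch --num_processes=4 my_script.py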

I think the docs are insufficient. See my questions here: Using Transformers with DistributedDataParallel — any examples?


My impression is that HF has lots of video tutorials, but none of them covers multi-GPU training with the Trainer (presumably because it is assumed to be simple). The key element is missing from the docs: the command to launch the trainer script, which is really hard to find. So the easiest API is made hard by failing to mention this command, which I finally found in one of the forum threads.
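For anyone else searching, the command is of this general form (script name and GPU count are placeholders for your own):

torchrun --nproc_per_node=NUM_GPUS your_script.py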


I have a script that uses HF Trainer and works fine when I run it.

But if I run the multi-GPU training command torchrun --nproc_per_node 4 my_script.py, I get an error:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/jpiabrantes/rosetta/fine_tune_coder.py", line 128, in <module>
[rank1]:     main()
[rank1]:   File "/home/jpiabrantes/rosetta/fine_tune_coder.py", line 103, in main
[rank1]:     training_args = TrainingArguments(
[rank1]:   File "<string>", line 127, in __init__
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/transformers/training_args.py", line 1630, in __post_init__
[rank1]:     and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/transformers/training_args.py", line 2131, in device
[rank1]:     return self._setup_devices
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 59, in __get__
[rank1]:     cached = self.fget(obj)
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/transformers/training_args.py", line 2063, in _setup_devices
[rank1]:     self.distributed_state = PartialState(
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/accelerate/state.py", line 278, in __init__
[rank1]:     self.set_device()
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/accelerate/state.py", line 786, in set_device
[rank1]:     torch.cuda.set_device(self.device)
[rank1]:   File "/home/jpiabrantes/rosetta/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
[rank1]:     torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank2] reports an identical traceback (only the rank label differs).
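(Note for anyone hitting the same thing: "invalid device ordinal" means a worker tried to select a GPU index that does not exist, which typically happens when --nproc_per_node exceeds the number of visible GPUs, or when CUDA_VISIBLE_DEVICES hides some of them. A quick sanity check:)

import torch

# torchrun assigns each worker a LOCAL_RANK in [0, nproc_per_node); every
# one of those indices must map to a visible CUDA device.
print("visible GPUs:", torch.cuda.device_count())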

Hello, I also encounter this problem when I load the model with AutoModelForCausalLM.from_pretrained("xxx", device_map="auto"). It shards my model across the devices. I set up the Trainer as follows:

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16 per device
    logging_steps=10,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,  # loaded with device_map="auto" as above
    args=args,
    tokenizer=tokenizer,
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)

trainer.train()

It reports the same error as yours ☹️
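A likely cause, as far as I can tell: device_map="auto" splits a single copy of the model across all GPUs, while torchrun also starts one process per GPU, so each process tries to claim devices it does not own. A minimal sketch of the usual workaround, assuming accelerate is installed ("xxx" is the placeholder model name from the post above): give each process one full copy of the model on its own GPU instead:

from accelerate import PartialState
from transformers import AutoModelForCausalLM

# One full copy of the model per DDP process, pinned to that process's GPU,
# rather than sharding a single copy across all GPUs with device_map="auto".
model = AutoModelForCausalLM.from_pretrained(
    "xxx",  # placeholder model name
    device_map={"": PartialState().process_index},
)

If the model does not fit on a single GPU, this won't help; then you would look at a sharded training setup such as FSDP or DeepSpeed through accelerate rather than device_map="auto".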