How to run single-node, multi-GPU training with HF Trainer?

Hi,

I want to run Trainer-based training scripts in a single-node, multi-GPU setting.
Do I need to launch the script with a torch launcher (torch.distributed, TorchX, torchrun, Ray Train, PyTorch Lightning, etc.), or can the HF Trainer use multiple GPUs on its own, without being launched by a third-party distributed launcher?
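For context, here is roughly what I mean by the two options; the script name train.py and the GPU count of 4 are just placeholders for my setup:

```bash
# Option 1: plain launch, no separate distributed launcher
python train.py

# Option 2: launched through a torch launcher such as torchrun,
# which starts one process per GPU
torchrun --nproc_per_node=4 train.py
```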


See the documentation on running scripts. :slight_smile:

I think the docs are insufficient. See my questions here: Using Transformers with DistributedDataParallel — any examples?


My impression of the HF Trainer is that HF has lots of video tutorials, but none of them covers multi-GPU training with the Trainer (presumably because it is assumed to be simple). Meanwhile the key element is missing from the docs: the command used to launch the Trainer script, which is really hard to find. So the easiest API is made hard to use by the omission of this launch command, which I finally found in one of the forum threads.
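For anyone who lands here later, the launch command I was missing looks roughly like the sketch below. It assumes a single node with 4 GPUs and a Trainer script named train.py; adjust --nproc_per_node and the script name/arguments to your own setup. As far as I understand, launching with plain python instead makes the Trainer fall back to torch.nn.DataParallel over all visible GPUs, while the commands below give you DistributedDataParallel:

```bash
# torchrun starts one process per GPU; the Trainer picks up the
# distributed environment it sets, so the training script itself
# needs no multi-GPU-specific code.
torchrun --nproc_per_node=4 train.py

# Older equivalent, deprecated in recent PyTorch releases:
python -m torch.distributed.launch --nproc_per_node=4 train.py
```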
