Distributed training

Why is the training argument n_gpu set to 1 when using the Trainer's distributed training?
Does that mean only one device is allowed per node?
The printed training parameters are also calculated with n_gpu=1.
What should I do when every node has multiple GPUs?

The script is https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py
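
To make my setup concrete, here is a minimal sketch of what I am checking on each worker. I am assuming the script is launched with torchrun (`--nnodes=2 --nproc_per_node=8`), i.e. one process per GPU; the attribute names below (`n_gpu`, `world_size`) are just what I am reading off `TrainingArguments`:

```python
# A minimal sketch of what I check on each worker process -- assuming the script
# is launched with torchrun --nnodes=2 --nproc_per_node=8 (one process per GPU).
from transformers import TrainingArguments

args = TrainingArguments(output_dir="tmp_out", per_device_train_batch_size=32)

# Each worker process is pinned to a single GPU, so n_gpu is reported as 1.
print("n_gpu per process:", args.n_gpu)    # prints 1 under torchrun
# The total number of worker processes across both nodes is the world size.
print("world size:", args.world_size)      # I would expect 16 (2 nodes x 8 GPUs)
```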
Here are the training parameters that were printed:


Actually, I have two nodes, each with 8 GPUs, so the total batch size should be 32 * 8 * 2 = 512, not the 32 * 2 = 64 that results from counting n_gpu=1 per node.
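
Put differently, this is the arithmetic I expect for the effective global batch size (a rough sketch, assuming per_device_train_batch_size=32 and gradient_accumulation_steps=1):

```python
# Rough batch-size arithmetic for my setup (assumptions: 32 samples per device,
# gradient_accumulation_steps=1, 2 nodes with 8 GPUs each).
per_device_train_batch_size = 32
gpus_per_node = 8
num_nodes = 2
gradient_accumulation_steps = 1

world_size = gpus_per_node * num_nodes                  # 16 processes in total
effective_batch_size = (per_device_train_batch_size
                        * world_size
                        * gradient_accumulation_steps)
print(effective_batch_size)                             # 512, not 64
```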