Does the Trainer use DDP or DP? If it uses DDP, why is the GPU memory consumed on cuda:0 much larger than on the other cards when I train with multiple GPUs? Also, if I increase per_device_train_batch_size and cuda:0 runs out of memory, will the Trainer shard the model parameters onto the other cards by itself, or do I need to set some parameters? An example would help.
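For reference, here is a minimal sketch of the kind of script this question is about. The model name, dataset, and hyperparameters are placeholders, not my actual setup; the only parameter that matters for the question is per_device_train_batch_size and how the script is launched.

```python
# Minimal sketch of a multi-GPU Trainer run (placeholder model/dataset).
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset, tokenized up front so the Trainer can batch it directly.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,  # the parameter in question
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()

# Launched as `python train.py` with several visible GPUs, the Trainer falls
# back to DataParallel (DP), which keeps extra state on cuda:0; launched as
# `torchrun --nproc_per_node=4 train.py`, it runs DistributedDataParallel (DDP)
# with one process per GPU.
```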