Regarding the argument per_device_train_batch_size in SFTTrainer: is it named this way because, by default, if I have more than one GPU, SFTTrainer will do data parallelism (as opposed to sharding a single model and distributing it across multiple GPUs)? If so, with n GPUs and per_device_train_batch_size=4, the total amount of data used per round of backprop should be 4n. But if that were the case, the memory occupied on each GPU should be the same regardless of how many GPUs I have, and that is not what I observed…
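To make the arithmetic in my question concrete, here is a minimal sketch of what I understand the effective batch size to be under plain data parallelism (the num_gpus and gradient_accumulation_steps values below are just hypothetical examples, not anything reported by the trainer):

```python
# Sketch of the effective batch size under the data-parallel assumption,
# where each GPU gets its own replica of the model and its own micro-batch.
per_device_train_batch_size = 4   # samples processed on each GPU per step
num_gpus = 2                      # hypothetical number of visible GPUs (n)
gradient_accumulation_steps = 1   # assumed default, no accumulation

# Total samples contributing to one optimizer step across all replicas.
effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # -> 8, i.e. 4 * n samples per optimizer step
```

Under this picture each GPU should hold a full model copy plus its own 4-sample micro-batch, which is why I expected per-GPU memory to stay roughly constant as n grows.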