Regarding the argument per_device_train_batch_size in SFTTrainer: is it named this way because, by default, if I have more than one GPU, SFTTrainer will do data parallelism (as opposed to sharding a single model and distributing it across multiple GPUs)? If so, with n GPUs and per_device_train_batch_size=4, the total amount of data used per round of backprop should be 4n. But if that were the case, the memory occupied on each GPU should be the same regardless of how many GPUs I have, and that is not what I observed…
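To make the arithmetic in my question concrete, here is a minimal sketch of what I understand the effective batch size to be under plain data parallelism (the num_gpus and gradient_accumulation_steps values below are just hypothetical examples, not anything reported by the trainer):

```python
# Sketch of the effective batch size under the data-parallel assumption,
# where each GPU gets its own replica of the model and its own micro-batch.
per_device_train_batch_size = 4   # samples processed on each GPU per step
num_gpus = 2                      # hypothetical number of visible GPUs (n)
gradient_accumulation_steps = 1   # assumed default, no accumulation

# Total samples contributing to one optimizer step across all replicas.
effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # -> 8, i.e. 4 * n samples per optimizer step
```

Under this picture each GPU should hold a full model copy plus its own 4-sample micro-batch, which is why I expected per-GPU memory to stay roughly constant as n grows.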