Hi!
I had the same question, so I started looking for some information. If you check the Quick tour documentation page for versions before v0.28.0 (e.g., https://huggingface.co/docs/accelerate/v0.27.0/en/quicktour), you will see a warning about the batch size:
> The actual batch size for your training will be the number of devices used multiplied by the batch size you set in your script. For instance, training on 4 GPUs with a batch size of 16 set when creating the training dataloader will train at an actual batch size of 64 (4 * 16). If you want the batch size to remain the same regardless of how many GPUs the script is run on, you can use the option split_batches=True when creating and initializing Accelerator. Your training dataloader may change length when going through this method: if you run on X GPUs, it will have its length divided by X (since your actual batch size will be multiplied by X), unless you set split_batches=True.
And if you check the same page but for versions before v0.24.0 you would only see:
> The actual batch size for your training will be the number of devices used multiplied by the batch size you set in your script: for instance training on 4 GPUs with a batch size of 16 set when creating the training dataloader will train at an actual batch size of 64.
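To make that quoted behaviour concrete, here is a minimal sketch (my own toy example, not from the docs; the dataset and batch size of 16 are placeholders). Launched with `accelerate launch` on 4 processes it illustrates the multiplication and the dataloader length division described above:

```python
# Minimal sketch (my own toy example) of the behaviour described in the
# quoted warning. Launch with e.g. `accelerate launch` on 4 processes to
# see the effect; the dataset and batch size are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

# Default behaviour: every process draws its own batches of 16, so on
# 4 GPUs the observed/actual batch size is 4 * 16 = 64, and the prepared
# dataloader's length is divided by 4.
accelerator = Accelerator()
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=16))
print(accelerator.num_processes, len(dataloader))

# With split_batches=True (the option mentioned in the quoted docs), each
# batch of 16 is instead split across the processes, so the observed batch
# size stays 16. I believe this option has moved in newer releases, so
# treat this line as version dependent:
# accelerator = Accelerator(split_batches=True)
```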
So, before v0.28.0 it seems that you had to take this multiplication into account to calculate your actual batch size, unless you were using `with accelerator.accumulate()`. In recent versions I think you still have to do this operation (https://huggingface.co/docs/accelerate/v0.29.3/en/concept_guides/performance#observed-batch-sizes). However, if you are also using gradient accumulation, you can check "Performing gradient accumulation with 🤗 Accelerate" to see which case applies to your code. If you're not using `with accelerator.accumulate()`, I think your actual batch size is 3, because you're not using `if (index+1) % gradient_accumulation_steps == 0:` as it is done on that page, at least with the code you provided.

I'd like someone to tell me if I'm wrong, to clarify the current status of HF Accelerate, since the documentation about the actual batch size was updated in v0.28.0. As far as I understand the current documentation (example: assuming 8 processes, 1 GPU each, and `batch_size=64`; see the sketch after the list for the two loop structures I mean):
- If you're using `Accelerator(gradient_accumulation_steps=1)` with `accelerator.accumulate()`, then the actual batch size is 64 * 8 = 512.
- If you're using `Accelerator(gradient_accumulation_steps=2)` with `accelerator.accumulate()`, then the actual batch size is 64 * 8 * 2 = 1024.
- If you're using `Accelerator(gradient_accumulation_steps=1)` without `accelerator.accumulate()`, then the actual batch size is 64.
- If you're using `Accelerator(gradient_accumulation_steps=1)` with a manual `if (index+1) % gradient_accumulation_steps == 0: update_optimizer()`, then the actual batch size is 64 * 8.
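For reference, this is a minimal sketch of the two loop structures the list refers to: the `accelerator.accumulate()` context manager versus a manual `(index + 1) % gradient_accumulation_steps` check. The toy model, optimizer, and data are placeholders I made up, and the per-device batch size of 64 matches the example above; it only shows the shape of the two loops, not a recommendation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

gradient_accumulation_steps = 2
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(4096, 10), torch.randn(4096, 1))
dataloader = DataLoader(dataset, batch_size=64)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Pattern A: let Accelerate handle accumulation. The optimizer only really
# steps every `gradient_accumulation_steps` batches, so the effective batch
# size is 64 * num_processes * gradient_accumulation_steps.
for inputs, targets in dataloader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

# Pattern B: manual accumulation, similar to the performance guide: you only
# step the optimizer every `gradient_accumulation_steps` batches yourself.
for index, (inputs, targets) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```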
I've seen that @muellerzr and @marcsun13 are frequent posters. I'd appreciate it if one of you could shed some light on this "actual" batch size and indicate whether the examples above are correct.
Thanks in advance!