Gradient accumulation: should I duplicate data?


I am using gradient accumulation to simulate bigger batches when fine-tuning. However, I remember seeing some notebooks in the documentation that make N copies of the data, where N is the number of gradient accumulation steps. I do not understand why this should be done. Is this good practice? Why?

Thank you!

Could you link to the exact notebook where you have seen this?

Hey @BramVanroy, thank you for your reply. I have found the notebook, sorry for not being very precise. It’s "Reformer - Pushing the Limits of Language Modeling". Around cell 7 it says:

We then expand the same sample to 8 training samples so that we can accumulate gradients during training.

In the code

  # duplicate data 8 times to have 8 examples in dataset
  for key in input_ids_dict.keys():
    input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]

And the number of gradient accumulation steps is actually 4, not 8 as I would have expected, with a batch size of 1.
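To make the numbers concrete, here is a minimal sketch of what the duplication line does and how many optimizer updates it would give per epoch. It assumes the dict holds a single tokenized example; the token ids and the variable names `num_examples`, `batch_size`, `batches_per_epoch` and `updates_per_epoch` are made up for illustration and are not taken from the notebook.

```python
# Hypothetical tokenizer output holding a single example (values are made up).
input_ids_dict = {
    "input_ids": [[101, 7, 8, 102]],
    "attention_mask": [[1, 1, 1, 1]],
}

# Same expression as in the notebook: repeat the first (only) example 8 times.
for key in input_ids_dict.keys():
    input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]

num_examples = len(input_ids_dict["input_ids"])
batch_size = 1
gradient_accumulation_steps = 4

# One optimizer update happens every `gradient_accumulation_steps` batches.
batches_per_epoch = num_examples // batch_size
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

print(num_examples, updates_per_epoch)
```

Under these assumptions the dataset ends up with 8 identical examples, so with batch size 1 and 4 accumulation steps there would be 2 optimizer updates per epoch.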

Hm, good question. Tagging @patrickvonplaten who created that notebook.

Yes, it would be helpful to have an update from you, @patrickvonplaten.

In principle, gradient accumulation has nothing to do with the data. It basically keeps the gradients of a few consecutive batches in memory, sums them, and only then performs the weight update, which has the same effect as using a larger batch size.
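As a rough illustration of that point, the sketch below compares the gradient of one large batch with the gradient accumulated over single-example batches, using distinct examples and no duplication. The toy model (`y_hat = w * x` with a mean squared error loss), the data, and the helper `grad_mse` are assumptions chosen for the example. This is not how any particular training library implements it.

```python
# Toy model: y_hat = w * x, loss = mean squared error over the batch.
def grad_mse(w, batch):
    """Gradient of the mean squared error w.r.t. w for a list of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5
accumulation_steps = 4  # four micro-batches of size 1

# Reference: gradient of one large batch containing all four examples.
full_batch_grad = grad_mse(w, data)

# Accumulation: one example at a time, each gradient scaled by 1/steps and
# summed into a buffer; the weight would be updated only after the loop.
accumulated_grad = 0.0
for example in data:
    accumulated_grad += grad_mse(w, [example]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The two printed values agree, which is why accumulation by itself does not require copying the data: it only needs enough batches to accumulate over.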