Gradient accumulation: should I duplicate data?

marcoabrate · January 13, 2021, 10:18am

Hello!

I am using gradient accumulation to simulate bigger batches when fine-tuning. However, I remember seeing some notebooks in the documentation that make N copies of the data, where N is the number of gradient accumulation steps. I do not understand why this should be done. Is this good practice? Why?

Thank you

BramVanroy · January 13, 2021, 11:05am

Could you link to the exact notebook where you have seen this?

marcoabrate · January 13, 2021, 11:25am

Hey @BramVanroy, thank you for your reply. I have found the notebook, sorry for not being very precise. It’s Reformer - Pushing the Limits of Language Modeling. Around box 7 it says:

We then expand the same sample to 8 training samples so that we can accumulate gradients during training.

In the code

  # duplicate data 8 times to have 8 examples in dataset
  for key in input_ids_dict.keys():
    input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]

And the number of gradient accumulation steps is actually 4, not 8 as I would expect, with a batch size of 1.
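
To make the effect of that expression concrete, here is a minimal sketch of what it evaluates to. The contents of `input_ids_dict` below are hypothetical stand-ins (the real notebook tokenizes a full text), and the sketch assumes the dictionary holds exactly one sample per key:

```python
# Hypothetical stand-in for the notebook's tokenized data: one sample per key.
input_ids_dict = {
    "input_ids": [[101, 7592, 102]],
    "attention_mask": [[1, 1, 1]],
}

# Same expression as quoted above: build a list of 8 copies for each sample,
# then keep only the list built from the first (here: only) sample.
for key in input_ids_dict.keys():
    input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]

num_examples = len(input_ids_dict["input_ids"])
all_identical = all(
    sample == input_ids_dict["input_ids"][0]
    for sample in input_ids_dict["input_ids"]
)
print(num_examples, all_identical)
```

So the dataset ends up with 8 identical examples; the trailing `[0]` means any sample after the first would be dropped.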

BramVanroy · January 13, 2021, 12:29pm

Hm, good question. Tagging @patrickvonplaten who created that notebook.

marcoabrate · January 18, 2021, 8:02am

Yes, would be helpful to have an update from you @patrickvonplaten

s4sarath · January 19, 2021, 3:38pm

Ideally, gradient accumulation has nothing to do with the data. It basically keeps the gradients of a few steps in memory and only then does a gradient update, which has the effect of a larger batch size.
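
As a rough illustration of that idea, here is a minimal pure-Python sketch. The toy data, the one-parameter model `y ≈ w * x`, and the helper `grad_mse` are all assumptions made up for this example, not anything from the notebook or the Trainer. It accumulates gradients over equally sized micro-batches and compares the result with the gradient of one large batch:

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y ≈ w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # toy (x, y) pairs
w = 0.5
accumulation_steps = 2
micro_batch_size = 2

# Gradient of one "big" batch containing all four samples.
full_batch_grad = grad_mse(w, data)

# Gradient accumulation: process small batches one at a time, sum their
# gradients (scaled by 1 / accumulation_steps), and update only at the end.
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_batch = data[start:start + micro_batch_size]
    accumulated_grad += grad_mse(w, micro_batch) / accumulation_steps

learning_rate = 0.01
w_updated = w - learning_rate * accumulated_grad  # single update step

print(abs(full_batch_grad - accumulated_grad) < 1e-9)
```

Each micro-batch here holds different samples; nothing in the procedure requires copying data.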

patrickvonplaten · February 1, 2021, 6:47am

Yeah, this might have been a bit imprecise in the notebook. The reason I’m expanding the training data from 1 to 8 samples is a super edge case. Since Reformer processes the whole training dataset in 1 batch, there is only one data sample in the whole dataset. If one then uses gradient_accumulation (which, as correctly pointed out, has nothing to do with data replication), there is a bug when the dataset is of size 1, because the training script rightfully expects the dataset to have more than one training sample when gradient_accumulation is used. So my solution of expanding the dataset is more of a hack than the recommended way of doing it (actually, one should never copy samples from the dataset).
I doubt there is any real application for a dataset of size 1. This notebook was more of a show-off that Reformer can process the whole dataset in 1 batch, so it is not super relevant for real scenarios.
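
To illustrate why copying samples adds nothing, here is a small sketch under stated assumptions: a deterministic toy model `y ≈ w * x` (no dropout or other randomness), a hypothetical helper `grad_mse`, and gradients that are averaged over the accumulation steps. Under those assumptions, accumulating over 8 copies of one sample gives the same gradient as that sample alone:

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y ≈ w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

sample = (2.0, 4.5)  # the single (x, y) training sample
w = 0.5

# Gradient computed from the one real sample.
single_grad = grad_mse(w, [sample])

# "Dataset" made of 8 copies, processed with batch size 1 and averaged.
duplicated = [sample] * 8
accumulated_grad = 0.0
for micro_batch in duplicated:
    accumulated_grad += grad_mse(w, [micro_batch]) / len(duplicated)

print(abs(single_grad - accumulated_grad) < 1e-9)
```

The copies only serve to give the training loop enough items to iterate over, which matches the description of the duplication as a workaround rather than a technique.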

marcoabrate · February 1, 2021, 8:33am

This is clear now, thank you!
