Hey @BramVanroy, thank you for your reply. I have found the notebook, sorry for not being more precise earlier. It's *Reformer - Pushing the Limits of Language Modeling*. Around cell 7 it says:
> We then expand the same sample to 8 training samples so that we can accumulate gradients during training.
And in the code:

```python
# duplicate data 8 times to have 8 examples in the dataset
for key in input_ids_dict.keys():
    input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]
```
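If it helps to see what that line actually does, here is a minimal standalone sketch (with a toy `input_ids_dict`; I'm assuming each value is a single-element list holding one tokenized sample, as produced earlier in the notebook):

```python
# Toy stand-in for the notebook's dict: one tokenized sample per key.
input_ids_dict = {"input_ids": [[5, 6, 7]]}

for key in input_ids_dict.keys():
    # 8 * [x] builds a list of 8 copies of the single sample;
    # the comprehension wraps it in one more list, and [0] unwraps it.
    input_ids_dict[key] = [8 * [x] for x in input_ids_dict[key]][0]

print(len(input_ids_dict["input_ids"]))  # 8 -- eight copies of the same sample
assert all(x == [5, 6, 7] for x in input_ids_dict["input_ids"])
```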
But `gradient_accumulation_steps` is actually set to 4, not 8 as I would expect, with a batch size of 1.
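For reference, this is how I read the relevant training setup (just a sketch of the two arguments in question, not the notebook's full config; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./reformer-output",   # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
)

# Effective batch per optimizer update = 1 * 4 = 4 samples,
# so the 8 duplicated examples yield 2 optimizer steps per epoch,
# not 1 as the "expand to 8 samples" comment led me to expect.
effective_batch = (training_args.per_device_train_batch_size
                   * training_args.gradient_accumulation_steps)
print(effective_batch)  # 4
```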