Hello!
I am using gradient accumulation to simulate bigger batches when fine-tuning. However, I remember to have seen some notebooks in the documentation where they would make N copies of the data when N is the number of gradient accumulation steps. I do not understand why this should be done. Is this good practice? Why?
Thank you