Questions about steps with gradient accumulation

shelf · June 1, 2022, 7:47am

For example, I use batch size=64 and max steps=1000, with logging, evaluation, and save steps all set to 100.

Question 1: I want to confirm that if I switch to gradient accumulation (batch size=32, gradient accumulation steps=2), I don’t need to change the step args (1000 → 2000, 100 → 200). Will transformers fully take care of gradient_accumulation_steps * xxx_step? I found the gradient_accumulation_steps documentation and this discussion, but I couldn’t find an exact example.

Question 2: How about warm-up? Can I keep the original warm-up args as they are?

Question 3: Will the data be fed into the model in the same order? (I don’t set a seed with --seed, so the default seed 42 is used.)
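
To make the numbers in Question 1 concrete, here is a minimal sketch of the arithmetic in plain Python. It assumes that max steps and the log/evaluate/save steps count optimizer updates rather than individual forward/backward passes, which is how I read the gradient_accumulation_steps documentation. The helper `accumulation_plan` and its `num_devices` parameter are hypothetical names used only for illustration; they are not part of transformers.

```python
def accumulation_plan(per_device_batch_size, gradient_accumulation_steps,
                      max_steps, num_devices=1):
    """Summarize a run, ASSUMING max_steps counts optimizer updates.

    Hypothetical helper for illustration only; not a transformers API.
    """
    # Samples that contribute to a single optimizer update.
    effective_batch_size = (
        per_device_batch_size * gradient_accumulation_steps * num_devices
    )
    # Each optimizer update consumes this many micro-batches per device.
    forward_backward_passes = max_steps * gradient_accumulation_steps
    # Total samples seen over the whole run.
    samples_seen = effective_batch_size * max_steps
    return {
        "effective_batch_size": effective_batch_size,
        "forward_backward_passes": forward_backward_passes,
        "samples_seen": samples_seen,
    }


# Original setup: batch size=64, no accumulation, max steps=1000.
baseline = accumulation_plan(64, 1, 1000)

# Accumulation setup: batch size=32, gradient accumulation steps=2,
# max steps left unchanged at 1000.
accumulated = accumulation_plan(32, 2, 1000)

print(baseline)
print(accumulated)
```

Under that assumption both runs perform 1000 updates with an effective batch size of 64, and only the number of forward/backward passes doubles, so the step args would stay at 1000 and 100.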

logvinata · July 19, 2023, 11:47am

Hi, @shelf! Did you find the answers?
