I am training GPT-J with DeepSpeed. The training loss keeps decreasing, which is good, but the validation loss starts climbing after as few as 2 epochs. The oddest thing is that the drops in training loss and the jumps in validation loss seem synchronized, and they always happen at epoch boundaries. That doesn't look right, but I have no clue what the cause is. Does anyone have a suggestion for what to investigate? I tried adding 10 steps of warmup and reducing the learning rate from the default 5e-5 to 2e-2, but neither made any difference. I always see the same "stair"-shaped learning curve.
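One way to confirm that the drops really line up with epoch boundaries, rather than just looking that way on a plot, is to compare the average loss change across boundaries with the average change inside an epoch. The following is only a minimal sketch: the function name `boundary_vs_within_epoch_drop`, the `losses` list, and `steps_per_epoch` are hypothetical and assume the per-step training loss has already been exported from the training logs. The numbers used here are synthetic, not taken from the actual run.

```python
from statistics import mean


def boundary_vs_within_epoch_drop(losses, steps_per_epoch):
    """Return (mean loss drop across epoch boundaries, mean drop within epochs).

    `losses` is a list of per-step training losses (hypothetical input,
    assumed to be exported from the training logs).
    """
    boundary, within = [], []
    for step in range(1, len(losses)):
        drop = losses[step - 1] - losses[step]  # positive means the loss decreased
        # `step` is the first step of a new epoch when it is a multiple of steps_per_epoch
        if step % steps_per_epoch == 0:
            boundary.append(drop)
        else:
            within.append(drop)
    if not boundary or not within:
        raise ValueError("need at least one boundary step and one within-epoch step")
    return mean(boundary), mean(within)


# Synthetic stair-shaped curve: flat within each epoch, dropping at each boundary.
losses = [2.0, 2.0, 2.0, 2.0, 1.5, 1.5, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0]
boundary_drop, within_drop = boundary_vs_within_epoch_drop(losses, steps_per_epoch=4)
print(boundary_drop, within_drop)
```

If the boundary drop is much larger than the within-epoch drop, the model is improving mainly when it starts seeing the same training examples again, which would point toward memorization of the training set rather than a warmup or learning-rate problem.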