The loss surface when training diffusion models is quite uninformative IMO. A good analysis of this is available here: [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556)
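For context, the idea in that paper is to clamp the per-timestep loss weight by the signal-to-noise ratio so that easy (high-SNR, low-noise) timesteps don't dominate training. A minimal sketch of the ε-prediction weight, `min(SNR(t), γ) / SNR(t)`, assuming a simple linear beta schedule (the schedule and the γ = 5 default here are illustrative, not from this thread):

```python
import numpy as np

def min_snr_weights(alphas_cumprod, timesteps, gamma=5.0):
    """Min-SNR loss weights for epsilon-prediction diffusion training.

    SNR(t) = alpha_bar_t / (1 - alpha_bar_t); clamping the SNR at `gamma`
    caps the weight of low-noise timesteps, which otherwise get a very
    large effective weight under the standard epsilon-prediction loss.
    """
    snr = alphas_cumprod[timesteps] / (1.0 - alphas_cumprod[timesteps])
    return np.minimum(snr, gamma) / snr

# Toy linear beta schedule, just to exercise the function.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

# Early timesteps (high SNR) get heavily down-weighted; late, noisy
# timesteps (SNR below gamma) keep weight 1.0.
w = min_snr_weights(alphas_cumprod, np.array([0, 500, 999]))
```

The loss itself stays a plain MSE on the noise; only the per-sample weight changes, which is why the wall-clock cost is essentially zero.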
On the other hand, the smaller dataset obviously goes through more epochs in the same number of steps, so I guess it benefits more from seeing the same data again?
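To make the epoch arithmetic concrete (the step budget, batch size, and dataset sizes below are made-up numbers, not from this thread):

```python
def epochs_seen(steps, batch_size, dataset_size):
    # Number of full passes over the data within a fixed step budget.
    return steps * batch_size / dataset_size

# Same 10_000-step budget, batch size 32:
small = epochs_seen(10_000, 32, 5_000)   # small dataset: 64 epochs
large = epochs_seen(10_000, 32, 50_000)  # 10x larger dataset: 6.4 epochs
```

So at a fixed step count, the smaller dataset is repeated far more often, which is exactly the regime where overfitting concerns kick in.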
I’d think so, but have you checked whether the model overfits the data too quickly if you do that? This is something we have consistently observed in our experiments. Cc: @pcuenq @valhalla