When is a generative model said to overfit?

If I train a causal language model, should I be worried about overfitting? If so, what would that imply? That it cannot generalize well to unseen prompts?

I am used to validating on downstream tasks and picking the best checkpoint there, one where validation loss is not yet worse than training loss (i.e. before overfitting sets in), but I am not sure whether that applies to CLM/generation tasks.

I guess what I am asking is:

  • Do you validate your (C)LM/generation tasks during training, either for early stopping or to find the best checkpoint?
  • If you do not, how do you decide how long to train?

For generative models, one typically measures perplexity (PPL) on a held-out dataset. As long as held-out perplexity keeps improving (that is, decreasing), keep training.
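To make that concrete, here is a minimal sketch in plain Python. It assumes you already have per-token negative log-likelihoods (in nats) from the held-out set and one perplexity value per saved checkpoint. The helper names `perplexity` and `best_checkpoint`, the `patience` parameter, and the numbers in the usage lines are hypothetical, chosen only for illustration.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def best_checkpoint(eval_ppls, patience=2):
    """Patience-based early stopping over held-out perplexities.

    Returns (index of the best checkpoint, index at which training stops).
    Training stops once `patience` consecutive evaluations fail to improve
    on the best perplexity seen so far.
    """
    best_idx, best_ppl = 0, float("inf")
    for i, ppl in enumerate(eval_ppls):
        if ppl < best_ppl:
            best_idx, best_ppl = i, ppl
        elif i - best_idx >= patience:
            return best_idx, i
    # Never ran out of patience: trained through the last evaluation.
    return best_idx, len(eval_ppls) - 1


# Made-up held-out perplexities, one per evaluation step.
history = [30.0, 25.0, 22.0, 23.0, 24.5, 26.0]
best, stopped = best_checkpoint(history, patience=2)
print(f"keep checkpoint {best}, stop at evaluation {stopped}")
```

Since perplexity is just the exponential of the mean token-level cross-entropy, tracking eval loss directly gives the same checkpoint ordering; PPL is only the more conventional number to report.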

So it is similar to how you would otherwise track overfitting, only with PPL as the metric. I find that my eval loss increases quite rapidly when finetuning, and the generations are still neither sensible nor adapted to the new domain. If you have any ideas, shoot!