If I train a causal language model, should I be worried about overfitting? If so, what would that imply? That it cannot generalize well to unseen prompts?
For downstream tasks I am used to validating during training and selecting the best checkpoint at a point where validation loss is not worse than training loss (i.e. before overfitting sets in), but I am not sure whether the same approach applies to CLM/generation tasks. A rough sketch of what I mean is below.
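For context, this is roughly the setup I have in mind: a minimal sketch assuming the HuggingFace `Trainer`, where the GPT-2 model and WikiText data are just placeholders rather than my actual task.

```python
# Sketch of "validate during training and keep the best checkpoint" for a CLM.
# Assumes transformers + datasets; model and corpus below are placeholders.
import math
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("wikitext", "wikitext-2-raw-v1")  # placeholder corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0)  # drop empty lines
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(
    output_dir="clm-checkpoints",
    evaluation_strategy="steps",      # evaluate on held-out text during training
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,      # restore the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop when eval loss stalls
)
trainer.train()
print("validation perplexity:", math.exp(trainer.evaluate()["eval_loss"]))
```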
I guess what I am asking is:
- do you validate your (C)LM/generation models during training as a means of early stopping / finding the best checkpoint?
- if you do not, how do you decide how long to train?