I have a question about methodology. I have a good dataset (about 8000 samples) which I can use to train a model on an NLP task. Normally I would divide the dataset into training and validation sets (90% - 10%), or perhaps training, validation and test sets (80% - 10% - 10%), train on the training set, do a hyperparameter search using the performance on the validation set, and once I’m happy with my model parameters, run it once on the test set to get the final performance metrics.
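For concreteness, the 80% - 10% - 10% split described above could look roughly like the sketch below. The function name `split_dataset`, the seed, and the use of integer IDs in place of real samples are assumptions made purely for illustration.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test lists.

    Hypothetical helper: whatever is left after the train and validation
    portions becomes the test set.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n = len(shuffled)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Integer IDs stand in for the ~8000 real samples.
train, val, test = split_dataset(range(8000))
print(len(train), len(val), len(test))
```

The test portion is only touched once, after the hyperparameter search on the validation portion is finished.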
Now, assuming (which seems reasonable) that the model will not get worse if we add more training data, I was wondering if there is anything wrong with re-training the model on the full dataset once the steps above (validation and testing) have been completed. Obviously you can’t then re-test the model, because the newly trained model will already have seen all the samples in your dataset. But it seems reasonable to say that the performance you measured on the test set when training on only 80% of the samples will not get any worse if you add the remaining 20% to the training set. Hence you can keep the performance you got as a baseline, and then train the model on the full dataset knowing that the performance, if anything, is only going to get better than that.
Is there anything fundamentally wrong with this approach from a scientific or philosophical perspective? Any reason why it shouldn’t be done?
Two problems I see with that approach are:
Without a validation set, you don’t know how many epochs to train using the remaining 20%.
When you have a new model, you don’t have any way to compare its performance with the old model unless you keep a held-out test set.
Thanks for your answer.
Just to clarify, I did not mean keeping the model that was already fine-tuned on 80% of the data and fine-tuning it further on only the remaining 20%. I meant taking the exact same hyperparameters I chose through the standard validation process (including the number of epochs) and retraining everything from scratch with those hyperparameters on 100% of the data, so I’m not sure your first point is fully relevant here.
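To make the procedure concrete, here is a minimal sketch of what I mean. Everything in it is a toy assumption for illustration: the synthetic data, the 1-D logistic regression, the tiny grid, and the names `make_toy_data`, `train_from_scratch` and `accuracy` are all made up and stand in for the real NLP model and dataset.

```python
import math
import random

def make_toy_data(n, seed=0):
    """Synthetic stand-in for the real dataset: (feature, label) pairs."""
    rng = random.Random(seed)
    return [(rng.gauss(1.0 if i % 2 else -1.0, 1.0), i % 2) for i in range(n)]

def train_from_scratch(data, epochs, lr, seed=0):
    """Train a toy 1-D logistic regression with SGD, always from fresh weights."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def accuracy(model, data):
    w, b = model
    return sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)

data = make_toy_data(1000)
train, val, test = data[:800], data[800:900], data[900:]

# 1. Hyperparameter search: train on 80%, score on the validation 10%.
grid = [{"epochs": e, "lr": lr} for e in (1, 3) for lr in (0.01, 0.1)]
best = max(grid, key=lambda hp: accuracy(train_from_scratch(train, **hp), val))

# 2. One run on the test 10%. This number describes the 80% model only.
test_acc = accuracy(train_from_scratch(train, **best), test)

# 3. Retrain from scratch on 100% of the data with the same hyperparameters.
#    No held-out data remains, so this final model cannot be scored.
final_model = train_from_scratch(data, **best)
print(best, round(test_acc, 3))
```

Note that `test_acc` is measured on the model from step 2, so it is only a reference point for `final_model`, not a measurement of it.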
On your second point I agree with you. But as I said, if I retrain from scratch on 100% of the data with exactly the same hyperparameters that worked best when training on 80%, it’s unlikely that the performance will be worse than what I got with 80% of the data. So I will still know that it’s at least equal to, if not better than, what I got before (although I won’t know how much better), and that would be enough for me.
Does anyone have further comments based on this clarification?
It might work well, it might not. When you add data to your training run, the model will converge differently because it sees different data and optimizes accordingly. This might be good or bad (it is not a given that it will deterministically end up being better simply because you gave it more data), but the problem is that you simply cannot tell because you have no held-out set anymore. If you do wish to squeeze everything out of your data, cross validation is recommended.
Thanks for your reply. Yes, I see your point. Even with cross validation, however, you would still end up training the final model on the whole dataset without being able to test it at the very end. With 5-fold cross validation, for example, you train your model from scratch 5 times, each time keeping a different 1/5th (20%) of the data for validation, so that at the end of the 5 runs each data sample has been in the validation set once and in the training set four times. Then you can average the metrics from the 5 runs and ultimately re-train the model on 100% of your data, assuming (and it is still an assumption) that the averages of the performance metrics across the 5 runs, each trained on 80% of the data, are a good estimate of the model's performance when trained on 100% of the data. So there are still some assumptions in there, but perhaps they’re more justifiable than simply doing what I was proposing earlier.
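A rough sketch of the 5-fold procedure described above, again with toy assumptions: the synthetic 1-D data, the midpoint-threshold "model", and the names `k_fold_splits`, `fit` and `accuracy` are placeholders for illustration, not a real NLP pipeline.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) so each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, folds[i]

def fit(data):
    """Toy 'model': the midpoint between the two class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

# Synthetic stand-in for the real dataset.
rng = random.Random(0)
data = [(rng.gauss(1.0 if i % 2 else -1.0, 1.0), i % 2) for i in range(1000)]

scores = []
val_counts = [0] * len(data)  # how often each sample lands in a validation fold
for train_idx, val_idx in k_fold_splits(len(data), k=5):
    model = fit([data[j] for j in train_idx])           # trained on 80%
    scores.append(accuracy(model, [data[j] for j in val_idx]))
    for j in val_idx:
        val_counts[j] += 1

cv_mean = sum(scores) / len(scores)  # estimate for a model trained on 80%
final_model = fit(data)              # trained on 100%, not evaluated itself
print(round(cv_mean, 3))
```

Here `cv_mean` averages five models that each saw 80% of the data, so using it as the estimate for `final_model` relies on the same assumption discussed above.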