Compute metric on Dev

Hello, I was wondering: why do notebooks compute BLEU or ROUGE on dev data rather than on test data? Like this notebook, for example.

It is common in deep learning to train on a train set and monitor the loss on a validation (dev) set every x epochs or steps, as is done here. This gives you an intuition of the model's performance, in particular whether it is overfitting: if the training loss is very low but the validation loss is high, your model is overfitting. So the dev data here does not give you official test results, since the model ends up slightly biased towards that dev data: you keep training as long as both the training loss and the validation loss decrease.
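For illustration, here is a minimal sketch of that setup in Keras. The model architecture and the randomly generated train/dev arrays are all hypothetical placeholders, not from the notebook in question:

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for real train/dev splits (hypothetical shapes).
x_train, y_train = np.random.rand(800, 10), np.random.rand(800, 1)
x_val, y_val = np.random.rand(200, 10), np.random.rand(200, 1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# validation_data is evaluated at the end of every epoch: if val_loss starts
# rising while the training loss keeps falling, the model is overfitting.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=20)
```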

To probe the final performance of your model, you still evaluate it on a held-out set that has never been used during training or tuning (the test set). In TensorFlow, AFAIK, you can then score this unseen test set with `model.evaluate`.
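Continuing the sketch above with a hypothetical held-out test set, the final evaluation would look like:

```python
# Only touch the test set once, after all training and tuning is finished.
x_test, y_test = np.random.rand(200, 10), np.random.rand(200, 1)

test_loss = model.evaluate(x_test, y_test)
print("Test loss:", test_loss)
```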
