Difference between dev and test sets in a production environment

Hi. I’m currently developing a neural IR service. While developing it, I have become confused about the right way to use the dev and test sets.

  1. We are at the beginning of the project, so we don’t have any model yet. I collected data and trained five models. During training, I evaluated each model on the dev set at the end of every epoch. After training, I evaluated the models on the test set and got the results below. Which model should I choose for our service?
| Model | Train | Dev | Test |
|-------|-------|-----|------|
| A     | 1     | 2   | 5    |
| A’    | 2     | 4   | 4    |
| A’’   | 3     | 3   | 3    |
| B     | 4     | 1   | 2    |
| C     | 5     | 5   | 1    |

(Note: A’ and A’’ are hyper-parameter-tuned versions of model A. Each number is the model’s rank on a metric such as accuracy, with 1 being the best; for example, model A outperforms A’ on the dev set.)
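To make the question concrete, here is a minimal sketch of the two selection rules I am comparing. The ranks are copied from the table above (1 = best); the dictionary layout and the helper name `best_by` are only my own illustration and are not part of our service.

```python
# Ranks copied from the table above (1 = best).
# The data layout and the helper name are hypothetical, for illustration only.
ranks = {
    "A":   {"train": 1, "dev": 2, "test": 5},
    "A'":  {"train": 2, "dev": 4, "test": 4},
    "A''": {"train": 3, "dev": 3, "test": 3},
    "B":   {"train": 4, "dev": 1, "test": 2},
    "C":   {"train": 5, "dev": 5, "test": 1},
}


def best_by(split: str) -> str:
    """Return the model with the best (lowest) rank on the given split."""
    return min(ranks, key=lambda model: ranks[model][split])


picked_on_dev = best_by("dev")    # selection rule 1: choose on the dev set
picked_on_test = best_by("test")  # selection rule 2: choose on the test set

print("picked on dev: ", picked_on_dev, "-> test rank", ranks[picked_on_dev]["test"])
print("picked on test:", picked_on_test, "-> dev rank", ranks[picked_on_test]["dev"])
```

With these ranks the two rules disagree: choosing on dev gives B, while choosing on test gives C.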

I think the dev set and the test set do exactly the same job here, so I don’t need both of them: they come from the same distribution, and the models didn’t see either of them during training. Am I correct?

  2. We take a base model (B) from the above and want to improve it. One year later, I decide to upgrade the service.
| Model    | Train | Dev | Test |
|----------|-------|-----|------|
| B (Base) | -     | -   | 5    |
| C        | 2     | 4   | 4    |
| D        | 3     | 3   | 3    |
| E        | 4     | 1   | 2    |
| F        | 5     | 5   | 1    |

I train the new models on the train set and compute metrics on the dev set. Once I have the winner on the test set (which is F), I compare the performance of the base model (B) with that of the new winner on the test set. Am I correct?
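The upgrade workflow I have in mind can be sketched roughly as follows. Again the ranks come from the table above (1 = best, `None` where the base model has no entry), and the names `pick_candidate` and `beats_base` are hypothetical helpers I made up for this post.

```python
# Ranks copied from the upgrade table above (1 = best).
# None marks the "-" cells: the base model has no train/dev rank here.
BASE = "B"
ranks = {
    "B": {"train": None, "dev": None, "test": 5},
    "C": {"train": 2, "dev": 4, "test": 4},
    "D": {"train": 3, "dev": 3, "test": 3},
    "E": {"train": 4, "dev": 1, "test": 2},
    "F": {"train": 5, "dev": 5, "test": 1},
}

# Only the newly trained models compete for the upgrade slot.
candidates = [model for model in ranks if model != BASE]


def pick_candidate(split: str) -> str:
    """Return the new model with the best (lowest) rank on the given split."""
    return min(candidates, key=lambda model: ranks[model][split])


def beats_base(model: str) -> bool:
    """Compare a candidate with the base model on the test set only."""
    return ranks[model]["test"] < ranks[BASE]["test"]


for split in ("dev", "test"):
    winner = pick_candidate(split)
    print(f"winner on {split}: {winner}, beats base on test: {beats_base(winner)}")
```

Here picking the winner on the test set gives F, as in my description, while picking on the dev set would give E; both rank above the base model on the test set.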