Validation loss vs ROUGE (mismatch)

Hi there,
I’m experiencing an unexpected behaviour working on a summarization model:
As I increase the number of training samples, my validation loss decreases as expected, but the ROUGE metrics do not improve. As a result, the best model by validation loss is not the best model by ROUGE, and they are not even close: a model with a considerably higher validation loss has the best ROUGE scores.

Can anyone explain this a bit? How normal is it?

Thank you guys

What is your objective function for the validation loss? I think the answer to your question depends on several factors, the biggest of which are the kind of ROUGE you are using (R1 vs. R2 vs. RL) and the score range (high vs. low).

I am using a bigbird-pegasus model pretrained on bigpatent, so I think my loss is based on the probability of the test sequences.
My understanding is that improving the probability of the test sequences should also improve the ROUGE scores when generating. Isn’t that right?
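
To make the distinction concrete, here is a minimal toy sketch of why the two numbers can disagree: the loss is computed from the probabilities assigned to the reference tokens, while ROUGE is computed from the text that is actually generated. Everything in it is hypothetical: the reference sentence, the two "checkpoints", their token probabilities and their generated outputs are made up for illustration, and `rouge1_f1` is a simplified whitespace-token ROUGE-1 without stemming, not the scorer used in the log.

```python
import math
from collections import Counter


def mean_nll(ref_token_probs):
    """Average negative log-likelihood of the reference tokens (a teacher-forced loss)."""
    return -sum(math.log(p) for p in ref_token_probs) / len(ref_token_probs)


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1: unigram-overlap F1 on whitespace tokens, no stemming."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary (7 tokens).
reference = "the device filters noise from the signal"

# Hypothetical checkpoint A: moderately confident on every reference token,
# but its generated summary drifts away from the reference.
probs_a = [0.6] * 7
generated_a = "the device is a device"

# Hypothetical checkpoint B: very confident on most reference tokens, badly
# wrong on one of them, yet its generated summary is close to the reference.
probs_b = [0.9, 0.9, 0.9, 0.05, 0.9, 0.9, 0.9]
generated_b = "the device filters noise from the input"

for name, probs, generated in [("A", probs_a, generated_a), ("B", probs_b, generated_b)]:
    loss = mean_nll(probs)
    score = rouge1_f1(generated, reference)
    print(f"checkpoint {name}: loss={loss:.4f}  rouge1_f1={score:.4f}")
```

In this toy case checkpoint A has the lower loss while checkpoint B has the higher ROUGE-1, so the two criteria rank the checkpoints in opposite order. It only shows that such a disagreement is possible, not that this is what happens in the actual model.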

I am reopening this as I still have doubts…

My training log is as follows:

Epoch  Training Loss  Validation Loss  Rouge1     Rouge2     Rougel     Rougelsum  Gen Len
1      1.579400       1.244211         37.884100  18.394000  28.618100  32.514900  141.280000
2      1.244800       1.071133         51.618500  36.974200  45.075100  48.752200  108.360000
3      1.079600       1.005131         51.801100  36.982600  45.784400  48.130700  120.120000
4      0.999900       0.964689         56.135800  41.890200  50.552600  52.726200  105.640000
5      0.914600       0.948041         56.082700  42.281500  50.631100  53.256300  122.680000
6      0.884700       0.918176         55.765100  41.875500  49.988400  51.950800  129.200000
7      0.813400       0.919715         58.670900  44.651300  52.030800  55.370900  115.400000
8      0.781700       0.907360         62.344000  48.004800  55.436600  58.767100  104.600000
9      0.757800       0.901644         61.207000  47.111100  54.272100  57.602800  110.560000
10     0.728700       0.897729         62.947600  49.205700  56.050100  59.417000  112.160000
11     0.704600       0.904180         62.263900  48.487600  55.553800  58.811700  118.320000
12     0.674800       0.901277         63.497400  49.231000  56.086000  59.940400  117.080000
13     0.652400       0.905896         62.673300  49.047100  55.424800  58.997700  117.640000
14     0.649100       0.906565         62.326100  47.758000  55.396700  58.827100  115.360000
15     0.620000       0.911784         62.418500  48.250400  55.677100  58.524300  119.800000
16     0.599300       0.915148         64.097100  50.649800  57.140800  59.962800  105.120000

You can see that as the validation loss rises due to overfitting, the ROUGE metrics keep improving, so I get conflicting criteria for which model to choose…
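
The disagreement can be checked directly from the log above. A minimal sketch, with the values copied by hand from the table (only the validation loss, Rouge1, Rouge2 and Rougel columns are kept; the tuple layout and variable names are my own choice):

```python
# (epoch, validation_loss, rouge1, rouge2, rougeL), copied from the training log
log = [
    (1, 1.244211, 37.8841, 18.3940, 28.6181),
    (2, 1.071133, 51.6185, 36.9742, 45.0751),
    (3, 1.005131, 51.8011, 36.9826, 45.7844),
    (4, 0.964689, 56.1358, 41.8902, 50.5526),
    (5, 0.948041, 56.0827, 42.2815, 50.6311),
    (6, 0.918176, 55.7651, 41.8755, 49.9884),
    (7, 0.919715, 58.6709, 44.6513, 52.0308),
    (8, 0.907360, 62.3440, 48.0048, 55.4366),
    (9, 0.901644, 61.2070, 47.1111, 54.2721),
    (10, 0.897729, 62.9476, 49.2057, 56.0501),
    (11, 0.904180, 62.2639, 48.4876, 55.5538),
    (12, 0.901277, 63.4974, 49.2310, 56.0860),
    (13, 0.905896, 62.6733, 49.0471, 55.4248),
    (14, 0.906565, 62.3261, 47.7580, 55.3967),
    (15, 0.911784, 62.4185, 48.2504, 55.6771),
    (16, 0.915148, 64.0971, 50.6498, 57.1408),
]

# Checkpoint selected by each criterion.
best_by_loss = min(log, key=lambda row: row[1])    # lowest validation loss
best_by_rouge1 = max(log, key=lambda row: row[2])  # highest Rouge1
best_by_rouge2 = max(log, key=lambda row: row[3])  # highest Rouge2
best_by_rougeL = max(log, key=lambda row: row[4])  # highest Rougel

print("best epoch by validation loss:", best_by_loss[0])
print("best epoch by Rouge1:", best_by_rouge1[0])
print("best epoch by Rouge2:", best_by_rouge2[0])
print("best epoch by Rougel:", best_by_rougeL[0])
```

If the run uses the Hugging Face `Trainer` (an assumption on my part, based only on the shape of the log), the `metric_for_best_model` and `greater_is_better` training arguments decide which of these columns is used to pick the best checkpoint.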

Any idea on how to approach this?

Have you managed to solve this problem? I would love to know.