BLEU evaluation with multiple references

Hi,

I’m trying to train a T5 model on a seq2seq task. The dataset has multiple ground-truth references for each generation. For training, I split the references into separate examples to get more training data. For validation and testing, I want to use all references together to calculate the BLEU score, and during validation I want to save the model with the highest BLEU score on the validation set. This raises two problems:

  1. The common DataCollatorForSeq2Seq can’t handle this because the labels are 3-dimensional (examples × references × tokens). The following is an example:
'references': [[tensor([3613,    8, 4963,   13,    8, 4033,    1]),
   tensor([  320,    21,     3,     9,  9717,  2195, 17041,     1]),
   tensor([  661,   550,    45,     8,     3, 25895,  3797,     1])],
  [tensor([   34,   808,     3, 13287,  5600,    11,     3,    29,  6833,    81,
             460,   676,    12,   129,    12,  1455,     5,     1]),
   tensor([  34,   47, 3412,   53, 7501,  116,    3,   29, 6833, 3030,   12,  129,
             95,    5,    1]),
   tensor([   8, 1282, 1969,  263,   34, 1256,   21, 9635,   52,    7,   12,  253,
              3,   29, 6833,    5,    1])],
  [tensor([   79,   261, 16352,     7,    12,   199,  2331,  9321,     7,     5,
               1]),
   tensor([  79, 2139, 7208, 7479,   70, 9321,    7,    5,    1]),
   tensor([  79,  356,   95,   46, 1470,  718, 1131, 2269,    5,    1])]]}
  2. I don’t know how to use this configuration in the Trainer API: is there a way to skip the validation loss and only calculate the BLEU score?
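
For the first problem, here is a minimal sketch of what I have in mind, not a working solution. It assumes the references are plain lists of token ids (tensors would need `.tolist()` first) and that padding with -100, the default label padding value of DataCollatorForSeq2Seq, is acceptable. Both `pad_references` and `collate_with_references` are hypothetical helpers I made up, not part of transformers, and `base_collate` stands for whatever collator already handles the remaining fields.

```python
from typing import Any, Callable, Dict, List

IGNORE_ID = -100  # assumption: same padding value the seq2seq collator uses for labels


def pad_references(
    batch_refs: List[List[List[int]]], pad_id: int = IGNORE_ID
) -> List[List[List[int]]]:
    """Pad a ragged [example][reference][token] structure to a rectangular one."""
    max_refs = max(len(refs) for refs in batch_refs)
    max_len = max(len(ref) for refs in batch_refs for ref in refs)
    padded = []
    for refs in batch_refs:
        # Right-pad every reference to the longest reference in the batch.
        rows = [list(ref) + [pad_id] * (max_len - len(ref)) for ref in refs]
        # Examples with fewer references get extra all-padding rows.
        rows += [[pad_id] * max_len for _ in range(max_refs - len(refs))]
        padded.append(rows)
    return padded


def collate_with_references(
    features: List[Dict[str, Any]],
    base_collate: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical wrapper: collate everything except `references` with the
    base collator, then re-attach the references as a padded nested list."""
    refs = [f["references"] for f in features]
    stripped = [{k: v for k, v in f.items() if k != "references"} for f in features]
    batch = base_collate(stripped)
    batch["references"] = pad_references(refs)
    return batch
```

The padded result is rectangular, so it could be turned into a single tensor, and the -100 entries would have to be dropped again before decoding the references for BLEU. What I still don’t know is how to stop the extra `references` key from being passed to the model, which is part of the second problem.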