I was wrong in my earlier statements. I had not taken into account that this is about T5, which formulates every problem as a text-to-text problem where the output labels are indeed “text” taken from the vocabulary.
I am not sure how you can use a weighted cross-entropy loss here, because the labels are not necessarily a single token (that case would be easy). I’ll let @valhalla take this one.
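For reference, here is a minimal sketch of what a token-level weighted cross-entropy computes, written in plain Python so the arithmetic is visible. It is not a T5 recipe: the function name `weighted_token_cross_entropy`, the `vocab_weights` list, and the use of `-100` to mark padded label positions are assumptions made for illustration. The normalization (dividing by the sum of the weights of the target tokens) is also just one possible choice.

```python
import math

IGNORE_INDEX = -100  # assumption: padded label positions are marked with -100


def weighted_token_cross_entropy(logits, labels, vocab_weights):
    """Weighted mean of per-token negative log-likelihoods.

    logits:        one list of raw scores per position (length = vocab size)
    labels:        target token id per position, or IGNORE_INDEX to skip it
    vocab_weights: one weight per token id (hypothetical, user-chosen)
    """
    total_loss = 0.0
    total_weight = 0.0
    for scores, target in zip(logits, labels):
        if target == IGNORE_INDEX:
            continue  # padded positions contribute nothing
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[target]  # -log softmax(scores)[target]
        w = vocab_weights[target]
        total_loss += w * nll
        total_weight += w
    # Normalize by the total weight of the tokens that were actually scored.
    return total_loss / total_weight if total_weight > 0 else 0.0
```

Note that the weights here attach to individual token ids, not to whole label strings. When a label is split into several tokens, it is not obvious which weight each of those tokens should get, which is the difficulty described above.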
But please do not “topic hijack” other threads (see T5 user defined loss function - #14 by peggy).