Are the target tokens (the labels) replaced with the ignore token id somewhere as well? Doesn’t look like it from what I can see … so I’m assuming we need to do that ourselves, and pass the label ids with padding tokens set to -100.
Also, the decoder_input_ids
come back in the form of <eos> <bos> X ...
, but my understanding was always that it should start with <bos>
and the labels
shifted so that <bos>
attempts to predict X[0]
and so forth.