Expected workflow for -100 and padding in labels in seq2seq?

In a seq2seq model like BART, we may want to ignore padding tokens in our loss - just like in other classification tasks. This is typically done by setting padding tokens to -100 in the data collator; CrossEntropyLoss then ignores those positions automatically, since its default ignore_index is -100.
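For reference, a minimal sketch of that setup with DataCollatorForSeq2Seq (the checkpoint name is just an example; label_pad_token_id already defaults to -100):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Inputs are padded with tokenizer.pad_token_id, but labels are padded with -100,
# so CrossEntropyLoss (ignore_index=-100) skips those positions.
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,  # the default, shown explicitly
)
```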

In a generative seq2seq model, this approach also works as long as predict_with_generate=False. In that case we are doing teacher forcing: the model produces exactly as many output positions as there are labels, so no additional padding is necessary.
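A quick sketch of why the shapes line up under teacher forcing (same example checkpoint as above; the texts are made up):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("a made-up source sentence", return_tensors="pt")
labels = tokenizer("a made-up target", return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model(**inputs, labels=labels)  # teacher forcing

# One logit vector per label position, so a -100 mask lines up one-to-one.
assert outputs.logits.shape[:2] == labels.shape
print(outputs.loss)
```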

However, when we do use predict_with_generate, two different things happen: the loss is calculated as before (with teacher forcing), but the predicted tokens are produced by free-running generation. (Although I understand the reasoning, it is a bit confusing that the loss you get back is not directly related to the predicted tokens, since they are produced in different ways.)
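To make the two code paths concrete, this is roughly what happens per evaluation batch when predict_with_generate=True (a simplified sketch of the idea, not the actual Seq2SeqTrainer code):

```python
import torch

def eval_step(model, batch, max_new_tokens=64):
    # 1) The loss comes from a teacher-forced forward pass over the -100-padded labels.
    with torch.no_grad():
        loss = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        ).loss

    # 2) The predictions come from free-running generation, whose length has
    #    nothing to do with the length of the labels.
    generated = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    return loss, generated, batch["labels"]
```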

The model will likely generate more or fewer tokens than there are labels, in which case the Seq2SeqTrainer automatically pads the labels to match if needed. And here is where my question comes from: at this stage we have already replaced the padding tokens in our labels with -100, but now padding tokens are appended again at the end! So for many sequences we end up with a label tensor ending in [..., -100, -100, ..., 1, 1, 1, 1], where 1 is the padding token in this example.
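To illustrate with purely made-up numbers:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([[42, 57, 99, -100, -100]])  # padding already replaced by -100
padded = F.pad(labels, (0, 4), value=1)            # padded again up to the generated length
print(padded)  # tensor([[  42,   57,   99, -100, -100,    1,    1,    1,    1]])
```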

My question is what the typical flow is to use this effectively, and whether it would not make more sense to add a new trainer argument like “generation_label_padding” or something. My guess is that the current approach requires me to manually set all padding tokens to -100 (or all -100 values back to the padding ID) in compute_metrics before doing anything else, and that this is only necessary when using predict_with_generate. Is that correct?
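In code, my guess looks roughly like this (tokenizer assumed to be in scope; the actual metric computation is omitted):

```python
import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Undo the -100 masking so the tokenizer can decode the labels again.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ... compute the actual metric on the decoded strings here ...
    return {}
```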