The old code in examples/seq2seq/finetune.py:
```python
decoder_input_ids = labels[:, 1:]
```
deletes the first target token whenever there is no BOS at the start of the tensor.
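For concreteness, here is a minimal sketch of the problem (token ids are made up; `101`–`103` stand in for ordinary word tokens and `2` for EOS):

```python
import torch

# Hypothetical labels with no BOS at position 0
# (ids are illustrative, not from a real tokenizer).
labels = torch.tensor([[101, 102, 103, 2]])  # 2 = eos

# Old behavior: the slice silently drops the first real token (101).
decoder_input_ids = labels[:, 1:]
print(decoder_input_ids)  # tensor([[102, 103,   2]])
```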
The new code
```python
decoder_input_ids = shift_tokens_right(labels)
```
- Doesn't delete tokens.
- Improves metrics for fine-tuned mBART and Pegasus (which don't use BOS).
- Does not change metrics for fine-tuned models that do use BOS.
This makes sense. Deleting the first word of every tgt example makes finetuning worse.
What I don’t understand:
`shift_tokens_right` wraps the EOS token around to the 0th position of `decoder_input_ids`. This means that models are finetuned with the equivalent of `decoder_start_token_id=eos_token_id`.
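A sketch of that wrap-around behavior, assuming a `shift_tokens_right` along the lines of the one in transformers at the time (the last non-pad token, i.e. EOS, is moved to position 0 and everything else shifts right):

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Move each row's final non-pad token (EOS) to position 0, shifting the rest right."""
    prev_output_tokens = input_ids.clone()
    # Index of the last non-pad token in each row.
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze(-1)
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

labels = torch.tensor([[101, 102, 103, 2, 1]])  # 2 = eos, 1 = pad
print(shift_tokens_right(labels, pad_token_id=1))
# tensor([[  2, 101, 102, 103,   2]])
```

So the decoder sees EOS as its start token during finetuning.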
However, when it comes time to evaluate mbart/pegasus/marian, having `decoder_start_token_id=pad_token_id` produces better metrics than `decoder_start_token_id=eos_token_id`. For the BART variants, `decoder_start_token_id=eos_token_id` works best.
Additionally, switching to the T5-style `shift_tokens_right`, which puts `decoder_start_token_id` at position 0 during finetuning, doesn't improve metrics at all.
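For reference, the T5-style shift looks roughly like this (a sketch; real implementations also handle things like replacing `-100` labels with the pad id):

```python
import torch

def t5_style_shift_tokens_right(
    input_ids: torch.Tensor, decoder_start_token_id: int
) -> torch.Tensor:
    """Prepend decoder_start_token_id and shift everything right, dropping the last token."""
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted

labels = torch.tensor([[101, 102, 103, 2]])
print(t5_style_shift_tokens_right(labels, decoder_start_token_id=0))
# tensor([[  0, 101, 102, 103]])
```

Unlike the EOS wrap-around, this uses a fixed start token rather than whatever the sequence ends with.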
Does this make sense? What am I missing?