The old code in examples/seq2seq/finetune.py:
```python
decoder_input_ids = labels[:, 1:]
```
deletes the first target token whenever there is no BOS at the start of the tensor.
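For concreteness, here is a minimal sketch of the problem (token ids are made up; `101`–`103` stand in for ordinary word tokens and `2` for EOS):

```python
import torch

# Hypothetical labels with no BOS at position 0
# (ids are illustrative, not from a real tokenizer).
labels = torch.tensor([[101, 102, 103, 2]])  # 2 = eos

# Old behavior: the slice silently drops the first real token (101).
decoder_input_ids = labels[:, 1:]
print(decoder_input_ids)  # tensor([[102, 103,   2]])
```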
The new code
```python
decoder_input_ids = shift_tokens_right(labels)
```
- Doesn't delete tokens.
- Improves metrics for fine-tuned mBART and Pegasus (which don't use BOS).
- Does not change metrics for fine-tuned models that do use BOS.
This makes sense. Deleting the first word of every tgt example makes finetuning worse.
What I don’t understand:
`shift_tokens_right` wraps the EOS token around to the 0th position of `decoder_input_ids`. This means that models are finetuned with the equivalent of `decoder_start_token_id=eos_token_id`.
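A sketch of that wrap-around behavior, assuming a `shift_tokens_right` along the lines of the one in transformers at the time (the last non-pad token, i.e. EOS, is moved to position 0 and everything else shifts right):

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Move each row's final non-pad token (EOS) to position 0, shifting the rest right."""
    prev_output_tokens = input_ids.clone()
    # Index of the last non-pad token in each row.
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze(-1)
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

labels = torch.tensor([[101, 102, 103, 2, 1]])  # 2 = eos, 1 = pad
print(shift_tokens_right(labels, pad_token_id=1))
# tensor([[  2, 101, 102, 103,   2]])
```

So the decoder sees EOS as its start token during finetuning.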
However, when it comes time to evaluate mbart/pegasus/marian, having `decoder_start_token_id=pad_token_id` produces better metrics than `decoder_start_token_id=eos_token_id`. For the BART variants, `decoder_start_token_id=eos_token_id` works best.
Additionally, switching to the T5-style `shift_tokens_right`, which puts `decoder_start_token_id` at position 0 during finetuning, doesn't improve metrics at all.
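For reference, the T5-style shift looks roughly like this (a sketch; real implementations also handle things like replacing `-100` labels with the pad id):

```python
import torch

def t5_style_shift_tokens_right(
    input_ids: torch.Tensor, decoder_start_token_id: int
) -> torch.Tensor:
    """Prepend decoder_start_token_id and shift everything right, dropping the last token."""
    shifted = input_ids.new_zeros(input_ids.shape)
    shifted[:, 1:] = input_ids[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted

labels = torch.tensor([[101, 102, 103, 2]])
print(t5_style_shift_tokens_right(labels, decoder_start_token_id=0))
# tensor([[  0, 101, 102, 103]])
```

Unlike the EOS wrap-around, this uses a fixed start token rather than whatever the sequence ends with.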
Does this make sense? What am I missing?