I’m still confused:
"
if a model does not have a padding token already (which is common for decoder-only models because they are trained on blocks which do not have any padding). So you never “unlearn” anything.
"
is true, but if pad is simply set to eos, then during training both EOS and PAD get masked out of the labels (they share the same token id, so masking padding also masks the real EOS). That creates a “wrong” distribution shift: the model gets no loss signal for generating EOS. How do we fix this? See details above.
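
For what it’s worth, one common way out (just a sketch, not necessarily the fix discussed above; I’m assuming the usual Hugging Face `transformers` workflow where `tokenizer.pad_token = tokenizer.eos_token` was the workaround) is to register a dedicated pad token instead of reusing EOS, so masking pad positions in the labels no longer hides EOS from the loss:

```python
# Sketch: give the tokenizer a distinct [PAD] token rather than reusing EOS.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, just for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    # Add a new special token instead of aliasing EOS ...
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    # ... and resize the embedding matrix so the new id has a row.
    model.resize_token_embeddings(len(tokenizer))

# Labels can now ignore padding without ever ignoring EOS:
batch = tokenizer(
    ["Hello world" + tokenizer.eos_token],
    padding="max_length",
    max_length=8,
    return_tensors="pt",
)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # mask only PAD; EOS keeps its loss
```

With a distinct PAD id, the model still sees a training signal for emitting EOS at the end of each sequence, so generation learns when to stop.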