Yes, we should manually replace the pad token with -100 in the labels.
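For illustration, here is a minimal sketch of that replacement in plain Python. The helper name `mask_pad_tokens`, the example batch, and `pad_token_id = 1` are assumptions made for this example, not taken from the library; -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the loss.

```python
# Targets equal to -100 are skipped by PyTorch's CrossEntropyLoss by default.
IGNORE_INDEX = -100


def mask_pad_tokens(labels, pad_token_id):
    """Return a copy of `labels` with every pad token id replaced by IGNORE_INDEX.

    `labels` is a list of token-id lists (one list per example).
    This helper is hypothetical and only illustrates the idea.
    """
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in labels
    ]


# Hypothetical batch, padded to length 5; pad_token_id = 1 is an assumption.
batch = [
    [0, 42, 17, 2, 1],
    [0, 99, 2, 1, 1],
]
masked = mask_pad_tokens(batch, pad_token_id=1)
```

With tensors, the same thing is usually a one-line masked assignment on the labels tensor; in practice the pad id should come from the tokenizer rather than being hard-coded.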
Ideally yes, it should start with the bos token, but in the original fairseq implementation the models are trained with <eos> <bos> X ..., so we have kept it that way for reproducibility.
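To make the <eos> <bos> X ... layout concrete, here is a rough sketch of how decoder inputs of that shape can be built by shifting the labels one position to the right and putting eos in front. The helper name `shift_right` and the token ids (bos = 0, eos = 2) are assumptions for this example only; this is not a copy of the fairseq or library code.

```python
def shift_right(labels, start_token_id):
    """Build decoder inputs by prepending `start_token_id` and dropping the last token.

    `labels` is a list of token-id lists of the form [bos, X..., eos].
    Hypothetical helper, shown only to illustrate the layout.
    """
    return [[start_token_id] + seq[:-1] for seq in labels]


# Assumed ids for illustration: bos = 0, eos = 2.
BOS, EOS = 0, 2
labels = [[BOS, 42, 17, EOS]]

# Using eos as the start token gives inputs of the form <eos> <bos> X ...
decoder_inputs = shift_right(labels, start_token_id=EOS)
```

Note that the shift has to be applied to the labels before pad tokens are replaced with -100 (or the -100 values have to be mapped back to the pad id first), since -100 is not a valid input token id.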