I am using the HuggingFace Seq2SeqTrainer with predict_with_generate=True.
During the validation step I get the value -100 in my predictions, which makes the tokenizer fail when decoding (and makes no sense to me). Why?
I can share the code, but in my experience a long piece of code turns readers away more easily than a short question.
Where can I read how, and with which tokens, the generated sequences are padded? I cannot find any information on this.
These are the predictions I get (from Seq2SeqTrainer with predict_with_generate=True) when I set max_new_tokens=100:
preds[78]: [ 0 3 2 3247 3321 3155 3 2 1018 6327 3155 3
2 1018 6327 3155 37 3 75 23 17436 19 3 9
3 729 302 13 3 75 23 17436 7 24 619 16
8 3 729 302 96 254 23 17436 121 1 -100 -100
-100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
-100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
-100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
-100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100
-100 -100 -100 -100]
The generated output is padded with label_pad_token_id=-100 instead of the tokenizer's padding token. Why?
When I use plain PyTorch code, the predictions are padded with the token 0.
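For what it's worth, the workaround used in the HuggingFace example scripts is to replace the -100 fill values with the tokenizer's pad token before decoding. A minimal sketch (assuming `preds` is the array returned by `Seq2SeqTrainer.predict(...).predictions` and the pad token id is 0, as for T5):

```python
import numpy as np

def replace_label_pad(preds, pad_token_id):
    """Swap the -100 fill values for the real pad token so decoding works."""
    preds = np.asarray(preds)
    return np.where(preds != -100, preds, pad_token_id)

# Example: the tail of the sequence above.
tail = [254, 23, 17436, 121, 1, -100, -100]
print(replace_label_pad(tail, 0).tolist())  # → [254, 23, 17436, 121, 1, 0, 0]
```

After this replacement, `tokenizer.batch_decode(preds, skip_special_tokens=True)` should no longer fail.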