-100 in predictions

I am using the HuggingFace Seq2SeqTrainer with predict_with_generate=True.

I get the value -100 in my predictions during the validation step. This makes the tokenizer fail (and makes no sense to me).

Why?

I can share the code, but in my experience a long piece of code turns readers away more easily than a short question, so here is just a minimal sketch of where it fails instead.
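(The T5 tokenizer and the hard-coded ids below are placeholders, not my actual model or data; they only reproduce the shape of the problem.)

```python
import numpy as np
from transformers import AutoTokenizer

# Placeholder tokenizer; my real model is different.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Roughly what Seq2SeqTrainer hands to compute_metrics when
# predict_with_generate=True: generated ids whose tail is filled with -100.
preds = np.array([[0, 3247, 3321, 1, -100, -100, -100]])

# This fails for me (OverflowError with the fast tokenizer),
# because -100 is not a valid token id.
decoded = tokenizer.batch_decode(preds, skip_special_tokens=True)
```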


Where can I read how and with which tokens the generated sequences are padded?
I cannot find any information on this.

These are the predictions I get (from Seq2SeqTrainer with predict_with_generate=True) when I set max_new_tokens=100:

preds[78]: [    0     3     2  3247  3321  3155     3     2  1018  6327  3155     3
     2  1018  6327  3155    37     3    75    23 17436    19     3     9
     3   729   302    13     3    75    23 17436     7    24   619    16
     8     3   729   302    96   254    23 17436   121     1  -100  -100
  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100
  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100
  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100
  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100  -100
  -100  -100  -100  -100]

The generated output is padded with the label padding value (label_pad_token_id = -100) instead of the tokenizer's padding token. Why?

When I use plain PyTorch code instead, the predictions are padded with the token id 0.
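For reference, replacing -100 with the tokenizer's pad token id before decoding (the same trick that is commonly applied to the labels) seems to avoid the crash, but I would still like to understand why it is needed. A minimal sketch, again with a placeholder T5 tokenizer:

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder

preds = np.array([[0, 3247, 3321, 1, -100, -100, -100]])

# Replace -100 with the real pad token id so the tokenizer can decode.
preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
decoded = tokenizer.batch_decode(preds, skip_special_tokens=True)
print(decoded)
```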
