Why am I seeing `-100` values in predictions during evaluation with `compute_metrics` in a language model task?

Hi everyone,

I’m running into an issue while evaluating my language model using the compute_metrics function. During evaluation, I see -100 values in the predictions tensor, which is leading to an error:

IndexError: piece id is out of range.

After investigation, I realized these -100 values are causing the tokenizer to fail during batch_decode, as shown in this part of the code:

# Split each prediction into the prompt part and the generated part,
# assuming the last n_labels positions line up with the labels.
n_labels = labels.shape[1]
prompt = predictions[:, :-n_labels]
output = predictions[:, -n_labels:]
decoded_prompts = self.tokenizer.batch_decode(prompt, skip_special_tokens=True)
decoded_outputs = self.tokenizer.batch_decode(output, skip_special_tokens=True)

To work around this, I replaced -100 with the padding token ID like this:

import numpy as np
# -100 is not a valid token id, so replace it with the pad token before decoding
if np.any(predictions == -100):
    predictions = np.where(predictions == -100, self.tokenizer.pad_token_id, predictions)

This partially resolves the error (though the decoded output is gibberish English), but I’m still puzzled as to why -100 values are present in my predictions at all, given that these values are typically used for label masking. To be clear, here’s the relevant part of my evaluation loop that leads to this:

loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
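
From what I can tell, nested_concat pads each batch of predictions along the sequence dimension to a common length with padding_index before concatenating them on the batch dimension. Here is a minimal toy sketch of that behaviour (my own simplification, not the actual Trainer code), which produces the same kind of trailing -100 pattern:

import torch

def pad_and_concat(tensor1, tensor2, padding_index=-100):
    # Pad both tensors along dim 1 to the longer sequence length, filling the
    # missing positions with padding_index, then concatenate along dim 0.
    max_len = max(tensor1.shape[1], tensor2.shape[1])
    result = torch.full((tensor1.shape[0] + tensor2.shape[0], max_len), padding_index, dtype=tensor1.dtype)
    result[: tensor1.shape[0], : tensor1.shape[1]] = tensor1
    result[tensor1.shape[0]:, : tensor2.shape[1]] = tensor2
    return result

batch1 = torch.randint(0, 100, (2, 7))  # two predictions of length 7
batch2 = torch.randint(0, 100, (2, 4))  # two predictions of length 4
merged = pad_and_concat(batch1, batch2)
print((merged == -100).sum(dim=0))      # the shorter predictions contribute -100 at the trailing positions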

And here is how many -100 values I see per token location in the preds_host tensor:

sum(preds_host == -100)  # number of -100 values at each token position, summed over the batch
# output
tensor([ ... 0, 16, 32, 64, 80, ..., 284, 284, 284, 284], device='cuda:1')
sum(preds_host == -100).shape
# output
torch.Size([699])

It looks like different samples in the batch have varying numbers of -100 values toward the end of the token sequence.

My main question: Why would -100 values be showing up in the predictions tensor during evaluation when they’re typically used for label masking? How can I address this more cleanly?

Thanks for any insights!

1 Like

AFAIK the Trainer pads all predictions at the end of the evaluation loop: since each prediction can have a different length, the shorter ones are padded with -100 up to the longest one. Replacing them with the pad token ID, as you did, is the correct way to handle it, and if you train a bit more you should start getting non-gibberish output (as long as the model and training pipeline are okay).
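
To make it concrete, the usual pattern is to sanitize the predictions (and the labels, if you also decode them) at the top of compute_metrics, before any call to batch_decode. A rough sketch, assuming a tokenizer in scope with pad_token_id set and leaving the actual metric computation out:

import numpy as np

def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    # -100 is only an ignore/padding marker, not a real token id,
    # so map it to the pad token before decoding
    predictions = np.where(predictions == -100, tokenizer.pad_token_id, predictions)
    labels = np.where(labels == -100, tokenizer.pad_token_id, labels)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ... compute your metric from the decoded strings here ...
    return {}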

2 Likes

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.