Why am I seeing `-100` values in predictions during evaluation with `compute_metrics` in a language model task?

Hi everyone,

I’m running into an issue while evaluating my language model using the compute_metrics function. During evaluation, I see -100 values in the predictions tensor, which is leading to an error:

IndexError: piece id is out of range.

After investigation, I realized these -100 values are causing the tokenizer to fail during batch_decode, as shown in this part of the code:

# Split each prediction into the prompt part and the generated part,
# assuming the last n_labels positions line up with the labels.
n_labels = labels.shape[1]
prompt = predictions[:, :-n_labels]
output = predictions[:, -n_labels:]
decoded_prompts = self.tokenizer.batch_decode(prompt, skip_special_tokens=True)
decoded_outputs = self.tokenizer.batch_decode(output, skip_special_tokens=True)

To work around this, I replaced -100 with the padding token ID like this:

import numpy as np
# -100 is not a valid token id, so replace it with the pad token before decoding
if np.any(predictions == -100):
    predictions = np.where(predictions == -100, self.tokenizer.pad_token_id, predictions)

This partially resolves the error (though the decoded output is gibberish English), but I’m still puzzled as to why -100 values are present in my predictions at all, given that these values are typically used for label masking. To be clear, here’s the relevant part of my evaluation loop that leads to this:

loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
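
From what I can tell, nested_concat pads each batch of predictions along the sequence dimension to a common length with padding_index before concatenating them on the batch dimension. Here is a minimal toy sketch of that behaviour (my own simplification, not the actual Trainer code), which produces the same kind of trailing -100 pattern:

import torch

def pad_and_concat(tensor1, tensor2, padding_index=-100):
    # Pad both tensors along dim 1 to the longer sequence length, filling the
    # missing positions with padding_index, then concatenate along dim 0.
    max_len = max(tensor1.shape[1], tensor2.shape[1])
    result = torch.full((tensor1.shape[0] + tensor2.shape[0], max_len), padding_index, dtype=tensor1.dtype)
    result[: tensor1.shape[0], : tensor1.shape[1]] = tensor1
    result[tensor1.shape[0]:, : tensor2.shape[1]] = tensor2
    return result

batch1 = torch.randint(0, 100, (2, 7))  # two predictions of length 7
batch2 = torch.randint(0, 100, (2, 4))  # two predictions of length 4
merged = pad_and_concat(batch1, batch2)
print((merged == -100).sum(dim=0))      # the shorter predictions contribute -100 at the trailing positions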

And here is how many -100 values I see per token location in the preds_host tensor:

sum(preds_host == -100)  # number of -100 values at each token position, summed over the batch
# output
tensor([ ... 0, 16, 32, 64, 80, ..., 284, 284, 284, 284], device='cuda:1')
sum(preds_host == -100).shape
# output
torch.Size([699])

It looks like different samples in the batch have varying numbers of -100 values toward the end of the token sequence.

My main question: Why would -100 values be showing up in the predictions tensor during evaluation when they’re typically used for label masking? How can I address this more cleanly?

Thanks for any insights!

1 Like

AFAIK the Trainer pads all predictions at the end of the evaluation loop: since each prediction can have a different length, the shorter ones are padded with -100 up to the longest one. Replacing them with the pad token ID, as you did, is the correct way to handle it, and if you train a bit more you should start getting non-gibberish output (as long as the model and training pipeline are okay).
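
To make it concrete, the usual pattern is to sanitize the predictions (and the labels, if you also decode them) at the top of compute_metrics, before any call to batch_decode. A rough sketch, assuming a tokenizer in scope with pad_token_id set and leaving the actual metric computation out:

import numpy as np

def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    # -100 is only an ignore/padding marker, not a real token id,
    # so map it to the pad token before decoding
    predictions = np.where(predictions == -100, tokenizer.pad_token_id, predictions)
    labels = np.where(labels == -100, tokenizer.pad_token_id, labels)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ... compute your metric from the decoded strings here ...
    return {}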

2 Likes

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.