It seems we need to handle the -100 ignore index explicitly before decoding…
The crash is caused by decoding arrays that contain -100. Hugging Face’s fast tokenizer throws OverflowError when asked to decode negative IDs. -100 appears because labels are padded with -100, and in some trainer/generation code paths the generated sequences you get in compute_metrics can also be padded/aligned and include -100. Replace -100 with the tokenizer’s pad ID in both predictions and labels before batch_decode. (GitHub)
Fix: robust compute_metrics
```python
# refs:
# - Decoding -100 causes OverflowError: https://github.com/huggingface/transformers/issues/24433
# - Occurs in translation/summarization flows too: https://github.com/huggingface/transformers/issues/22634
import numpy as np

def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {'GAS': 0, 'Levensgtein_score': 0, 'Identity_Similarity_Score': 0}

    preds, labels = eval_pred
    # HF may return a tuple; take the first element
    if isinstance(preds, (tuple, list)):
        preds = preds[0]
    # If logits slipped in (B, T, V), convert to token ids safely
    if preds.ndim == 3:
        preds = np.argmax(preds, axis=-1)

    # Map the ignore index to a real token id for decoding
    ignore = -100
    pad_id = tokenizer.pad_token_id
    preds = np.where(preds != ignore, preds, pad_id)
    labels = np.where(labels != ignore, labels, pad_id)

    pred_seq = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_seq = tokenizer.batch_decode(labels, skip_special_tokens=True)

    gas, i_sim_score, lev_score = [], [], []
    for t, p in zip(label_seq, pred_seq):
        g, s = get_global_alignment_score(t, p, aligner)
        gas.append(g)
        i_sim_score.append(s)
        lev_score.append(get_levenshtein_score(t, p))

    avg_gas = sum(gas) / len(gas)
    avg_lev = sum(lev_score) / len(lev_score)
    avg_i = sum(i_sim_score) / len(i_sim_score)
    return {'GAS': avg_gas, 'Levensgtein_score': avg_lev, 'Identity_Similarity_Score': avg_i}
```
Why this works: the tokenizer only accepts valid non-negative token IDs; -100 is an ignore index for loss, not a token. Converting -100 → pad_token_id avoids the integral conversion error during decode. (GitHub)
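A minimal before/after sketch of that behavior; the `google-t5/t5-small` checkpoint and the toy label array are placeholders, not values from your pipeline:

```python
import numpy as np
from transformers import AutoTokenizer

# Placeholder checkpoint; any seq2seq fast tokenizer behaves the same way
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

# Labels the way compute_metrics receives them: padded with the -100 ignore index
labels = np.array([[100, 200, 300, -100, -100]])

# tokenizer.batch_decode(labels)  # raises OverflowError: negative IDs are not valid tokens

# Map -100 to the pad token id first, then decoding succeeds
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
print(tokenizer.batch_decode(labels, skip_special_tokens=True))
```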
Additional stability checks
- Ensure pad IDs exist and are used for generation padding. T5 has a pad token; the trainer pads generations using the tokenizer/config pad ID. If the tokenizer or pad ID is missing or mismatched, padding can be wrong. (GitHub)
- Keep generation length fields valid. A bad `generation_config.max_length` (for example, `None` where an integer is expected) has caused evaluation-time errors; prefer `max_new_tokens` or set `generation_max_length` in `Seq2SeqTrainingArguments` (a configuration sketch follows this list). (GitHub)
- Guard against logits. If `preds` is 3-D `[B, T, V]`, argmax before decoding, as above. HF discussions show `Seq2SeqTrainer` returns generated tokens, but other trainers or settings may hand you logits; the guard makes the function safe. (Hugging Face Forums)
- Labels are already handled in your code. Do the same for predictions. HF users hit the same pitfall in the summarization tutorial thread. (Hugging Face Forums)
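A minimal configuration sketch tying the checks above together, assuming a T5-style checkpoint; the model name, output directory, lengths, and `train_ds`/`eval_ds` datasets are placeholders:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google-t5/t5-small"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Make sure a pad id exists and the model config agrees with the tokenizer
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

args = Seq2SeqTrainingArguments(
    output_dir="out",               # placeholder
    predict_with_generate=True,     # eval runs generate(), so compute_metrics gets token ids
    generation_max_length=128,      # keep generation length fields explicit integers
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,         # assumed to be already tokenized
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,            # newer transformers versions prefer processing_class=tokenizer
    compute_metrics=compute_metrics,  # the sanitizing function above
)
```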
Why -100 shows up at all
`DataCollatorForSeq2Seq` pads labels to a uniform length with `-100` so the loss ignores them. If you try to decode those raw IDs, you'll hit the overflow. Some evaluation paths also align prediction lengths, and negative IDs can leak into what you receive as "predictions," so always sanitize before decoding. (Niklas Heidloff)
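To see where the -100s come from, a small sketch; the checkpoint and example strings are placeholders:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")  # placeholder checkpoint

# Two examples whose labels have different lengths
features = [
    {"input_ids": tokenizer("input A").input_ids, "labels": tokenizer("short").input_ids},
    {"input_ids": tokenizer("input B").input_ids, "labels": tokenizer("a much longer target sequence").input_ids},
]

collator = DataCollatorForSeq2Seq(tokenizer)  # label_pad_token_id defaults to -100
batch = collator(features)
print(batch["labels"])  # the shorter row is padded with -100, which decode cannot handle
```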
Short, curated references
GitHub issues (primary):
- Decoding error from `-100` when using `DataCollatorForSeq2Seq`. Repro + stack traces. (GitHub)
- Same overflow in translation flows. Confirms the root cause is decoding invalid IDs. (GitHub)
- Generation config length pitfalls during eval. Use `max_new_tokens` or set `max_length` correctly. (GitHub)
Hugging Face docs/examples:
- Data collators and padding behavior. Background on why labels get `-100`. (Hugging Face)
HF forum posts:
- Tutorial bug report stating you must replace `-100` before decoding predictions as well. (Hugging Face Forums)
- Discussion showing differences between logits vs. generated tokens in evaluation. Helps explain why the ndim guard is useful. (Hugging Face Forums)
This set maps the symptoms to the cause and gives the durable fix: sanitize predictions and labels (-100 → pad_token_id) before batch_decode, and keep generation length fields consistent.