It seems we need to handle the -100 ignore index explicitly before decoding…
The crash is caused by decoding arrays that contain -100. Hugging Face’s fast tokenizer throws OverflowError when asked to decode negative IDs. -100 appears because labels are padded with -100, and in some trainer/generation code paths the generated sequences you get in compute_metrics can also be padded/aligned and include -100. Replace -100 with the tokenizer’s pad ID in both predictions and labels before batch_decode. (GitHub)
Fix: robust compute_metrics
```python
# refs:
# - Decoding -100 causes OverflowError: https://github.com/huggingface/transformers/issues/24433
# - Occurs in translation/summarization flows too: https://github.com/huggingface/transformers/issues/22634
import numpy as np

def compute_metrics(eval_pred):
    if not _state.is_main_process:
        return {'GAS': 0, 'Levensgtein_score': 0, 'Identity_Similarity_Score': 0}

    preds, labels = eval_pred
    # HF may return a tuple; take the first element
    if isinstance(preds, (tuple, list)):
        preds = preds[0]
    # If logits slipped in (B, T, V), convert to token ids safely
    if preds.ndim == 3:
        preds = np.argmax(preds, axis=-1)

    # Map the ignore index to a real token id for decoding
    ignore = -100
    pad_id = tokenizer.pad_token_id
    preds = np.where(preds != ignore, preds, pad_id)
    labels = np.where(labels != ignore, labels, pad_id)

    pred_seq = tokenizer.batch_decode(preds, skip_special_tokens=True)
    label_seq = tokenizer.batch_decode(labels, skip_special_tokens=True)

    gas, i_sim_score, lev_score = [], [], []
    for t, p in zip(label_seq, pred_seq):
        g, s = get_global_alignment_score(t, p, aligner)
        gas.append(g)
        i_sim_score.append(s)
        lev_score.append(get_levenshtein_score(t, p))

    avg_gas = sum(gas) / len(gas)
    avg_lev = sum(lev_score) / len(lev_score)
    avg_i = sum(i_sim_score) / len(i_sim_score)
    return {'GAS': avg_gas, 'Levensgtein_score': avg_lev, 'Identity_Similarity_Score': avg_i}
```
Why this works: the tokenizer only accepts valid non-negative token IDs; -100 is an ignore index for loss, not a token. Converting -100 → pad_token_id avoids the integral conversion error during decode. (GitHub)
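A minimal before/after sketch of that behavior; the `google-t5/t5-small` checkpoint and the toy label array are placeholders, not values from your pipeline:

```python
import numpy as np
from transformers import AutoTokenizer

# Placeholder checkpoint; any seq2seq fast tokenizer behaves the same way
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

# Labels the way compute_metrics receives them: padded with the -100 ignore index
labels = np.array([[100, 200, 300, -100, -100]])

# tokenizer.batch_decode(labels)  # raises OverflowError: negative IDs are not valid tokens

# Map -100 to the pad token id first, then decoding succeeds
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
print(tokenizer.batch_decode(labels, skip_special_tokens=True))
```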
Additional stability checks
- Ensure pad IDs exist and are used for generation padding. T5 has a pad token; the trainer pads generations using the tokenizer/config pad ID. If the tokenizer or pad ID is missing or mismatched, padding can be wrong. (GitHub)
- Keep generation length fields valid. A bad `generation_config.max_length` (for example, `None` where an integer is expected) has caused evaluation-time errors; prefer `max_new_tokens` or set `generation_max_length` in `Seq2SeqTrainingArguments` (a configuration sketch follows this list). (GitHub)
- Guard against logits. If `preds` is 3-D `[B, T, V]`, argmax before decoding, as above. HF discussions show `Seq2SeqTrainer` returns generated tokens, but other trainers or settings may hand you logits; the guard makes the function safe. (Hugging Face Forums)
- Labels are already handled in your code. Do the same for predictions. HF users hit the same pitfall in the summarization tutorial thread. (Hugging Face Forums)
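A minimal configuration sketch tying the checks above together, assuming a T5-style checkpoint; the model name, output directory, lengths, and `train_ds`/`eval_ds` datasets are placeholders:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google-t5/t5-small"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Make sure a pad id exists and the model config agrees with the tokenizer
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

args = Seq2SeqTrainingArguments(
    output_dir="out",               # placeholder
    predict_with_generate=True,     # eval runs generate(), so compute_metrics gets token ids
    generation_max_length=128,      # keep generation length fields explicit integers
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,         # assumed to be already tokenized
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,            # newer transformers versions prefer processing_class=tokenizer
    compute_metrics=compute_metrics,  # the sanitizing function above
)
```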
Why -100 shows up at all
`DataCollatorForSeq2Seq` pads labels to a uniform length with `-100` so the loss ignores them. If you try to decode those raw IDs, you'll hit the overflow. Some evaluation paths also align prediction lengths, and negative IDs can leak into what you receive as "predictions," so always sanitize before decoding. (Niklas Heidloff)
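To see where the -100s come from, a small sketch; the checkpoint and example strings are placeholders:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")  # placeholder checkpoint

# Two examples whose labels have different lengths
features = [
    {"input_ids": tokenizer("input A").input_ids, "labels": tokenizer("short").input_ids},
    {"input_ids": tokenizer("input B").input_ids, "labels": tokenizer("a much longer target sequence").input_ids},
]

collator = DataCollatorForSeq2Seq(tokenizer)  # label_pad_token_id defaults to -100
batch = collator(features)
print(batch["labels"])  # the shorter row is padded with -100, which decode cannot handle
```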
Short, curated references
GitHub issues (primary):
- Decoding error from `-100` when using `DataCollatorForSeq2Seq`. Repro + stack traces. (GitHub)
- Same overflow in translation flows. Confirms the root cause is decoding invalid IDs. (GitHub)
- Generation config length pitfalls during eval. Use `max_new_tokens` or set `max_length` correctly. (GitHub)
Hugging Face docs/examples:
- Data collators and padding behavior. Background on why labels get `-100`. (Hugging Face)
HF forum posts:
- Tutorial bug report stating you must replace `-100` before decoding predictions as well. (Hugging Face Forums)
- Discussion showing differences between logits vs. generated tokens in evaluation. Helps explain why the ndim guard is useful. (Hugging Face Forums)
This set maps the symptoms to the cause and gives the durable fix: sanitize predictions and labels (-100 → pad_token_id) before batch_decode, and keep generation length fields consistent.