Bug in Summarization tutorial

Hello, I was reproducing the Summarization tutorial.

The code seems to contain the same problem that is discussed in Decoding error while using DataCollatorForSeq2Seq · Issue #24433 · huggingface/transformers · GitHub.
Forgetting to replace the -100 sentinel values in the predictions leads to the error
OverflowError: out of range integral type conversion attempted
In the compute_metrics function this replacement is done only for the labels, not for the predictions.

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # -100 values in `predictions` are NOT replaced here, unlike for `labels` below
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
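
For reference, the error itself is easy to reproduce in isolation (a minimal sketch; the t5-small checkpoint is an arbitrary choice, any fast tokenizer shows the same behaviour):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# A fast tokenizer cannot convert the negative sentinel -100 into a token id,
# so this raises: OverflowError: out of range integral type conversion attempted
tokenizer.batch_decode([[-100, 0, 1]], skip_special_tokens=True)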

This is a very tricky bug, and it would probably be good to change something in the library so that people don't have to patch this manually every time. As you can see, it leads to a hard-to-debug problem: the error is thrown only when padding was actually used, which happens rather unpredictably.
By default the generation length is 20, which is why the notebook from the tutorial runs without the error most of the time.
But after increasing generation_max_length e.g. to 100 it fails much more often.

training_args = Seq2SeqTrainingArguments(
    ...,
    generation_max_length=100,
)

Hey @Hacker1337
Did you find a solution for this? Having the same problem.

Yes, the solution is very simple. As described in the GitHub issue, you just have to replace -100 with the padding token in the predictions, the same way it is already done for the labels.

Insert this line before the predictions values are used:
predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
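
Putting it together, the tutorial's compute_metrics looks like this with the fix applied (a sketch assuming the tokenizer and rouge objects from the tutorial notebook are in scope):

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Replace the -100 sentinel in the predictions before decoding,
    # exactly as is already done for the labels below
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}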