How to decode with custom pad tokens

pietrolesci · March 8, 2022, 1:15pm

Hi there,

I am implementing a seq2seq model. I am padding the target sequence using

from transformers import DataCollatorForSeq2Seq

DataCollatorForSeq2Seq(
    tokenizer=t5_tokenizer,
    padding=self.padding,
    label_pad_token_id=-100,  # padded label positions are filled with -100
    return_tensors="pt",
)

However, when I decode the target sequence (for example, to compute the BLEU score), I get OverflowError: out of range integral type conversion attempted because of the -100 values.

Is there a direct way to pass a custom padding token to tokenizer.batch_decode?

(The alternative would be to manually replace -100 with 0 before decoding, as done here, but I am looking for something more straightforward, if it exists.)
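For reference, a minimal sketch of that manual substitution, in plain Python so it runs without any dependencies. The helper name replace_ignore_index and the toy token ids are made up for illustration; the only assumption about the model is that T5's pad token id is 0, which matches the "0" mentioned above.

```python
IGNORE_INDEX = -100  # the label_pad_token_id passed to the collator above


def replace_ignore_index(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of a batch of label ids with ignore_index swapped for pad_token_id.

    Hypothetical helper, not part of transformers.
    """
    return [
        [pad_token_id if token_id == ignore_index else token_id for token_id in row]
        for row in label_ids
    ]


# Toy batch standing in for batch["labels"].tolist(); the ids are invented.
labels = [[8774, 296, 1, -100, -100], [27, 114, 1712, 1, -100]]

# 0 is assumed to be the pad id here; in practice use t5_tokenizer.pad_token_id.
cleaned = replace_ignore_index(labels, pad_token_id=0)
print(cleaned)
```

The cleaned ids should then be safe to pass to t5_tokenizer.batch_decode(cleaned, skip_special_tokens=True), which also drops the pad tokens from the decoded strings. Reading the id from t5_tokenizer.pad_token_id rather than hardcoding 0 keeps the same code usable with other tokenizers.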

Thanks a lot in advance!

sheoran95 · April 13, 2023, 7:03pm

Hey @pietrolesci
did you find any solution for this? I’m stuck with the same problem.

isspek · November 2, 2023, 12:14pm

I am getting the same issue. Is there any solution?

Nevermetyou · December 22, 2023, 8:47am

Got the same problem.
