Fine-tuning Whisper: attention mask not set and cannot be inferred

I was following @sanchit-gandhi's tutorial (https://huggingface.co/blog/fine-tune-whisper), but I got the following warning: “The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input’s attention_mask to obtain reliable results.”

This warning was introduced in a recent transformers release, and I was trying to understand whether it can affect the fine-tuning process. If so, how can we avoid it?

To my knowledge, the cause of the warning is that eos_token_id == pad_token_id == 50257 in the tokenizer, but it should be fine because we replace the pad token IDs with -100 at the line labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100) in the DataCollatorSpeechSeq2SeqWithPadding class.
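
For reference, here is a minimal, self-contained sketch of what that masking line does. The tensors are made up purely for illustration, with 50257 standing in for the shared pad/eos ID of the multilingual tokenizer:

```python
import torch

# Padded label IDs as they might come out of processor.tokenizer.pad(...)
input_ids = torch.tensor(
    [[50258, 1234, 5678, 50257, 50257],   # eos at position 3, one pad token
     [50258, 4321, 50257, 50257, 50257]]  # eos at position 2, two pad tokens
)
attention_mask = torch.tensor(
    [[1, 1, 1, 1, 0],
     [1, 1, 1, 0, 0]]
)

# Every position the tokenizer marked as padding becomes -100, which the
# cross-entropy loss ignores; the real eos token (mask == 1) is kept.
labels = input_ids.masked_fill(attention_mask.ne(1), -100)
print(labels)
# tensor([[50258,  1234,  5678, 50257,  -100],
#         [50258,  4321, 50257,  -100,  -100]])
```

So even though the pad and eos IDs collide, the attention mask returned by the tokenizer decides which positions are ignored in the loss.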


@duonguyen that makes sense, thanks!

I ran into the same error message when running the basic HF Whisper inference code:
https://huggingface.co/docs/transformers/en/model_doc/whisper#inference

Does anyone know how to fix it? It seems that in Whisper’s default config, pad_token_id = 50256 and eos_token_id = 50256. How can I make them not equal?
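
To double-check which IDs your checkpoint actually uses, you can inspect its config and generation config (the checkpoint name below is just an example, substitute the one you are loading):

```python
from transformers import WhisperForConditionalGeneration

# Example checkpoint; substitute whichever one you are loading.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

print(model.config.pad_token_id, model.config.eos_token_id)
print(model.generation_config.pad_token_id, model.generation_config.eos_token_id)
```

As the warning text itself suggests, the usual remedy is to pass an attention_mask rather than to change the token IDs.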

This example for long-form transcription doesn’t work out of the box either:
https://huggingface.co/docs/transformers/v4.42.0/en/model_doc/whisper#transformers.WhisperForConditionalGeneration
It runs into the same error message as above.
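
For reference, the relevant part of that docs example looks roughly like this (paraphrased; the checkpoint and dataset names come from the docs page). Note that it already requests the attention mask from the processor and spreads it into generate, yet the warning still appears:

```python
from datasets import Audio, load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

# Long-form audio (> 30 s), resampled to 16 kHz.
ds = load_dataset("distil-whisper/meanwhile", "default", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
raw_audio = [ds[0]["audio"]["array"]]

# Do not truncate, pad to the longest clip, and return the attention mask.
inputs = processor(
    raw_audio,
    sampling_rate=16_000,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)

generated_ids = model.generate(**inputs, return_timestamps=True)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```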

I added a line: kwargs["attention_mask"] = attention_mask here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/generation_whisper.py#L487
That worked and I got reasonable ASR results. My guess is that the transformers library changed some APIs at some point and Whisper stopped passing the attention_mask argument through properly.
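
Before patching generation_whisper.py, it may be worth trying to pass the mask explicitly at the call site. A sketch, assuming model and inputs were built with return_attention_mask=True as in the snippet above; I am not certain this reaches every internal code path, which may be why the in-library change was needed:

```python
# Pass the attention mask explicitly instead of editing the library source.
generated_ids = model.generate(
    input_features=inputs.input_features,
    attention_mask=inputs.attention_mask,
    return_timestamps=True,
)
```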

Not sure what the optimal solution is.