Two Whisper classes for generation, but the same functionality?

Are there any differences between WhisperForConditionalGeneration and WhisperForCausalLM? From the documentation, they are very similar to each other.

For WhisperForConditionalGeneration, it says:

The Whisper Model with a language modeling head. Can be used for automatic speech recognition. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

And for WhisperForCausalLM:

Whisper decoder with a language modeling head on top (linear layer with weights tied to the input embeddings). This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

It looks like both of them have a language modeling head on top. But are there any other differences between these classes?

Best

Hi @alerio,

I had the same question, and it turns out that WhisperForCausalLM is used solely to load the assistant (draft) model for speculative decoding.

Rather than loading the whole encoder-decoder model, WhisperForCausalLM loads only the decoder with a language modeling head on top.
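For reference, here is a minimal sketch of how the two classes fit together in speculative decoding. The checkpoint names `openai/whisper-large-v2` and `distil-whisper/distil-large-v2` are just one possible main/assistant pairing, and the audio here is a silent placeholder:

```python
import numpy as np
from transformers import (
    WhisperForCausalLM,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

# Full encoder-decoder model that produces the actual transcription.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Decoder-only assistant (draft) model; no encoder is instantiated.
assistant_model = WhisperForCausalLM.from_pretrained("distil-whisper/distil-large-v2")

# Placeholder input: 2 seconds of silence at 16 kHz (use real audio here).
audio = np.zeros(2 * 16000, dtype=np.float32)
input_features = processor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features

# Passing `assistant_model` to generate() enables assisted (speculative) decoding.
predicted_ids = model.generate(input_features, assistant_model=assistant_model)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```

The assistant drafts candidate tokens cheaply and the main model verifies them, so you get the full model's output quality with fewer sequential decoder passes.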

You can find more details in the initial PR from Patrick: [WhisperForCausalLM] Add WhisperForCausalLM for speculative decoding by patrickvonplaten · Pull Request #27195 · huggingface/transformers · GitHub
