Hi @alerio,
I had the same question, and it turns out that WhisperForCausalLM
is used solely to load the assistant (draft) model for speculative decoding.
Instead of loading the whole encoder-decoder, WhisperForCausalLM
loads only the decoder with a language modeling head on top.
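As an illustration (not taken from the PR itself), here's a minimal sketch of how a decoder-only assistant can be passed to `generate()` for assisted/speculative decoding; the checkpoint names and the dummy dataset are just examples:

```python
from datasets import load_dataset
from transformers import (
    WhisperForCausalLM,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Main encoder-decoder model (checkpoint names here are just examples)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Assistant model: WhisperForCausalLM loads only the decoder + LM head
assistant_model = WhisperForCausalLM.from_pretrained("distil-whisper/distil-large-v2")

# Small dummy audio sample for demonstration
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Passing assistant_model enables assisted (speculative) decoding
predicted_ids = model.generate(input_features, assistant_model=assistant_model)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```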
You can see more details in the initial PR from Patrick: [WhisperForCausalLM] Add WhisperForCausalLM for speculative decoding by patrickvonplaten · Pull Request #27195 · huggingface/transformers · GitHub