Is model.generate slower than a model forward call?

I need to get the hidden states at the step where the model outputs the next token. Comparing model.generate with a direct model forward call, I find that the forward call is much faster than model.generate.

Example:
Model: LLaVA-NeXT
Prompt: <image>\n Summarize it in one word:
model forward call:

inputs = processor(input_prompts, images=batch['image'], return_tensors="pt", padding=True).to(device)
# Last layer's hidden state for the final prompt token
emb = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]

model.generate:

inputs = processor(input_prompts, images=batch['image'], return_tensors="pt", padding=True).to(device)
# hidden_states[0] is the first generation step, [-1] the last layer, [:, -1, :] the last prompt token
emb = model.generate(**inputs, max_new_tokens=60, output_hidden_states=True, return_dict_in_generate=True).hidden_states[0][-1][:, -1, :]

The forward call processes 1,000 samples in about 2 minutes, while model.generate needs 20 minutes, roughly 10x slower.

Expected. With max_new_tokens=60, model.generate() runs the model up to 60 times in an auto-regressive loop, one forward pass per generated token, whereas a single forward call processes the whole batch of prompts in one parallel pass. If you want to speed up generation itself, look into a dedicated inference engine such as vLLM.
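
Also, note that your generate-based code only ever reads hidden_states[0], which comes from the prefill pass over the prompt, i.e. the same computation the plain forward call does. So if that first-step embedding is all you need, a forward call (or max_new_tokens=1) gives you the same result without the remaining 59 decode steps. A minimal sketch, assuming the same processor, model, and inputs as in your snippets:

import torch

with torch.no_grad():
    # Single forward pass: last layer's hidden state for the final prompt token.
    emb_fwd = model(**inputs, output_hidden_states=True,
                    return_dict=True).hidden_states[-1][:, -1, :]

    # generate() with one new token: hidden_states[0] is the prefill step,
    # [-1] the last layer, [:, -1, :] the last prompt token.
    out = model.generate(**inputs, max_new_tokens=1,
                         output_hidden_states=True,
                         return_dict_in_generate=True)
    emb_gen = out.hidden_states[0][-1][:, -1, :]

# Both come from the same prefill computation, so they should match
# (with the model in eval mode).
print(torch.allclose(emb_fwd, emb_gen, atol=1e-5))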