I need the hidden states at the position where the model outputs the next token. I compared a plain model forward call with model.generate and found that the forward call is much faster than model.generate.
Example:
Model: LLaVA-NeXT
Prompt: <image>\n Summarize it in one word:
model forward call:
inputs = processor(input_prompts, images=batch['image'], return_tensors="pt", padding=True).to(device)
# hidden_states[-1] = last layer, [:, -1, :] = last (prompt) token position
emb = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
model.generate:
inputs = processor(input_prompts, images=batch['image'], return_tensors="pt", padding=True).to(device)
# hidden_states[0] = first generation step, [-1] = last layer, [:, -1, :] = last token position
emb = model.generate(**inputs, max_new_tokens=60, output_hidden_states=True, return_dict_in_generate=True).hidden_states[0][-1][:, -1, :]
The model forward call processes 1000 samples in about 2 minutes, but model.generate needs about 20 minutes, roughly 10x slower.
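The gap is expected if generate performs one forward pass per new token (autoregressive decoding), so max_new_tokens=60 costs up to 60 sequential passes where the plain call costs one. A minimal pure-Python sketch of this assumption (toy_forward is a hypothetical stand-in for the model's forward pass, not the transformers API):

```python
# Toy sketch: generate-style greedy decoding calls forward once per new token,
# while a single forward call runs exactly once. The first decode step sees
# the same prompt, so its "hidden state" matches the single forward call's.

call_count = 0

def toy_forward(tokens):
    """Hypothetical stand-in for a model forward pass."""
    global call_count
    call_count += 1
    hidden = [len(tokens)]       # stand-in for hidden_states[-1][:, -1, :]
    next_token = tokens[-1] + 1  # stand-in for greedy argmax
    return hidden, next_token

prompt = [1, 2, 3]

# Single forward call: one pass over the prompt.
call_count = 0
hidden_fwd, _ = toy_forward(prompt)
forward_calls = call_count

# generate-style loop with max_new_tokens=60: one pass per new token.
call_count = 0
tokens = list(prompt)
first_step_hidden = None
for _ in range(60):
    hidden, nxt = toy_forward(tokens)
    if first_step_hidden is None:
        first_step_hidden = hidden  # analogous to hidden_states[0] in generate
    tokens.append(nxt)
generate_calls = call_count

print(forward_calls, generate_calls, first_step_hidden == hidden_fwd)
```

Under this assumption, if only the first-step embedding is needed, the single forward call should yield the same tensor that generate exposes as hidden_states[0], and the remaining ~59 decode steps are wasted work.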