I need the hidden states at the position where the model outputs the next token. I compared a plain model forward call with model.generate and found that the forward call is much faster than model.generate.
Example:
Model: LLaVA-NeXT
Prompt: <image>\n Summarize it in one word:
model forward call:
inputs = processor(input_prompts, images=batch['image'], return_tensors="pt", padding=True).to(device)
# hidden_states[-1] = last layer, [:, -1, :] = last (prompt) token position
emb = model(**inputs, output_hidden_states=True, return_dict=True).hidden_states[-1][:, -1, :]
model.generate:
inputs = processor(input_prompts, images=batch['image'], return_tensors="pt", padding=True).to(device)
# hidden_states[0] = first generation step, [-1] = last layer, [:, -1, :] = last token position
emb = model.generate(**inputs, max_new_tokens=60, output_hidden_states=True, return_dict_in_generate=True).hidden_states[0][-1][:, -1, :]
The model forward call processes 1000 samples in about 2 minutes, but model.generate needs about 20 minutes, roughly 10x slower.
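The gap is expected if generate performs one forward pass per new token (autoregressive decoding), so max_new_tokens=60 costs up to 60 sequential passes where the plain call costs one. A minimal pure-Python sketch of this assumption (toy_forward is a hypothetical stand-in for the model's forward pass, not the transformers API):

```python
# Toy sketch: generate-style greedy decoding calls forward once per new token,
# while a single forward call runs exactly once. The first decode step sees
# the same prompt, so its "hidden state" matches the single forward call's.

call_count = 0

def toy_forward(tokens):
    """Hypothetical stand-in for a model forward pass."""
    global call_count
    call_count += 1
    hidden = [len(tokens)]       # stand-in for hidden_states[-1][:, -1, :]
    next_token = tokens[-1] + 1  # stand-in for greedy argmax
    return hidden, next_token

prompt = [1, 2, 3]

# Single forward call: one pass over the prompt.
call_count = 0
hidden_fwd, _ = toy_forward(prompt)
forward_calls = call_count

# generate-style loop with max_new_tokens=60: one pass per new token.
call_count = 0
tokens = list(prompt)
first_step_hidden = None
for _ in range(60):
    hidden, nxt = toy_forward(tokens)
    if first_step_hidden is None:
        first_step_hidden = hidden  # analogous to hidden_states[0] in generate
    tokens.append(nxt)
generate_calls = call_count

print(forward_calls, generate_calls, first_step_hidden == hidden_fwd)
```

Under this assumption, if only the first-step embedding is needed, the single forward call should yield the same tensor that generate exposes as hidden_states[0], and the remaining ~59 decode steps are wasted work.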