What makes the built-in generate method faster than a crude manual implementation?

I need custom generation logic for my use-case, and it appears that the only way I can achieve this is by re-writing the generate method.

I started by writing a simple, crude implementation that, in a loop, feeds the input IDs through the model to get the next-token logits, samples from those logits, and appends the new token to the input IDs.

However, I have noticed that my implementation is significantly slower than if I use the built-in model.generate() method. Here is a snippet of the two implementations:

import time

import torch
import torch.nn.functional as F

# model and tokenizer are assumed to already be loaded and on GPU 0
torch.manual_seed(0)

# manual
input_ids = tokenizer("test", return_tensors='pt', return_token_type_ids=False).to(0)["input_ids"] # tokenise the prompt
input_tokens = len(input_ids[0])
test_tokens = 10

start_time = time.time()
for _ in range(test_tokens):
  model_output = model(input_ids=input_ids, attention_mask=torch.ones_like(input_ids)) # feed the current generation through the model
  logits = model_output.logits[:, -1, :] # get the next token logits
  
  next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1) # sample the logits to get the next token
  input_ids = torch.cat([input_ids, next_token], dim=-1) # add the new token to the current generation
end_time = time.time()
print(f"manual: {end_time-start_time}")


# auto
torch.manual_seed(0)
input_ids = tokenizer("test", return_tensors='pt', return_token_type_ids=False).to(0) # tokenise the prompt
start_time = time.time()
output = model.generate(
    **input_ids, 
    do_sample=True,
    max_length=input_tokens+test_tokens,
  )
end_time = time.time()
print(f"auto:   {end_time-start_time}")

The time to generate 10 tokens with my implementation is 6.08 seconds, and 1.37 seconds with the model.generate() method.

I would appreciate some pointers on what model.generate() does differently to make it so much faster.

Thanks

Hi @Vocabua, I think the two main things are:

(1) Your version doesn’t cache the key and value states, so a lot of computation is duplicated on every iteration of the loop. Check out this post explaining the key-value cache; it includes code snippets showing how to use it in a simple loop like yours.
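
Here’s a minimal sketch of what that can look like, assuming model is a standard causal LM from transformers (so calling it with use_cache=True returns past_key_values); the variable names are just illustrative:

import torch
import torch.nn.functional as F

input_ids = tokenizer("test", return_tensors="pt")["input_ids"].to(0)
past_key_values = None  # the key-value cache, filled in by the first forward pass
generated = input_ids

with torch.no_grad():  # no gradients are needed for generation
    for _ in range(10):
        if past_key_values is None:
            # first step: run the full prompt through the model
            model_output = model(input_ids=generated, use_cache=True)
        else:
            # later steps: feed only the newest token plus the cached key/value states
            model_output = model(
                input_ids=generated[:, -1:],
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = model_output.past_key_values
        logits = model_output.logits[:, -1, :]
        next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        generated = torch.cat([generated, next_token], dim=-1)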

(2) There’s a lot of other caching and warm-up that PyTorch and CUDA do behind the scenes, which probably disadvantages whichever loop runs first. To see how much this matters in your setup, try swapping the order of the two code blocks and check whether the Hugging Face version suddenly becomes much slower. To offset this, you could add a call at the very beginning to “warm up” the model, like this:

# device should match wherever the model lives (e.g. 0 or "cuda:0" in your snippet)
temp_inputs = tokenizer("test", return_tensors="pt").to(device)
model.generate(**temp_inputs, max_length=10)

Or call the two versions of the generation loop in different scripts.

One other thing to note: to make it a true apples-to-apples comparison, you should use greedy decoding in both versions. When sampling, the Hugging Face generate method applies extra logits processing such as top-p filtering by default.
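
For instance, a rough sketch reusing the names from your snippets (input_tokens and test_tokens as above):

# Hugging Face side: with do_sample=False there is no top-k / top-p / temperature processing
inputs = tokenizer("test", return_tensors="pt").to(0)
greedy_output = model.generate(**inputs, do_sample=False, max_length=input_tokens + test_tokens)

# Manual side: inside your loop, replace the multinomial sampling line with an argmax
# next_token = torch.argmax(logits, dim=-1, keepdim=True)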


This is perfect, thank you for your help!

