What makes the built-in generate method faster than a crude manual implementation?

I need custom generation logic for my use-case, and it appears that the only way I can achieve this is by re-writing the generate method.

I started by writing a simple, crude implementation that, in a loop, feeds the input IDs through the model to get the next-token logits, samples from those logits, and appends the new token to the input IDs.

However, I have noticed that my implementation is significantly slower than if I use the built-in model.generate() method. Here is a snippet of the two implementations:

import time

import torch
import torch.nn.functional as F

# model and tokenizer are assumed to already be loaded and on GPU 0
torch.manual_seed(0)

# manual
input_ids = tokenizer("test", return_tensors='pt', return_token_type_ids=False).to(0)["input_ids"] # tokenise the prompt
input_tokens = len(input_ids[0])
test_tokens = 10

start_time = time.time()
for _ in range(test_tokens):
  model_output = model(input_ids=input_ids, attention_mask=torch.ones_like(input_ids)) # feed the current generation through the model
  logits = model_output.logits[:, -1, :] # get the next token logits
  
  next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1) # sample the logits to get the next token
  input_ids = torch.cat([input_ids, next_token], dim=-1) # add the new token to the current generation
end_time = time.time()
print(f"manual: {end_time-start_time}")


# auto
torch.manual_seed(0)
input_ids = tokenizer("test", return_tensors='pt', return_token_type_ids=False).to(0) # tokenise the prompt
start_time = time.time()
output = model.generate(
    **input_ids, 
    do_sample=True,
    max_length=input_tokens+test_tokens,
  )
end_time = time.time()
print(f"auto:   {end_time-start_time}")

The time to generate 10 tokens with my implementation is 6.08 seconds, and 1.37 seconds with the model.generate() method.

I would appreciate some pointers on what model.generate() does differently to make it so much faster.

Thanks

Hi @Vocabua, I think the two main things are:

(1) Your version doesn’t cache the key and value states, so a lot of computation is duplicated on every iteration of the loop. Check out this post explaining the key-value cache; it includes code snippets showing how to use it in a simple loop like yours.
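
Here’s a minimal sketch of what that can look like, assuming model is a standard causal LM from transformers (so calling it with use_cache=True returns past_key_values); the variable names are just illustrative:

import torch
import torch.nn.functional as F

input_ids = tokenizer("test", return_tensors="pt")["input_ids"].to(0)
past_key_values = None  # the key-value cache, filled in by the first forward pass
generated = input_ids

with torch.no_grad():  # no gradients are needed for generation
    for _ in range(10):
        if past_key_values is None:
            # first step: run the full prompt through the model
            model_output = model(input_ids=generated, use_cache=True)
        else:
            # later steps: feed only the newest token plus the cached key/value states
            model_output = model(
                input_ids=generated[:, -1:],
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = model_output.past_key_values
        logits = model_output.logits[:, -1, :]
        next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        generated = torch.cat([generated, next_token], dim=-1)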

(2) There’s a lot of other caching and warm-up that PyTorch and CUDA do behind the scenes, which probably disadvantages whichever loop runs first. To see how much this matters in your setup, try swapping the order of the two code blocks and check whether the Hugging Face version suddenly becomes much slower. To offset this, you could add a call at the very beginning to “warm up” the model, like this:

# device should match wherever the model lives (e.g. 0 or "cuda:0" in your snippet)
temp_inputs = tokenizer("test", return_tensors="pt").to(device)
model.generate(**temp_inputs, max_length=10)

Or call the two versions of the generation loop in different scripts.

One other thing to note: to make it a true apples-to-apples comparison, you should use greedy decoding in both versions. When sampling, the Hugging Face generate method applies extra logits processing such as top-p filtering by default.
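
For instance, a rough sketch reusing the names from your snippets (input_tokens and test_tokens as above):

# Hugging Face side: with do_sample=False there is no top-k / top-p / temperature processing
inputs = tokenizer("test", return_tensors="pt").to(0)
greedy_output = model.generate(**inputs, do_sample=False, max_length=input_tokens + test_tokens)

# Manual side: inside your loop, replace the multinomial sampling line with an argmax
# next_token = torch.argmax(logits, dim=-1, keepdim=True)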


This is perfect, thank you for your help!

