Generate without using the generate method

Posting this here for visibility. What if you want to decode the output of a generative seq2seq model (like T5, BART, etc.) yourself, without using the .generate() method? The code example below illustrates this.

Suppose that the model is given a long text for which it needs to generate a summary. We illustrate here how to manually decode the generated ids autoregressively: at each time step, we append the token id predicted by the model to the decoder_input_ids, which are then fed back to the decoder at the next time step. At the beginning, we only feed the decoder_start_token_id to the decoder of the model.

from transformers import BartTokenizer, BartForConditionalGeneration
import torch

model_name = "sshleifer/distilbart-cnn-6-6"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

text = """The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."""

input_ids = tokenizer(text, return_tensors="pt").input_ids

decoder_input_ids = [model.config.decoder_start_token_id]
predicted_ids = []
for i in range(20):
    outputs = model(input_ids=input_ids, decoder_input_ids=torch.tensor([decoder_input_ids]))
    # take the logits of the last decoder position
    logits = outputs.logits[:, -1, :]
    # perform argmax on the last dimension (i.e. greedy decoding)
    predicted_id = logits.argmax(-1).item()
    predicted_ids.append(predicted_id)
    print(tokenizer.decode([predicted_id]))
    # add the predicted id to decoder_input_ids for the next time step
    decoder_input_ids = decoder_input_ids + [predicted_id]

This will print:

The
 E
iff
el
 Tower
 is
 324
 metres
 (
1
,
06
3
 ft
)
 tall
,
 about
 the
 same

The final result can also be printed using print(tokenizer.decode(predicted_ids)):

The Eiffel Tower is 324 metres (1,063 ft) tall, about the same

Note that we’ve only done 20 time steps here. Normally, one continues until the model generates the EOS (end of sequence) token, which for BART is </s>.
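For completeness, here is a minimal sketch of the same greedy loop that stops as soon as the EOS token is predicted (using model.config.eos_token_id), with a maximum length as a safety net; model, tokenizer and input_ids are assumed to be defined as above:

# same greedy loop, but stop once the model predicts the EOS token
decoder_input_ids = [model.config.decoder_start_token_id]
predicted_ids = []
max_length = 100  # safety net in case EOS is never predicted
for _ in range(max_length):
    outputs = model(input_ids=input_ids, decoder_input_ids=torch.tensor([decoder_input_ids]))
    predicted_id = outputs.logits[:, -1, :].argmax(-1).item()
    if predicted_id == model.config.eos_token_id:
        break
    predicted_ids.append(predicted_id)
    decoder_input_ids.append(predicted_id)

print(tokenizer.decode(predicted_ids))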


Hi Niels, thanks for sharing the code. Would you mind also sharing some examples of situations in which you would prefer not to use the .generate() method?

If you are deploying your model on Triton Inference Server and running inference through the Triton client, there is no generate() method available to help you; I used this method to decode the model's output.


Let us suppose we want to restrict our vocabulary to some specific set of tokens (that changes dynamically with each time step). What is the best way of incorporating that? Other than decoding each token individually?
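One way to do that on top of the manual loop above (just a sketch, not an official API) is to mask out the logits of every disallowed token id before the argmax at each step; the function allowed_ids_for_step below is hypothetical and stands in for whatever produces the allowed set at step i:

# sketch: restrict each step to a (possibly changing) set of allowed token ids
decoder_input_ids = [model.config.decoder_start_token_id]
for i in range(20):
    outputs = model(input_ids=input_ids, decoder_input_ids=torch.tensor([decoder_input_ids]))
    logits = outputs.logits[:, -1, :]
    # set the logits of all disallowed ids to -inf so they can never be picked
    mask = torch.full_like(logits, float("-inf"))
    allowed = torch.tensor(allowed_ids_for_step(i))  # hypothetical helper
    mask[:, allowed] = 0.0
    predicted_id = (logits + mask).argmax(-1).item()
    decoder_input_ids.append(predicted_id)

If you do end up using .generate(), its prefix_allowed_tokens_fn argument covers a similar use case.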

Thanks for the post. I’m finding this to be much slower than the generate() method for my use case (the Whisper model). Is this expected?

That might be because this doesn’t cache the hidden states when generating, if I understand correctly. You would need to keep past_key_values or something like that by making sure use_cache is True in your model config.

Otherwise in the above snippet you’re re-computing the entire past sequence every time you want a next token, despite the fact that causal attention means all the past hidden states are constant.
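A rough sketch of what that could look like, assuming the model accepts encoder_outputs, past_key_values and use_cache in its forward pass (as BART does): run the encoder only once, and at every step feed just the newly predicted token together with the cached key/value states, which is essentially what .generate() does under the hood.

# sketch: greedy decoding with key/value caching
encoder_outputs = model.get_encoder()(input_ids)  # run the encoder only once
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
past_key_values = None
predicted_ids = []
for _ in range(20):
    outputs = model(
        encoder_outputs=encoder_outputs,
        decoder_input_ids=decoder_input_ids,
        past_key_values=past_key_values,
        use_cache=True,
    )
    past_key_values = outputs.past_key_values
    predicted_id = outputs.logits[:, -1, :].argmax(-1)
    predicted_ids.append(predicted_id.item())
    # only the newly predicted token is fed at the next step
    decoder_input_ids = predicted_id.unsqueeze(-1)

print(tokenizer.decode(predicted_ids))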

This may help a lot. What if the decoding is using beam search?
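The snippets above only cover greedy decoding, but the same manual loop can be extended to beam search. Below is a minimal, unbatched sketch (no length penalty, no caching, and the encoder is re-run for every beam at every step), meant only to show the mechanics; .generate() implements this far more efficiently:

import torch.nn.functional as F

num_beams = 4
max_steps = 20
# each beam is a (token_ids, cumulative log-probability) pair
beams = [([model.config.decoder_start_token_id], 0.0)]
finished = []

for _ in range(max_steps):
    candidates = []
    for token_ids, score in beams:
        outputs = model(input_ids=input_ids, decoder_input_ids=torch.tensor([token_ids]))
        log_probs = F.log_softmax(outputs.logits[:, -1, :], dim=-1).squeeze(0)
        top_log_probs, top_ids = log_probs.topk(num_beams)
        for lp, tid in zip(top_log_probs.tolist(), top_ids.tolist()):
            candidates.append((token_ids + [tid], score + lp))
    # keep the best candidates; beams ending in EOS are moved to `finished`
    candidates.sort(key=lambda c: c[1], reverse=True)
    beams = []
    for token_ids, score in candidates:
        if token_ids[-1] == model.config.eos_token_id:
            finished.append((token_ids, score))
        elif len(beams) < num_beams:
            beams.append((token_ids, score))
    if not beams:
        break

best_ids, _ = max(finished + beams, key=lambda c: c[1])
print(tokenizer.decode(best_ids, skip_special_tokens=True))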