Generate() and automatic truncation of context

Hey guys, I have a question regarding language modeling:

I have a decoder-only model (Llama) and want to generate a sequence. During generation, I want the context to stay continuously “full”, i.e. equal to the model’s maximum context size from training. As I understand it, this means dropping the first token of the context after every newly generated token, i.e. a sliding window.

However, I can’t find this functionality in generate(). Instead, the MaxLengthCriteria assumes that to generate N new tokens, my prompt is at most (CONTEXT_LEN - N) tokens long, since max_length counts the prompt as well.

The “manual” approach would be to call generate() one token at a time and cut the first token off the sequence after each step. But surely this can be done more elegantly?
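For reference, this is a minimal sketch of that manual loop as I imagine it (the checkpoint name, `CONTEXT_LEN`, `NUM_NEW_TOKENS`, and the prompt are placeholders for my setup, not anything generate() provides out of the box):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder settings -- adjust to your checkpoint and hardware.
MODEL_NAME = "meta-llama/Llama-2-7b-hf"
CONTEXT_LEN = 4096      # the model's training context size
NUM_NEW_TOKENS = 256    # how many tokens to generate in total

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

input_ids = tokenizer("Some long prompt ...", return_tensors="pt").input_ids.to(model.device)
generated = []

with torch.no_grad():
    for _ in range(NUM_NEW_TOKENS):
        # Keep only the most recent CONTEXT_LEN tokens (sliding window).
        input_ids = input_ids[:, -CONTEXT_LEN:]
        # Generate exactly one token from the current window.
        out = model.generate(
            input_ids,
            max_new_tokens=1,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
        next_token = out[:, -1:]
        generated.append(next_token)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

One thing I suspect makes this inherently expensive: since the window shifts by one position every step, the key/value cache from the previous step can’t simply be reused (the positions of the cached tokens change under the position embeddings), so each iteration pays for a full forward pass over the whole window.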

Thank you.