Hey guys, I have a question regarding language modeling:
I have a decoder-only model (Llama) and want to generate a sequence. During generation, I want the context to stay continuously “full”, i.e. equal to the model’s maximum context size from training. To my understanding, this means dropping the first token of the sequence after every single token is generated.
However, I can’t find this functionality in generate().
Instead, the MaxLengthCriteria assumes that for the generation of N new tokens, my context length is at most (CONTEXT_LEN - N).
The “manual” approach would be to generate one token at a time with generate() and drop the first token of the sequence after each step (rough sketch below). But I’m sure this can be done more elegantly?
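For concreteness, here is a minimal sketch of that manual loop, assuming a Llama-style checkpoint loaded through transformers (the checkpoint name, prompt, and token budget are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any Llama-style causal LM should work.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

CONTEXT_LEN = model.config.max_position_embeddings  # model's training context size

prompt = "Some long prompt ..."  # placeholder
generated = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(100):  # placeholder number of new tokens
    # Keep only the most recent CONTEXT_LEN tokens so the window stays "full".
    window = generated[:, -CONTEXT_LEN:]
    # Generate exactly one token from the truncated window.
    out = model.generate(window, max_new_tokens=1, do_sample=False)
    # generate() returns the input plus the new token; take the last position.
    next_token = out[:, -1:]
    generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

Part of why this feels clunky is that each generate() call re-encodes the entire window from scratch: once the window slides, the KV cache from the previous call can no longer be reused as-is.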
Thank you.