Hey guys, I have a question regarding language modeling:
I have a decoder-only model (Llama) and want to generate a sequence. During generation, I want the context to stay continuously “full”, i.e. equal to the model’s maximum context size from training. To my understanding, this means dropping the first token of the sequence after every single token is generated.
However, I can’t find this functionality in generate().
Instead, the MaxLengthCriteria assumes that for the generation of N new tokens, my context length is at most (CONTEXT_LEN - N).
The “manual” approach would be to generate one token at a time with generate() and drop the first token of the sequence after each step (rough sketch below). But I’m sure this can be done more elegantly?
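For concreteness, here is a minimal sketch of that manual loop, assuming a Llama-style checkpoint loaded through transformers (the checkpoint name, prompt, and token budget are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any Llama-style causal LM should work.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

CONTEXT_LEN = model.config.max_position_embeddings  # model's training context size

prompt = "Some long prompt ..."  # placeholder
generated = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(100):  # placeholder number of new tokens
    # Keep only the most recent CONTEXT_LEN tokens so the window stays "full".
    window = generated[:, -CONTEXT_LEN:]
    # Generate exactly one token from the truncated window.
    out = model.generate(window, max_new_tokens=1, do_sample=False)
    # generate() returns the input plus the new token; take the last position.
    next_token = out[:, -1:]
    generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

Part of why this feels clunky is that each generate() call re-encodes the entire window from scratch: once the window slides, the KV cache from the previous call can no longer be reused as-is.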
Thank you.