Output token lengths of smaller models

himankk · October 30, 2023, 9:53am

I am trying to fine-tune a model for a specific task in a specific domain. Because of memory constraints, I want the model to be as small as possible. So far I have managed to fine-tune some of the smaller Bloomz models using peft and QLoRA, and I am running inference on them with a pipeline.

The problem is that, by default, the model generates very few (1-2) new tokens and then stops. If I increase the “max_new_tokens” parameter, the model generates the first few (again, 1-2) tokens correctly and then repeats some random text over and over until it reaches the “max_new_tokens” value.

However, if I run the pipeline without “max_new_tokens” in a loop, where the last generated output is the input for the next iteration, it generates correct outputs. This takes too long to produce the full output, though, because every iteration is a separate inference call.
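To make the loop workaround concrete, here is a minimal sketch of it. The names `generate_in_loop`, `step`, and `make_fake_step` are hypothetical and are not part of any library; `step` stands in for one pipeline call that returns the prompt plus whatever new text was generated. A fake step function is used so the sketch runs without a model.

```python
from typing import Callable, List


def generate_in_loop(step: Callable[[str], str], prompt: str,
                     max_iterations: int = 50) -> str:
    """Feed each output back in as the next prompt until nothing new is added.

    `step` is an assumed callable: it takes the text so far and returns
    that text plus a short continuation (like one short pipeline call).
    """
    text = prompt
    for _ in range(max_iterations):
        new_text = step(text)   # one full inference call per iteration
        if new_text == text:    # no new tokens were produced, so stop
            break
        text = new_text
    return text


def make_fake_step(words: List[str]) -> Callable[[str], str]:
    """Build a stand-in for the model that appends one word per call."""
    remaining = list(words)

    def step(text: str) -> str:
        if not remaining:
            return text
        return text + " " + remaining.pop(0)

    return step


result = generate_in_loop(make_fake_step(["b", "c", "d"]), "a")
print(result)
```

Each iteration is a separate call that processes the whole, growing prompt again, which is likely why this approach is slow compared with a single longer generation.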

I also tried increasing “max_length”, but the same thing happens in this case too.

My question is: is there a parameter in the model's generation config that takes care of this? Alternatively, is there any way, while training, to increase the length of the output generated by the model without affecting its quality? Is there a parameter that controls how much context is considered in a single generation that I am missing?
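For context, these are some of the generation parameters that a text-generation pipeline accepts and that relate to output length and repetition. This is only a sketch: the helper names `build_generation_kwargs` and `run_with_pipeline` are hypothetical, the values are arbitrary starting points, and the public `bigscience/bloomz-560m` checkpoint stands in for the fine-tuned model. Whether any of these settings resolves the behaviour described above is not confirmed.

```python
from typing import Any, Dict, Optional


def build_generation_kwargs(max_new_tokens: int = 128,
                            eos_token_id: Optional[int] = None) -> Dict[str, Any]:
    """Collect keyword arguments for a text-generation pipeline call."""
    kwargs: Dict[str, Any] = {
        "max_new_tokens": max_new_tokens,  # cap on newly generated tokens
        "do_sample": False,                # greedy decoding
        "repetition_penalty": 1.2,         # assumed value; discourages repeats
        "no_repeat_ngram_size": 3,         # assumed value; blocks repeated 3-grams
        "return_full_text": False,         # return only the continuation
    }
    if eos_token_id is not None:
        # Generation stops once the end-of-sequence token is produced.
        kwargs["eos_token_id"] = eos_token_id
        kwargs["pad_token_id"] = eos_token_id
    return kwargs


def run_with_pipeline(prompt: str) -> str:
    """Run one generation; requires `transformers` and downloads the model."""
    from transformers import pipeline

    generator = pipeline("text-generation", model="bigscience/bloomz-560m")
    kwargs = build_generation_kwargs(64, generator.tokenizer.eos_token_id)
    return generator(prompt, **kwargs)[0]["generated_text"]


# Uncomment to download the model and generate:
# print(run_with_pipeline("Translate to English: Je t'aime."))
```

If the fine-tuning examples did not end with the tokenizer's end-of-sequence token, the model may never have learned when to stop, so that is one thing worth checking alongside these settings.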

Any help on this is appreciated.
