I am trying to fine-tune a model for a specific task in a specific domain. Because of memory constraints, I want the model to be as small as possible. I have managed to fine-tune some of the smaller Bloomz models using peft and QLoRA, and I am now running inference on them with a pipeline.
The problem is that, by default, the model generates very few new tokens (1-2) and then stops. If I increase the “max_new_tokens” parameter, the model generates the first few tokens (again, 1-2) correctly and then repeats some random text over and over until it reaches the “max_new_tokens” limit.
However, if I run the pipeline without “max_new_tokens” in a loop, where the output of one iteration becomes the input to the next, it generates correct outputs. This takes too long to produce the full output, though, because every iteration is a separate inference call.
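For reference, the loop workaround described above can be sketched roughly as follows. This is a minimal sketch, not my exact code: `generate_in_loop` and `generate_once` are hypothetical names, and `generate_once` stands for any callable that takes the text so far and returns the prompt plus the newly generated tokens.

```python
from typing import Callable


def generate_in_loop(generate_once: Callable[[str], str], prompt: str, rounds: int) -> str:
    """Feed each round's full output back in as the next round's prompt.

    `generate_once` is a hypothetical wrapper around one short generation call.
    It is assumed to return the input text followed by the new tokens.
    """
    text = prompt
    for _ in range(rounds):
        new_text = generate_once(text)
        if new_text == text:
            # Nothing new was produced, so further rounds would not help.
            break
        text = new_text
    return text


# Assumed wiring for a transformers text-generation pipeline named `pipe`
# (not executed here; by default the pipeline returns prompt + continuation):
#   generate_once = lambda p: pipe(p)[0]["generated_text"]
#   result = generate_in_loop(generate_once, "my prompt", rounds=50)
```

Each round re-encodes the whole text from scratch, which is why this approach gets slower as the output grows.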
I also tried increasing “max_length”, but the same thing happens.
My question is: is there a parameter in the model's generation config that takes care of this? Alternatively, is there anything I can do during training to increase the length of the generated output without affecting its quality? Is there a parameter I am missing that controls how much context is considered in a single generation?
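For concreteness, the generation arguments that look related to this question can be collected as below. This is a sketch only: `build_generation_kwargs` is a hypothetical helper, the values are placeholders, and nothing here is confirmed to fix the behavior described above. The keys are arguments accepted by `generate()` in transformers.

```python
from typing import Optional


def build_generation_kwargs(max_new_tokens: int = 128,
                            eos_token_id: Optional[int] = None) -> dict:
    """Collect generation arguments to pass to a pipeline or to model.generate().

    All values are placeholders chosen for illustration, not tuned settings.
    """
    kwargs = {
        "max_new_tokens": max_new_tokens,  # upper bound on newly generated tokens
        "repetition_penalty": 1.2,         # values above 1.0 penalize repeated tokens
        "no_repeat_ngram_size": 3,         # forbid repeating any 3-gram
        "do_sample": False,                # greedy decoding
    }
    if eos_token_id is not None:
        # Generation stops once this token id is produced.
        kwargs["eos_token_id"] = eos_token_id
    return kwargs


# Assumed usage with a pipeline `pipe` and its tokenizer (not executed here):
#   pipe(prompt, **build_generation_kwargs(256, tokenizer.eos_token_id))
```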
Any help on this is appreciated.