I’ve been using the Llama 3 Instruct model for text-generation inference. I’m loading prompts from the Alpaca dataset into a list of messages and passing it to the model to generate. What I notice is that the model doesn’t generate text for all the prompts; instead it stops after generating for the first 50 or so. I assume this happens because the model hits its max-new-token limit before it gets through all the input prompts. Ideally, I want the max-new-token limit to apply to each prompt individually rather than to the prompts as a whole. Is there a way to achieve this, or am I doing something wrong in how I prepare the prompts? The relevant snippet of my code is attached below.
import torch
import transformers
from datasets import load_dataset
from accelerate import PartialState

# base_model, args, and context come from elsewhere in my script
ds = load_dataset("tatsu-lab/alpaca")
messages = [{}]

pipeline = transformers.pipeline(
    "text-generation",
    model=base_model,
    model_kwargs={
        "torch_dtype": torch.float16,
        # "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
    device_map={"": PartialState().process_index} if args["ddp"] else "auto",
)

messages[0]["role"] = "user"
messages[0]["content"] = """Generate text for all prompts"""

# Append every instruction-only Alpaca prompt to the single user message
j = 0
for i in range(0, 100):
    if ds["train"][i]["input"] == "":
        # print(ds["train"][i]["instruction"])
        j = j + 1
        s = messages[0]["content"]
        s = s + "\n " + ds["train"][i]["instruction"] + ","
        messages[0]["content"] = s

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

with context:
    outputs = pipeline(
        prompt,
        batch_size=int(args["batch_size"]),
        max_new_tokens=8000,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.4,
        top_k=20,
        top_p=0.95
    )
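For reference, this is roughly what I was hoping the call would look like: each instruction formatted as its own chat prompt, so max_new_tokens is enforced per generation rather than once for the concatenated message. This is only a rough, untested sketch that reuses the pipeline, ds, args, and terminators from above; the max_new_tokens value and the pad-token line are my own guesses, not something I’ve confirmed.

# Sketch: one chat-formatted prompt per Alpaca instruction
prompts = []
for i in range(0, 100):
    if ds["train"][i]["input"] == "":
        per_prompt_messages = [
            {"role": "user", "content": ds["train"][i]["instruction"]}
        ]
        prompts.append(
            pipeline.tokenizer.apply_chat_template(
                per_prompt_messages,
                tokenize=False,
                add_generation_prompt=True,
            )
        )

# Llama's tokenizer has no pad token by default, so batched generation
# presumably needs one set (assumption on my part):
pipeline.tokenizer.pad_token_id = pipeline.tokenizer.eos_token_id

outputs = pipeline(
    prompts,                             # list of prompts -> one generation each
    batch_size=int(args["batch_size"]),
    max_new_tokens=512,                  # per-prompt budget; value is arbitrary here
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.4,
    top_k=20,
    top_p=0.95,
)

Is passing a list of prompts like this the intended way to get a per-prompt token limit, or is there a better approach?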