I've just started playing with Llama 3 4-bit models. When I call the model's generate function normally, the model produces the result as expected and stops. But when I wrap the same code in a function, the model produces the response and then keeps repeating it until it hits the max_new_tokens limit. Below are my code and outputs:
The model and tokenizer are loaded as follows:
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit", dtype="Float16")
model = AutoModelForCausalLM.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
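For reference, this is roughly how I would expect a 4-bit checkpoint like this to be loaded with plain transformers (a sketch: as far as I know AutoTokenizer.from_pretrained has no dtype argument, so it is likely being ignored above, and the compute dtype belongs on the model; the bnb-4bit checkpoint should already carry its quantization config):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/llama-3-8b-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The bnb-4bit checkpoint ships with its quantization config, so only
# device placement and the compute dtype need to be specified here.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)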
generate() called inside a function
Code
def run():
    input_prompt = """You are a query generator, Generate a query given user query and previous History. Generate only response and nothing else in response field
### History:
{}
### User query:
{}
### Response:
{}"""
    hist = """CBSE schools in secunderabad"""
    uquery = """ICSE schools in same location"""
    inputs = tokenizer([input_prompt.format(hist, uquery, "")], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=len(inputs['input_ids']) + 75, pad_token_id=model.config.eos_token_id)
    result_prompt = tokenizer.batch_decode(outputs)[0]
    print(result_prompt)

run()
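One thing worth noting in both versions: len(inputs['input_ids']) is the batch size (1 here), not the prompt length, since input_ids has shape (batch, seq_len), and max_new_tokens only counts newly generated tokens anyway, so adding the prompt length is unnecessary. A more direct call (a sketch of what I believe is the intended behaviour) would be:

# input_ids has shape (batch, seq_len): len() gives the batch size, not the token count
outputs = model.generate(
    **inputs,
    max_new_tokens=75,                    # counts only newly generated tokens
    eos_token_id=tokenizer.eos_token_id,  # stop as soon as the model emits EOS
    pad_token_id=tokenizer.eos_token_id,
)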
OUTPUT
<|begin_of_text|>You are a query generator, Generate a query given user query and previous History. Generate only response and nothing else in response field
### History:
CBSE schools in secunderabad
### User query:
ICSE schools in same location
### Response:
ICSE schools in secunderabad
### History:
CBSE schools in secunderabad
ICSE schools in same location
### User query:
ICSE schools in same location
### Response:
ICSE schools in secunderabad
CBSE schools in secunderabad
### History:
CBSE schools in secunderabad
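The repeating "### History:" blocks above suggest the base (non-instruct) model never emits an EOS token for this prompt format, so generation only ends at max_new_tokens. One workaround, independent of the function/no-function difference, is a custom stopping criterion that halts as soon as the model starts a new header (a sketch; StopOnSubstring is a hypothetical helper, not a transformers built-in):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnSubstring(StoppingCriteria):
    # Stops generation once stop_string appears in the newly generated text.
    def __init__(self, tokenizer, stop_string, prompt_len):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_len = prompt_len

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_len:])
        return self.stop_string in new_text

prompt_len = inputs["input_ids"].shape[1]
stopping = StoppingCriteriaList([StopOnSubstring(tokenizer, "### History:", prompt_len)])
outputs = model.generate(**inputs, max_new_tokens=75, stopping_criteria=stopping)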
Same code outside of a function
Code
input_prompt = """You are a query generator, Generate a query given user query and previous History. Generate only response and nothing else in response field
### History:
{}
### User query:
{}
### Response:
{}"""
hist = """CBSE schools in secunderabad"""
uquery = """ICSE schools in same location"""
inputs = tokenizer([input_prompt.format(hist, uquery, "")], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=len(inputs['input_ids']) + 75, pad_token_id=model.config.eos_token_id)
result_prompt = tokenizer.batch_decode(outputs)[0]
print(result_prompt)
OUTPUT
Here the model generates the response once and stops, as expected.
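Incidentally, if you only want the model's continuation rather than the echoed prompt, you can slice off the input tokens before decoding, e.g.:

prompt_len = inputs["input_ids"].shape[1]
generated = outputs[0][prompt_len:]  # keep only the newly generated tokens
print(tokenizer.decode(generated, skip_special_tokens=True))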