Unsloth 4-bit Llama models acting weirdly when used in a function

I've just started playing with Llama 3 4-bit models. When I call the model's generate function at the top level of a script, it produces the expected result and stops. But when I wrap the same code in a function, the model generates the response and then keeps repeating it until it hits the max_new_tokens limit. My code and outputs are below.
The model and tokenizer are loaded as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-bnb-4bit")
# torch_dtype belongs on the model, not the tokenizer
model = AutoModelForCausalLM.from_pretrained("unsloth/llama-3-8b-bnb-4bit", torch_dtype=torch.float16, device_map="cuda")
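
In case the loading path matters: I believe the same checkpoint can also be loaded with an explicit quantization config (a minimal sketch on my part, assuming bitsandbytes and accelerate are installed; the runs below use the simple load above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed equivalent to loading the pre-quantized checkpoint directly;
# bnb_4bit_compute_dtype=float16 is my choice, not something from my original run
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("unsloth/llama-3-8b-bnb-4bit", quantization_config=bnb_config, device_map="auto")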
Calling generate inside a function
Code
def run():
    input_prompt = """You are a query generator, Generate a query given user query and previous History. Generate only response and nothing else in response field
### History:
{}
### User query:
{}
### Response:
{}"""
    hist = """CBSE schools in secunderabad"""
    uquery = """ICSE schools in same location"""
    inputs = tokenizer([input_prompt.format(hist, uquery, "")], return_tensors="pt").to("cuda")
    # prompt length in tokens + 75; len(inputs["input_ids"]) would give the batch size (1), not the token count
    outputs = model.generate(**inputs, max_new_tokens=inputs["input_ids"].shape[1] + 75, pad_token_id=model.config.eos_token_id)
    result_prompt = tokenizer.batch_decode(outputs)[0]
    print(result_prompt)

run()
OUTPUT
<|begin_of_text|>You are a query generator, Generate a query given user query and previous History. Generate only response and nothing else in response field
### History:
CBSE schools in secunderabad
### User query:
ICSE schools in same location
### Response:
ICSE schools in secunderabad
### History:
CBSE schools in secunderabad
ICSE schools in same location
### User query:
ICSE schools in same location
### Response:
ICSE schools in secunderabad
CBSE schools in secunderabad
### History:
CBSE schools in secunderabad
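
For now I work around the repetition by truncating the decoded text at the first repeated section header (a stopgap sketch; picking "### History:" as the cut marker is just my heuristic, not a real fix):

def extract_response(decoded: str) -> str:
    # Keep only the text between the prompt's "### Response:" marker
    # and the first repeated "### History:" header, if any
    response = decoded.split("### Response:")[1]
    return response.split("### History:")[0].strip()

# e.g. inside run(): print(extract_response(result_prompt))
# -> "ICSE schools in secunderabad" for the output above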
Same code outside of a function
Code
input_prompt = """You are a query generator, Generate a query given user query and previous History. Generate only response and nothing else in response field
### History:
{}
### User query:
{}
### Response:
{}"""
hist = """CBSE schools in secunderabad"""
uquery = """ICSE schools in same location"""
inputs = tokenizer([input_prompt.format(hist, uquery, "")], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=inputs["input_ids"].shape[1] + 75, pad_token_id=model.config.eos_token_id)
result_prompt = tokenizer.batch_decode(outputs)[0]
print(result_prompt)
OUTPUT
(Here the model prints the prompt followed by the single expected response and stops; no repetition.)
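
Both runs pass pad_token_id=model.config.eos_token_id to generate. As a sanity check I also print the stop tokens, in case someone spots a mismatch:

print(tokenizer.eos_token, tokenizer.eos_token_id)
print(model.config.eos_token_id)

Any idea why wrapping the exact same code in a function changes the stopping behaviour?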