meta-llama/Llama-2-7b-chat-hf gives weird responses compared to the ones returned by the HF API

Hello, I can’t say I know much about how the models work internally, but I am trying to get the same behaviour I see in the example section on the right of the Llama-2-7b-chat-hf model page.

I am running the model locally and I give both the web API version and my local setup the following prompt:

<<SYS>>
<large chunk of text with facts about apples>
<</SYS>>

[INST]
User: What are the nutrition facts about Apples
[/INST]

The response from the API looks very nice and is exactly what I want, but the local one comes back with weird source links and is overall not good.

This is my code

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = get_file_contents("prompt.txt")  # same prompt as above
# Tokenize the input prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a response from the model
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,  # without this, temperature/top_p/top_k are ignored
    temperature=0.7,
    repetition_penalty=1.5,
    top_p=0.9,
    top_k=50,
)

# Decode the model's output and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

What am I doing wrong? The resources on my PC don’t allow it to reach 32 GB of RAM; it caps at 25–27 GB. Could that be the issue, or does that only affect speed?

“or does that only affect speed?”

Basically, yes, that should only affect speed. There are few cases where you get half-baked results because of insufficient hardware: either it works or it doesn’t, and it is either fast or slow.

I found the official HF implementation for Llama 2. The tokenizer.use_default_system_prompt = False setting may be what matters here.
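For example, instead of hand-assembling the [INST]/<<SYS>> string, you can let the tokenizer build the prompt from role/content messages. This is only a minimal sketch, assuming a transformers version that ships a chat template for this model and still honours the use_default_system_prompt flag; the system text and question are placeholders, not your actual prompt.txt:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.use_default_system_prompt = False  # don't inject Meta's default system prompt

# The chat template turns role/content messages into the exact
# [INST] <<SYS>> ... <</SYS>> ... [/INST] string the chat model was trained on
messages = [
    {"role": "system", "content": "<facts about apples>"},
    {"role": "user", "content": "What are the nutrition facts about Apples"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # inspect the string that will actually be fed to the model

Printing the templated prompt and comparing it with the contents of prompt.txt is usually enough to see where a hand-written prompt diverges.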

Since Llama 2 has been around for a long time, it has been affected by various HF specification changes, so there is likely some confusion about how to use it.
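For reference, if you keep building the prompt by hand, the format the chat model was trained on nests the <<SYS>> block inside the [INST] ... [/INST] tags and has no “User:” prefix. A rough sketch with placeholder text, not your exact prompt:

system_text = "<facts about apples>"
user_text = "What are the nutrition facts about Apples"

# Llama 2 chat format: the <<SYS>> block sits inside [INST] ... [/INST]
prompt = f"[INST] <<SYS>>\n{system_text}\n<</SYS>>\n\n{user_text} [/INST]"
# the tokenizer prepends the <s> (BOS) token itself, so it is not written here

That difference in prompt layout may be where the local run and the hosted widget diverge.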