Hello, I can’t say I know much about how these models work internally, but I am trying to get the same behaviour I see in the example section on the right of the Llama-2-7b-chat-hf model page.
I am running the model locally, and I give both the web API version and my local one the following prompt:
<<SYS>>
<large chunk of text with facts about apples>
<</SYS>>
[INST]
User: What are the nutrition facts about Apples
[/INST]
The response from the web API looks very nice and is exactly what I want, but the local one responds with weird source links and is overall not good.
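Since I am hand-writing the [INST] and <<SYS>> tags myself, I also did a quick sanity check where I let the tokenizer build the prompt instead, assuming my transformers version is new enough to have apply_chat_template and that I am calling it correctly (the system text and question here are just the placeholders from above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Placeholder messages standing in for my real system text and question
messages = [
    {"role": "system", "content": "<large chunk of text with facts about apples>"},
    {"role": "user", "content": "What are the nutrition facts about Apples"},
]

# Let the tokenizer apply the Llama-2 chat template instead of hand-writing the tags,
# then print it so I can compare it against my prompt.txt
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(chat_prompt)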
This is my code:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = get_file_contents("prompt.txt")  # same prompt as above

# Tokenize the input conversation
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a response from the model
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,  # without this, temperature/top_p/top_k are ignored and decoding is greedy
    temperature=0.7,
    repetition_penalty=1.5,
    top_p=0.9,
    top_k=50,
)

# Decode the model's output and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
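I also tried decoding only the newly generated tokens, in case the echoed prompt was part of what looked wrong to me (I am not sure this is the intended way to do it):

# Slice off the prompt tokens so only the newly generated part is decoded
prompt_length = inputs["input_ids"].shape[-1]
generated_only = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(generated_only)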
What am I doing wrong? Also, the resources on my PC don’t allow it to reach 32 GB of RAM; it caps at 25-27 GB. Could that be the issue, or does that only affect speed?
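In case the RAM part matters: I also tried loading the model in half precision to keep memory down, roughly like this. I don’t know whether that changes answer quality, or whether it is even a good idea on a CPU-only machine.

import torch
from transformers import AutoModelForCausalLM

# Load the weights in float16 to roughly halve RAM use compared to the default float32
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # as far as I know this needs the accelerate package installed
)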