We deployed both the llama2_7b and llama2_7b_hf models on a local network to compare their performance. Even though we asked the same questions, the two models returned different results; in my case, llama2_7b gave more satisfying answers.
The implementation code for llama2_7b_hf is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Download the chat checkpoint from the Hugging Face Hub and save local copies.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer.save_pretrained("Llama2-7b-tokenizer")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.save_pretrained("Llama2-7b-model")
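A possibly related detail: as far as I understand, the chat checkpoint expects Llama-2's [INST] prompt template, and a bare prompt string may confuse it. Here is a sketch of that template as I understand it (the helper function is my own, not code from my run, so please correct me if the format is wrong):

```python
def format_llama2_chat(user_msg, system_msg=None):
    # Llama-2 chat format as I understand it: <s>[INST] ... [/INST],
    # with an optional <<SYS>> system block inside the first user turn.
    if system_msg is not None:
        return f"<s>[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"<s>[INST] {user_msg} [/INST]"

prompt = format_llama2_chat("What is the capital of South Korea?")
print(prompt)
```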
When implementing llama2_7b, we converted the original Meta checkpoint to the Hugging Face format using the conversion script that ships with transformers:

python convert_llama_weights_to_hf.py
We then asked both models the same question:

prompt = "What is the capital of South Korea?"

and the results differed: the hf model gave a strange answer, while the 7b model answered correctly.
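To make "different answers" concrete: my understanding is that if generate() is called with do_sample=True, two runs can differ even with identical weights, while greedy decoding (do_sample=False) is deterministic. A toy sketch in plain Python of that distinction (toy probabilities, no model needed; the helper names are mine):

```python
import random

probs = [0.55, 0.25, 0.15, 0.05]  # toy next-token distribution

def greedy(p):
    # Greedy decoding: always pick the highest-probability token.
    return p.index(max(p))

def sample(p, rng):
    # Sampling: draw a token index according to the probabilities.
    return rng.choices(range(len(p)), weights=p, k=1)[0]

# Greedy decoding is deterministic across runs:
assert greedy(probs) == greedy(probs) == 0

# Sampling is only reproducible when the RNG is seeded identically:
assert sample(probs, random.Random(0)) == sample(probs, random.Random(0))
```

I am not sure which decoding settings each of my two setups actually used, which is partly why I am asking.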
I thought these two models were the same, so why do I get different answers to the same question? Is there anything I missed?