System Info
- transformers version: 4.36.0
- Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
- Python version: 3.9.18
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I am trying to use Llama-2-70b-chat-hf as a zero-shot text classifier for my datasets. Here are my setups.
- vLLM + Llama-2-70b-chat-hf
I used vLLM as my inference engine and ran it with:
python api_server.py --model /nas/lili/models_hf/70B-chat --tensor-parallel-size 8
api_server.py is the example file from vLLM and I did not modify anything.
Client code:
data = {
    "prompt": prompt,
    "use_beam_search": False,
    "n": 1,
    "temperature": 0.1,
    "max_tokens": 128,
}
res = _post(data)
return eval(res.content)['text'][0].strip()
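For completeness, _post is just a thin wrapper around requests.post; a rough sketch (the URL and port are assumptions based on the vLLM example server defaults, the real helper is not shown here):

import requests

# Hypothetical helper; the vLLM example api_server serves POST /generate (default port 8000).
def _post(data, url="http://localhost:8000/generate"):
    return requests.post(url, json=data)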
And my prompt is:
You will be provided with a product name. The product name will be delimited by 3 backticks, i.e.```.
Classify the product into a primary category.
Primary categories:
Clothing, Shoes & Jewelry
Automotive
Home & Kitchen
Beauty & Personal Care
Electronics
Sports & Outdoors
Patio, Lawn & Garden
Handmade Products
Grocery & Gourmet Food
Health & Household
Musical Instruments
Toys & Games
Baby Products
Pet Supplies
Tools & Home Improvement
Appliances
Office Products
Cell Phones & Accessories
Product name:```Cambkatl Men's Funny 3D Fake Abs T-Shirts Casual Short Sleeve Chest Graphic Printed Crewneck Novelty Pullover Tee Tops```.
Only answer the category name, no other words.
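Roughly, this prompt is assembled per product name as below (a sketch; the helper name and the way the category list is stored are illustrative, not my exact script):

CATEGORIES = [
    "Clothing, Shoes & Jewelry", "Automotive", "Home & Kitchen",
    # ... the rest of the categories listed above
]

def build_prompt(product_name):
    # Assemble the zero-shot classification prompt shown above for one product.
    category_block = "\n".join(CATEGORIES)
    return (
        "You will be provided with a product name. The product name will be "
        "delimited by 3 backticks, i.e.```.\n"
        "Classify the product into a primary category.\n"
        f"Primary categories:\n{category_block}\n"
        f"Product name:```{product_name}```.\n"
        "Only answer the category name, no other words."
    )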
The classification accuracy is 0.352. I also tried the same prompt and parameters (temperature and max_tokens) with ChatGPT and GPT-4, which got 0.68 and 0.72 respectively.
Llama 2 shouldn't be that much worse than ChatGPT, so something must be wrong. I suspected it might be related to vLLM, so I tried the following method.
- Transformers + Flask
It's not a good serving method (maybe I should use TGI), but I think it makes the problem easier to locate.
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer_path = "/nas/lili/models_hf/70B-chat-hf/"
model_path = "/nas/lili/models_hf/70B-chat-hf/"
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    # load_in_8bit=True,
    # torch_dtype=torch.float16,
    device_map="auto",
)
from flask import Flask, request, jsonify
from flask_cors import CORS
from transformers.generation import GenerationConfig

app = Flask(__name__)
CORS(app)

@app.route('/generate', methods=['POST'])
def generate():
    json = request.get_json(force=True)
    prompt = json['prompt']
    num_beams = json.get('num_beams')
    temperature = json.get('temperature')
    max_tokens = json.get('max_tokens')
    do_sample = json.get('do_sample')
    top_k = json.get('top_k') or 10
    model_inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    cfg = GenerationConfig(
        num_beams=num_beams,
        max_new_tokens=max_tokens,
        temperature=temperature,
        do_sample=do_sample,
        top_k=top_k,
    )
    output = model.generate(**model_inputs, generation_config=cfg, pad_token_id=tokenizer.eos_token_id)
    input_length = model_inputs["input_ids"].shape[1]
    output = tokenizer.decode(output[0][input_length:], skip_special_tokens=True)
    output = output.strip()
    return jsonify({'text': [output]})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
And the client code:
data = {
    "prompt": prompt,
    "do_sample": True,
    "temperature": 0.1,
    "max_tokens": 128,
    "num_beams": 5,
}
res = _post(data, url=self.url)
return eval(res.content)['text'][0].strip()
This time I used a large num_beams=5 (I should have used 1, but I made a mistake).
I used the same prompt as before, and the accuracy is 0.368. That is not much better than with vLLM (the small gain may come from the larger num_beams).
Now it seems the problem is not with vLLM. So what is wrong? Is Llama 2 70B just a bad model? I don't think so, so I tried a third method.
- Transformers (using Llama-2-70B-chat-GPTQ) + Flask
The setup is the same as method 2; I only changed the model paths:
tokenizer_path = "/nas/lili/models_hf/7B-chat/"
model_path = "/nas/lili/models_hf/Llama-2-70B-chat-GPTQ/"
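The loading code itself is unchanged from method 2; here is a minimal sketch of how the GPTQ checkpoint is loaded (this assumes auto-gptq and optimum are installed so that transformers picks up the quantization_config saved in the checkpoint):

from transformers import LlamaTokenizer, LlamaForCausalLM

# 7B-chat tokenizer used as a stand-in (the GPTQ checkpoint's tokenizer was not saved, see below)
tokenizer = LlamaTokenizer.from_pretrained("/nas/lili/models_hf/7B-chat/")
# from_pretrained reads the quantization_config stored by save_pretrained and
# loads the GPTQ-quantized weights, sharding them across GPUs via device_map
model = LlamaForCausalLM.from_pretrained(
    "/nas/lili/models_hf/Llama-2-70B-chat-GPTQ/",
    device_map="auto",
)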
I saved Llama-2-70B-chat-GPTQ with save_pretrained but forgot to save the tokenizer, so I used the Llama 2 7B-chat tokenizer instead (I believe the Llama 2 tokenizer is the same across model sizes). This time I got a better result of 0.56. It is not as good as ChatGPT, but it is significantly better than the uncompressed Llama-2-70B-chat.
So I am confused that the original Llama-2-70B-chat is about 20 points worse than Llama-2-70B-chat-GPTQ. Method 2 and method 3 are exactly the same except for the model.
Expected behavior
Llama-2-70B-chat should get a similar or better result than Llama-2-70B-chat-GPTQ.