Llama-2-70b-chat-hf gets worse results than Llama-2-70B-Chat-GPTQ

System Info

  • transformers version: 4.36.0
  • Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
  • Python version: 3.9.18
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response


  • The official example scripts
  • My own modified scripts


  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)


I am trying to use Llama-2-70b-chat-hf as a zero-shot text classifier for my datasets. Here are my setups.

  1. vLLM + Llama-2-70b-chat-hf
    I used vLLM as my inference engine and ran it with:
python api_server.py     --model /nas/lili/models_hf/70B-chat     --tensor-parallel-size 8

api_server.py is the example file; I did not modify anything.

client code:

        data = {
            "prompt": prompt,
            "use_beam_search": False,
            "n": 1,
            "temperature": 0.1,
            "max_tokens": 128,
        }
        res = _post(data)
        # json.loads (needs "import json") is safer than eval on the response body
        return json.loads(res.content)['text'][0].strip()

And my prompt is:

You will be provided with a product name. The product name will be delimited by 3 backticks, i.e.```. 
Classify the product into a primary category.

Primary categories: 
Clothing, Shoes & Jewelry
Home & Kitchen
Beauty & Personal Care
Sports & Outdoors
Patio, Lawn & Garden
Handmade Products
Grocery & Gourmet Food
Health & Household
Musical Instruments
Toys & Games
Baby Products
Pet Supplies
Tools & Home Improvement
Office Products
Cell Phones & Accessories

Product name:```Cambkatl Men's Funny 3D Fake Abs T-Shirts Casual Short Sleeve Chest Graphic Printed Crewneck Novelty Pullover Tee Tops```.

Only answer the category name, no other words. 
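For reference, the prompt above is produced by a small template function along these lines (the helper name is mine):

```python
# The category list used in my experiments.
CATEGORIES = [
    "Clothing, Shoes & Jewelry",
    "Home & Kitchen",
    "Beauty & Personal Care",
    "Sports & Outdoors",
    "Patio, Lawn & Garden",
    "Handmade Products",
    "Grocery & Gourmet Food",
    "Health & Household",
    "Musical Instruments",
    "Toys & Games",
    "Baby Products",
    "Pet Supplies",
    "Tools & Home Improvement",
    "Office Products",
    "Cell Phones & Accessories",
]

def build_prompt(product_name: str) -> str:
    """Format the zero-shot classification prompt for one product name."""
    categories = "\n".join(CATEGORIES)
    return (
        "You will be provided with a product name. The product name will be "
        "delimited by 3 backticks, i.e.```. \n"
        "Classify the product into a primary category.\n\n"
        f"Primary categories: \n{categories}\n\n"
        f"Product name:```{product_name}```.\n\n"
        "Only answer the category name, no other words. "
    )
```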

The classification accuracy is 0.352. I also tried the same prompt and parameters (temperature and max_tokens) with ChatGPT and GPT-4; they got 0.68 and 0.72 respectively.
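Accuracy here means an exact string match between the model's reply and the gold category over the dataset; a minimal sketch of the metric (the function name is hypothetical, not from the scripts above):

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of model replies that exactly match the gold category name."""
    assert len(predictions) == len(labels)
    correct = sum(
        pred.strip() == gold.strip() for pred, gold in zip(predictions, labels)
    )
    return correct / len(labels)
```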

Llama 2 shouldn’t be significantly worse than ChatGPT, so something must be wrong. I suspected it might be related to vLLM, so I tried the following method.

  2. Transformers + Flask
    It’s not a good serving method (maybe I should use TGI), but it makes the problem easy to isolate.
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer_path = "/nas/lili/models_hf/70B-chat-hf/"
model_path = "/nas/lili/models_hf/70B-chat-hf/"

tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)

# load in fp16, sharded across the available GPUs
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

from flask import Flask, request, jsonify
from flask_cors import CORS
from transformers.generation import GenerationConfig

app = Flask(__name__)

@app.route('/generate', methods=['POST'])
def generate():
    payload = request.get_json(force=True)  # avoid shadowing the json module
    prompt = payload['prompt']
    num_beams = payload.get('num_beams')
    temperature = payload.get('temperature')
    max_tokens = payload.get('max_tokens')
    do_sample = payload.get('do_sample')
    top_k = payload.get('top_k') or 10
    model_inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    cfg = GenerationConfig(
        num_beams = num_beams,
        max_new_tokens = max_tokens,
        temperature = temperature,
        do_sample = do_sample,
        top_k = top_k,
    )
    output = model.generate(**model_inputs, generation_config=cfg, pad_token_id=tokenizer.eos_token_id)
    input_length = model_inputs["input_ids"].shape[1]
    output = tokenizer.decode(output[0][input_length:], skip_special_tokens=True)
    output = output.strip()

    return jsonify({'text': [output]})

if __name__ == '__main__':
    app.run(host='', port=5000)

And the client code:

        data = {
            "prompt": prompt,
            "do_sample": True,
            "temperature": 0.1,
            "max_tokens": 128,
        }
        res = _post(data, url=self.url)
        # json.loads (needs "import json") is safer than eval on the response body
        return json.loads(res.content)['text'][0].strip()

This time I used a large num_beams=5 (I should have used 1, but I made a mistake).
I used the same prompt as before, and the accuracy is 0.368. That’s not much better than with vLLM (the small gain may come from the larger num_beams).

So it does not seem to be a problem with vLLM. Then what’s wrong? Is Llama 2 70B a very bad model? I don’t think so. So I tried a third method.

  3. Transformers (using Llama-2-70B-Chat-GPTQ) + Flask

The setup is the same as in method 2; I only changed the model paths:

tokenizer_path = "/nas/lili/models_hf/7B-chat/"
model_path = "/nas/lili/models_hf/Llama-2-70B-chat-GPTQ/"

I saved Llama-2-70B-chat-GPTQ with save_pretrained but forgot to save the tokenizer, so I used the tokenizer of Llama 2 7B-chat (I believe the Llama 2 tokenizer is the same across model sizes). This time I got a better result: 0.56. It’s not as good as ChatGPT, but it is significantly better than the uncompressed Llama-2-70B-chat.
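For reference, the GPTQ checkpoint loads through the quantization config stored with the model, so no extra arguments are needed beyond the path. A sketch of how I load it, assuming the optimum and auto-gptq packages are installed (the imports are inside the function so nothing heavy runs at import time):

```python
# Paths from the setup above; the 7B tokenizer is reused as described.
GPTQ_MODEL_PATH = "/nas/lili/models_hf/Llama-2-70B-chat-GPTQ/"
TOKENIZER_PATH = "/nas/lili/models_hf/7B-chat/"

def load_gptq(model_path=GPTQ_MODEL_PATH, tokenizer_path=TOKENIZER_PATH):
    """Load the GPTQ checkpoint and the reused 7B tokenizer.

    transformers 4.36 picks up the quantization config saved with the
    model, so from_pretrained needs no explicit GPTQ arguments
    (it does require the optimum and auto-gptq packages).
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    return tokenizer, model
```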

So I am confused: the original Llama-2-70B-chat is about 20 points worse than Llama-2-70B-chat-GPTQ, even though methods 2 and 3 are exactly the same except for the model.

Expected behavior

Llama-2-70B-chat should get similar or better results than Llama-2-70B-Chat-GPTQ.