GPTQ+PEFT model running very slowly at inference

Hello there,

I fine-tuned a GPTQ model using PEFT. More precisely, I used TheBloke/Llama-2-7b-Chat-GPTQ and fine-tuned it with the following LoRA config.

    from peft import LoraConfig

    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["k_proj", "o_proj", "q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
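
For reference, the adapter was attached roughly like this (a simplified sketch rather than my exact training script; the exllama kernel is typically disabled while fine-tuning a GPTQ model):

    from transformers import AutoModelForCausalLM, GPTQConfig
    from peft import get_peft_model, prepare_model_for_kbit_training

    # Load the 4-bit GPTQ base with the exllama kernel turned off for training.
    base = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7b-Chat-GPTQ",
        quantization_config=GPTQConfig(bits=4, disable_exllama=True),
        device_map="auto",
    )

    # Prepare the quantized model for training and attach the LoRA adapter defined above.
    base = prepare_model_for_kbit_training(base)
    model = get_peft_model(base, config)
    model.print_trainable_parameters()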

When running inference with the trained model, it takes almost 30 seconds to generate a single token (vs. 0.15 seconds before fine-tuning).

    import torch
    from peft import AutoPeftModelForCausalLM

    # peft_model points to the saved adapter checkpoint directory.
    model = AutoPeftModelForCausalLM.from_pretrained(
        peft_model,
        device_map="cuda",
    )
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs.input_ids,
            max_new_tokens=1,
        )
    output = tokenizer.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )[0]
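
The per-token timing can be reproduced with a simple wrapper around the generate call (a sketch, not my exact benchmark; the warm-up call and torch.cuda.synchronize() keep CUDA startup and async kernel launches out of the measurement):

    import time
    import torch

    # Warm-up so one-time CUDA setup is not counted.
    with torch.no_grad():
        model.generate(input_ids=inputs.input_ids, max_new_tokens=1)

    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(input_ids=inputs.input_ids, max_new_tokens=1)
    torch.cuda.synchronize()
    print(f"1 new token in {time.perf_counter() - start:.2f} s")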

Note that we can’t use merge_and_unload() with a GPTQ model.
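
For comparison, with an unquantized fp16 checkpoint the adapter overhead can be removed entirely by merging the LoRA weights into the base model; the model id and adapter path below are placeholders:

    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Merging only works when the base weights are plain fp16/bf16 tensors;
    # GPTQ's packed 4-bit weights cannot absorb the LoRA deltas, which is why
    # merge_and_unload() is unavailable here.
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",  # placeholder: the unquantized chat checkpoint
        torch_dtype=torch.float16,
        device_map="cuda",
    )
    merged = PeftModel.from_pretrained(base, "./adapter_folder").merge_and_unload()
    merged.eval()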

Has anyone encountered the same issue?

Additional information: inference is slow on a GTX 1080 but runs at normal speed on an A10. Both machines have the latest versions of transformers, peft, and optimum installed, as well as CUDA 12.

torch: 2.0.1
transformers: 4.33.1
peft: 0.5.0
optimum: 1.12.0
auto-gptq: 0.4.2
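
In case it helps anyone compare hardware, the two GPUs can be inspected directly from torch; the GTX 1080 is Pascal (compute capability 6.1), which has much slower native fp16 throughput than the Ampere-based A10 (8.6), and that may be part of the gap:

    import torch

    print(torch.cuda.get_device_name(0))        # "NVIDIA GeForce GTX 1080" vs "NVIDIA A10"
    print(torch.cuda.get_device_capability(0))  # (6, 1) on the 1080, (8, 6) on the A10
    print(torch.version.cuda)                   # CUDA version torch was built against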

I was having a similar issue. My GPTQ + PEFT model is still slower than the base model (5 s vs. 3 s), but generation time went down from 30 s on my machine.

    from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
    from peft import PeftModel

    model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
    adapter_folder = "./folder"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # disable_exllama=False keeps the exllama kernels enabled for the 4-bit layers.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=GPTQConfig(bits=4, disable_exllama=False),
        device_map="auto",
    )
    model = PeftModel.from_pretrained(model, adapter_folder)

Inference:

    text = "[INST] Give me a quote about love [/INST]\n"
    device = "cuda:0"

    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Thanks for answering.

What change did you apply to get this reduction in generation time?

I'm running into the same problem, but with the Vicuna-13B-v1.5 model.