mvonwyl
September 6, 2023, 3:50pm
1
Hello there,
I fine-tuned a GPTQ model using PEFT. More precisely, I used the TheBloke/Llama-2-7b-Chat-GPTQ
and fine-tuned it using the following lora config.
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["k_proj", "o_proj", "q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
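For context, this is roughly how the config was attached to the GPTQ base before training (a minimal sketch assuming the standard peft workflow; the actual training loop and data are omitted):

# Sketch of attaching the LoRA config above to the GPTQ base model.
# Assumes peft 0.5.0 with auto-gptq/optimum installed; not my exact training script.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, prepare_model_for_kbit_training

base_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

# Prepare the quantized model for k-bit training, then wrap it with the LoRA adapters.
base_model = prepare_model_for_kbit_training(base_model)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()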
When running inference with the trained model, it takes almost 30 seconds to generate a single token (vs. 0.15 s before fine-tuning).
import torch
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    peft_model,
    device_map="cuda",
)
model.eval()

inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        max_new_tokens=1,
    )
output = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
Note that we can’t use merge_and_unload()
with a GPTQ model.
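For reference, merging would normally look like the sketch below on a full-precision base (the adapter path is a placeholder); with a GPTQ-quantized base this isn't supported, so the adapter has to stay separate.

# Sketch: merging a LoRA adapter into a full-precision base model.
# "path/to/lora-adapter" is a placeholder; this does NOT work on a GPTQ base.
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/lora-adapter",
    device_map="cuda",
)
# merge_and_unload() folds the LoRA weights into the base and returns the plain model.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")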
Has anyone encountered the same issue?
mvonwyl
September 7, 2023, 3:24pm
2
Adding some information: inference is slow on a GTX 1080, but runs at normal speed on an A10. Both machines have the latest versions of transformers, peft, and optimum installed, as well as CUDA 12.
torch: 2.0.1
transformers: 4.33.1
peft: 0.5.0
optimum: 1.12.0
auto-gptq: 0.4.2
I was having a similar issue. My GPTQ + PEFT setup is still slower than the base model (5 s vs 3 s), but that's down from 30 s on my machine.
model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
adapter_folder = "./folder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config= GPTQConfig(bits=4, disable_exllama=False),
device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_folder)
Inference:
text = "[INST] Give me a quote about love [/INST]\n"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
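In case it helps with comparing numbers, this is roughly how I measure the per-request latency (plain wall-clock timing after a warm-up call, nothing fancy):

# Rough latency check: time a single generate() call after a warm-up run.
import time
import torch

model.generate(**inputs, max_new_tokens=1)  # warm-up
torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
print(f"generation took {time.perf_counter() - start:.2f} s")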
mvonwyl
September 14, 2023, 3:25pm
4
Thanks for answering.
What change did you make to get that reduction in inference time?
I'm running into the same problem, but with the Vicuna-13B-v1.5 model.