LLaMa3.1 8B Instruct Prompt Tuning for Text Classification doesn't improve test accuracy

rbelanec · September 26, 2024, 5:33pm

Hello everyone!

I am trying to fine-tune LLaMa3.1 8B Instruct for text classification using prompt tuning from the peft library (prepending a trainable matrix to the input embeddings). As an example, I am using the QNLI dataset to classify whether a question and a sentence are in entailment.
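
As a rough illustration of the idea in parentheses above, here is a dependency-free toy sketch. It is not peft's implementation: the helper `prepend_soft_prompt`, the variable names, and the tiny sizes are all made up for this example. A small matrix of trainable "virtual token" vectors is placed in front of the frozen token embeddings, and the attention mask is extended to cover it.

```python
# Toy sketch of the bookkeeping behind prompt tuning, using plain lists.
# Hypothetical names and sizes; a real setup uses tensors and a trained prompt.

def prepend_soft_prompt(soft_prompt, token_embeddings, attention_mask):
    """Return (embeddings, mask) with the soft prompt placed before the input."""
    hidden = len(soft_prompt[0])
    if any(len(row) != hidden for row in soft_prompt + token_embeddings):
        raise ValueError("all vectors must share the same hidden size")
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("mask length must match the number of input tokens")
    # Virtual tokens are always attended to, so their mask entries are 1.
    new_embeddings = soft_prompt + token_embeddings
    new_mask = [1] * len(soft_prompt) + attention_mask
    return new_embeddings, new_mask


# 2 virtual tokens and 3 real tokens, hidden size 4.
soft_prompt = [[0.1] * 4, [0.2] * 4]                  # the only trainable part
token_embeddings = [[1.0] * 4, [2.0] * 4, [3.0] * 4]  # from the frozen model
attention_mask = [1, 1, 1]

embeddings, mask = prepend_soft_prompt(soft_prompt, token_embeddings, attention_mask)
print(len(embeddings), mask)
```

During training only the soft prompt rows would receive gradient updates; the base model's weights stay frozen.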

I have preprocessed the dataset so that each example looks like the following:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Classify the question and sentence pair into labels: entailment, not entailment. Reply only the corresponding label.
question: What is the name of a former Asian Portuguese colony?
sentence: The country has a tiny Chinese population.
label:<|eot_id|><|start_header_id|>assistant<|end_header_id|>

not entailment<|eot_id|>

In the test data, I remove the label and leave only the generation prompt.
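
The train/test difference can be sketched as follows. This assumes the examples are assembled by hand as plain strings. The helper `build_example` is hypothetical and not taken from the poster's preprocessing code; in practice the tokenizer's chat template could produce the same layout.

```python
# Hypothetical helper that mirrors the example layout shown above.
SYSTEM_HEADER = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Cutting Knowledge Date: December 2023\n"
    "Today Date: 26 Jul 2024\n\n"
    "<|eot_id|>"
)
INSTRUCTION = (
    "Classify the question and sentence pair into labels: entailment, "
    "not entailment. Reply only the corresponding label."
)


def build_example(question, sentence, label=None):
    """Build a training string (label given) or a test prompt (label=None)."""
    user_turn = (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{INSTRUCTION}\nquestion: {question}\nsentence: {sentence}\n"
        "label:<|eot_id|>"
    )
    generation_prompt = "<|start_header_id|>assistant<|end_header_id|>\n\n"
    text = SYSTEM_HEADER + user_turn + generation_prompt
    if label is not None:
        # Training examples end with the gold label and an end-of-turn token.
        text += f"{label}<|eot_id|>"
    return text


question = "What is the name of a former Asian Portuguese colony?"
sentence = "The country has a tiny Chinese population."

train_text = build_example(question, sentence, label="not entailment")
test_text = build_example(question, sentence)  # stops at the generation prompt
print(test_text)
```

The test prompt is a strict prefix of the training string, so the model sees the same context at inference time that it was trained to continue.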

When I evaluate the zero-shot performance of LLaMa3.1 with the text-generation pipeline I get around 74% accuracy (which is already really good).
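
When accuracy does not move, one thing worth ruling out is the scoring itself: "entailment" is a substring of "not entailment", so a naive substring check can miscount predictions. Below is a minimal sketch of stricter scoring, assuming each generated string contains only the model's reply. The helpers `normalize_label` and `accuracy` are hypothetical and are not the poster's evaluation script.

```python
# Hypothetical scoring helpers for the two QNLI labels used in the prompt.
LABELS = ("not entailment", "entailment")  # check the longer label first


def normalize_label(generated):
    """Map raw generated text to one of LABELS, or None if nothing matches."""
    text = generated.strip().lower().replace("_", " ")
    for label in LABELS:
        if text.startswith(label):
            return label
    return None


def accuracy(predictions, references):
    """Fraction of predictions whose normalized label equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize_label(pred) == ref for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


predictions = ["entailment", " Not entailment.", "I think so", "entailment"]
references = ["entailment", "entailment", "entailment", "entailment"]
print(accuracy(predictions, references))
```

Unparseable replies count as wrong here; whether that is the right choice depends on how the baseline was scored.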

For training, I am using SFTTrainer and SFTConfig. You can find my full config here (there are also some custom parameters, but mostly nothing out of the ordinary).

The validation and training losses both drop below 1 after one epoch. After that epoch I evaluated on the test set again, and the accuracy was still around 74%.

My question is: does the LLaMa3 model ignore the soft prompt? Or was it just undertrained, and should I keep training it for more epochs? Since the loss is decreasing, I suppose that it is not ignoring it during training.

Is it even possible to use prompt tuning with instruction fine-tuned models? I also tried manual hyperparameter tuning (mostly learning rate and soft-prompt length), but the results were mostly the same.

If anybody has more experience with prompt tuning and instruction-tuned autoregressive models I would be thankful if they could point me in the right direction.

Thank you!

edits: added info about pipelines that I thought may be useful

rbelanec · September 30, 2024, 12:32pm

So I was able to fix this issue!

I had been using version 0.12.0 of the peft library. In this version, the problem was in the implementation of prepare_inputs_for_generation in the PeftModelForCausalLM class. As you can see from this line, the soft prompt is only added when past_key_values is None. I fixed this by using use_cache=False and a quick if statement:

if model_kwargs["use_cache"] == False:
    model_kwargs["past_key_values"] = None
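
To make the effect of that if statement concrete, here is a toy model of the condition described above. It is not peft's source code: both functions are hypothetical stand-ins, and it assumes that a cache object (possibly empty) is already present in model_kwargs when generation starts.

```python
# Toy model of the behaviour described in the post; NOT peft's source code.

def soft_prompt_is_injected(model_kwargs):
    """Mimic the reported condition: inject only when no cache is passed."""
    return model_kwargs.get("past_key_values") is None


def apply_workaround(model_kwargs):
    """The fix from the post: with caching disabled, drop any cache object."""
    patched = dict(model_kwargs)
    if patched.get("use_cache") is False:
        patched["past_key_values"] = None
    return patched


# Assumed situation: a cache object is present even though use_cache=False.
kwargs = {"use_cache": False, "past_key_values": object()}

before = soft_prompt_is_injected(kwargs)
after = soft_prompt_is_injected(apply_workaround(kwargs))
print(before, after)
```

Under that assumption the soft prompt is skipped without the patch and injected with it, which would be consistent with the training loss going down while generation-time accuracy stays at the zero-shot level.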

In version 0.13.0 this has already been fixed by adding a new flag, require_prompt_injection, which is defined in this line. This makes it possible to use soft-prompt-based methods with caches as well. I haven't tested it yet, though. I hope this helps someone; it took me a day or so to find out.
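
A quick way to check which side of the fix an environment is on is to read the installed peft version. The sketch below uses only the standard library. The helpers `parse_release` and `has_prompt_injection_fix` are hypothetical, and the 0.13.0 threshold is taken from the post above rather than verified independently.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '0.13.0' or '0.13.0.dev0' into a comparable tuple like (0, 13, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # treat '0.13' like '0.13.0'
    return tuple(parts)


def has_prompt_injection_fix(version_string):
    """True if the version is at least 0.13.0, the release named in the post."""
    return parse_release(version_string) >= (0, 13, 0)


try:
    installed = version("peft")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("peft is not installed")
else:
    print(installed, has_prompt_injection_fix(installed))
```

On anything older than 0.13.0, the use_cache=False workaround above is the fallback.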

John6666 · September 30, 2024, 12:54pm

There are a lot of people using that version of the PEFT library in Spaces, including with LLMs.
The good news is that there is a fixed version.

system · October 1, 2024, 12:54am

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.
