Hello everyone!
I am trying to fine-tune LLaMa3.1 8B Instruct for text classification using prompt tuning from the peft library (prepending a trainable matrix of soft-prompt embeddings to the input embeddings). For instance, I am using the QNLI dataset to classify whether a question and a sentence are in an entailment relation.
I have preprocessed the dataset so that each example looks like the following:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
Classify the question and sentence pair into labels: entailment, not entailment. Reply only the corresponding label.
question: What is the name of a former Asian Portuguese colony?
sentence: The country has a tiny Chinese population.
label:<|eot_id|><|start_header_id|>assistant<|end_header_id|>
not entailment<|eot_id|>
In the test data, I remove the label and keep only the generation prompt (everything up to and including the assistant header).
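To make the train/test difference concrete, here is a minimal sketch of how the two variants could be built from the same fields. The function names (`build_prompt`, `build_training_text`) are hypothetical and not from my actual preprocessing code, and the line breaks simply follow the example above; in practice the exact whitespace should probably come from the tokenizer's chat template (`tokenizer.apply_chat_template(..., add_generation_prompt=True)`) rather than hand-written strings.

```python
# Hypothetical helpers; the layout mirrors the example shown in this post.
SYSTEM = "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n"
INSTRUCTION = (
    "Classify the question and sentence pair into labels: "
    "entailment, not entailment. Reply only the corresponding label."
)


def build_prompt(question: str, sentence: str) -> str:
    """Test-time text: ends with the assistant header (the generation prompt)."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"{SYSTEM}"
        "<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        f"{INSTRUCTION}\n"
        f"question: {question}\n"
        f"sentence: {sentence}\n"
        "label:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )


def build_training_text(question: str, sentence: str, label: str) -> str:
    """Training text: the same prompt followed by the gold label and <|eot_id|>."""
    return build_prompt(question, sentence) + f"{label}<|eot_id|>"


if __name__ == "__main__":
    q = "What is the name of a former Asian Portuguese colony?"
    s = "The country has a tiny Chinese population."
    print(build_training_text(q, s, "not entailment"))
```

Building both texts from one function at least guarantees that the test prompt is an exact prefix of the training text, so the soft prompt sees the same context in both settings.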
When I evaluate the zero-shot performance of LLaMa3.1 with the text-generation pipeline I get around 74% accuracy (which is already really good).
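For reference, this is roughly how the accuracy can be computed from generated strings. It is a sketch, not my exact evaluation code: `parse_label` and `accuracy` are hypothetical names, and the normalization rules are assumptions. One detail worth watching is that "entailment" is a substring of "not entailment", so the longer label has to be checked first.

```python
from typing import Optional, Sequence

# Longest label first, because "entailment" is a substring of "not entailment".
LABELS = ("not entailment", "entailment")


def parse_label(generated: str) -> Optional[str]:
    """Map raw generated text to one of the two labels, or None if unparseable."""
    text = generated.replace("<|eot_id|>", " ").replace("_", " ").lower()
    text = " ".join(text.split())  # collapse whitespace and strip the ends
    for label in LABELS:
        if text.startswith(label):
            return label
    return None


def accuracy(generations: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of generations whose parsed label equals the gold label."""
    if len(generations) != len(gold):
        raise ValueError("generations and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(parse_label(g) == y for g, y in zip(generations, gold))
    return correct / len(gold)


if __name__ == "__main__":
    outputs = ["not entailment<|eot_id|>", " Entailment.", "I am not sure"]
    targets = ["not entailment", "entailment", "entailment"]
    print(accuracy(outputs, targets))
```

Unparseable outputs count as wrong here, which means a formatting problem in the generations can look the same as a classification error in the final number.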
For training, I am using SFTTrainer and SFTConfig. You can find my full config here (there are also some custom parameters, but mostly nothing out of the ordinary).
The validation and training losses both drop below 1 after one epoch. When I evaluated on the test set again after that epoch, the accuracy was still around 74%.
My question is: does the LLaMa3.1 model ignore the soft prompt? Or was it just undertrained, and should I keep training for more epochs? Since the loss is decreasing, I suppose it is not ignoring the soft prompt during training.
Is it even possible to use prompt tuning with instruction-tuned models? I also tried manual hyperparameter tuning (mostly the learning rate and soft-prompt length), but the results were mostly the same.
If anybody has more experience with prompt tuning and instruction-tuned autoregressive models, I would be thankful if they could point me in the right direction.
Thank you!
Edit: added info about the pipelines that I thought might be useful.