I am fine-tuning Llama 2 using LoRA and QLoRA to compare the two. I first trained with LoRA, using a special end token <|end|> so that the model knows when to stop. With LoRA fine-tuning this works fine, and the model predicts the <|end|> token. Keeping the training configuration the same apart from the 4-bit quantization for QLoRA, I see that the model no longer predicts <|end|>.
Also, when I prepare the PEFT model, I first pass the loaded model through prepare_model_for_kbit_training and then call get_peft_model. Do I need prepare_model_for_kbit_training when I use 4-bit quantization in QLoRA? I ask because if I skip it, I get a CUDA OOM error. Everything else, such as the batch size and all other parameters, is kept the same for LoRA and QLoRA.
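For reference, the loading order described above can be sketched roughly as follows. This is a minimal sketch assuming the Hugging Face transformers, peft, and bitsandbytes stack. The function name build_qlora_model, the LoRA hyperparameters, the target modules, and the NF4/bfloat16 settings are illustrative placeholders, not necessarily the exact values used in these runs.

```python
# Placeholder LoRA settings (assumptions for illustration only).
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
    "task_type": "CAUSAL_LM",
}


def build_qlora_model(model_name):
    """Load a causal LM in 4-bit and wrap it with LoRA adapters.

    model_name is whatever checkpoint is being fine-tuned (not fixed here).
    Imports are done inside the function so that defining it does not
    require a GPU or the heavy libraries to be loaded up front.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # 4-bit quantization config (NF4 and bfloat16 compute are assumed choices).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )

    # Step 1: prepare the quantized base model for training. With its default
    # arguments this also turns on gradient checkpointing.
    model = prepare_model_for_kbit_training(model)

    # Step 2: attach the LoRA adapters on top of the prepared model.
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Calling build_qlora_model with the checkpoint name would return the PEFT-wrapped model that is then passed to the trainer. The plain LoRA run would skip the quantization config and the prepare step.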
What could be the reason for the lower accuracy with QLoRA? As I understand it, QLoRA reduces GPU memory usage, but does it also affect model performance?
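To make the performance part of the question concrete, here is a small sketch of the rounding error that 4-bit storage introduces. It assumes a simple uniform absmax scheme on one block of weights, which is a simplification and not the NF4 data type that QLoRA normally uses. The block size of 64 and the random weights are hypothetical.

```python
import random


def quantize_4bit_absmax(block):
    """Map floats to signed integer codes in [-7, 7] using one shared scale."""
    absmax = max(abs(x) for x in block) or 1.0  # avoid a zero scale
    scale = absmax / 7.0
    codes = [max(-7, min(7, round(x / scale))) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


random.seed(0)
# One hypothetical block of 64 small weights.
weights = [random.gauss(0.0, 0.02) for _ in range(64)]

codes, scale = quantize_4bit_absmax(weights)
restored = dequantize(codes, scale)

# The per-weight error is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.6f} max_abs_error={max_err:.6f}")
```

The restored weights are close to the originals but not identical, so the frozen base model seen during QLoRA training is a slightly perturbed version of the one used for plain LoRA.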