QLoRA with GPTQ

I’m having problems fine-tuning pre-quantised models: the training loss is sometimes 0 and the validation loss is nan, which makes me suspect an overflow issue.
Does anyone see anything obviously wrong with the way I am training my model?

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          GPTQConfig, TrainingArguments)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer

# Load the pre-quantised checkpoint (exllama kernels disabled, as required for fine-tuning)
config = GPTQConfig(bits=4, disable_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    quantization_config=config,
    device_map="auto",
    torch_dtype="auto",
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7b-Chat-GPTQ")

peft_config = LoraConfig(task_type="CAUSAL_LM", r=64, lora_alpha=16)

...

training_args = TrainingArguments(fp16=True, optim="paged_adamw_32bit", ...)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=peft_config,
    ...
)

trainer.train()

#####

{'loss': 0.0, 'learning_rate': 0.0004953917050691245, 'epoch': 0.27}
{'loss': 0.0, 'learning_rate': 0.0004953917050691245, 'epoch': 0.28}
{'loss': 2.0689, 'learning_rate': 0.0004953917050691245, 'epoch': 0.28}
{'eval_loss': nan, 'eval_runtime': 149.173, 'eval_samples_per_second': 0.597, 'eval_steps_per_second': 0.302, 'epoch': 0.28}

#####

UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.

It is working now after setting `tokenizer.padding_side = 'right'`. But why does the padding side affect overflow in half-precision training?
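For completeness, this is the change that fixed it (a minimal sketch; the pad-token line is my own addition, since the Llama-2 tokenizer ships without a pad token):

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7b-Chat-GPTQ")
tokenizer.padding_side = "right"  # what the SFTTrainer warning asks for
tokenizer.pad_token = tokenizer.eos_token  # assumption: no pad token defined, so reuse EOS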

I’d also like to know why.

Here, however, the docs explicitly say to use `padding_side='left'`: Generation with LLMs
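If I read the docs right, the two recommendations cover different phases rather than contradicting each other: SFTTrainer wants right padding during training, while the generation guide wants left padding so that each prompt in a batch ends at the last position and generate() continues from real tokens instead of pads. A minimal sketch of the generation side, reusing the model and tokenizer from above (the prompts are placeholders):

# Batched generation: pad on the left, so every prompt ends at the final
# position and generate() continues from real tokens rather than padding.
tokenizer.padding_side = "left"
prompts = ["Hello, my name is", "The capital of France is"]  # placeholder prompts
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))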