DPOTrainer consumes lots of VRAM

Hello all, I am currently trying to run DPO fine-tuning on Phi-3-mini-128k. The DPOTrainer takes more than 30 GB of VRAM for this model. I am using a Kaggle notebook with 2 Tesla T4 GPUs (15 GB each). Here is my training configuration:

import transformers
from trl import DPOTrainer

training_params = transformers.TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    warmup_steps=2,
    learning_rate=5e-5,
    fp16=False,
    logging_steps=4,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    gradient_checkpointing=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_params,
    beta=0.01,
    train_dataset=raw_datasets["train"],
    tokenizer=tokenizer,
    peft_config=lora_config,
    max_prompt_length=7000,  # maximum prompt length in tokens
    max_length=8192,         # maximum total (prompt + completion) length in tokens
    reference_free=True,
)
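
For context, `model`, `tokenizer`, and `lora_config` are created roughly along the lines of the sketch below; the 4-bit quantization settings, LoRA hyperparameters, and target module names shown here are illustrative assumptions, not the exact values from my notebook.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig

model_id = "microsoft/Phi-3-mini-128k-instruct"

# Load the base weights in 4-bit NF4 so they fit on the T4s (assumed setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# LoRA adapter that DPOTrainer attaches via peft_config (values are placeholders)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)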

Am I doing anything wrong here? Can someone explain how to resolve this?