CUDA out of memory when fine-tuning Qwen2-0.5B for classification

I’m fine-tuning Qwen2-0.5B for classification, and my dataset consists of Reddit posts. Somehow, during training it keeps raising a CUDA out-of-memory error. I even cast the model to 16-bit floating point, but it still gives the OOM error.

Here is my code for the model and trainer:

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import torch


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
model.half()  # cast the weights to fp16
print(model)
model.to(device)

training_args = TrainingArguments(
    output_dir="kaggle/input/output_dir",
    do_train=True,
    do_eval=False,
    eval_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=100,
    weight_decay=0.01,
    eval_accumulation_steps=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)
trainer.train()

I’m already using a batch size of 1 and eval_accumulation_steps=1. The GPU is a P100 or T4 (the two options Kaggle provides).
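
In case it’s relevant, the train/val datasets are built roughly like this (a simplified sketch, not my exact preprocessing; the toy data and the max_length of 512 are placeholders):

from datasets import Dataset

# Toy stand-in for the real Reddit data (placeholder, not the actual dataset)
raw = {"text": ["an example reddit post", "another post"], "label": [0, 1]}

def tokenize_fn(batch):
    # Truncate so long posts don't produce very long sequences
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = Dataset.from_dict(raw).map(tokenize_fn, batched=True)
val_dataset = Dataset.from_dict(raw).map(tokenize_fn, batched=True)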

Please help meeeeeeeeee! <3