Memory error while fine-tuning the quantized sarvamai/OpenHathi-7B-Hi-v0.1-Base model

Hello,

I am trying to fine-tune the sarvamai/OpenHathi-7B-Hi-v0.1-Base model. While fine-tuning its quantized (GPTQ 4-bit) version, I am getting the following error:

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-37-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

29 frames
/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py in forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, **kwargs)
    388         value_states = repeat_kv(value_states, self.num_key_value_groups)
    389 
--> 390         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
    391 
    392         if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):

OutOfMemoryError: CUDA out of memory. Tried to allocate 406.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 266.81 MiB is free. Process 64479 has 39.29 GiB memory in use. Of the allocated memory 37.52 GiB is allocated by PyTorch, and 1.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
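
The end of the traceback suggests tuning the allocator via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of how that environment variable could be set before loading the model (the 512 MB value is only an example, not something I have verified helps here):

import os

# Allocator hint from the error message; it must be set before the first CUDA
# allocation, i.e. before the model is loaded. 512 MB is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"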

Here is the code that I have tried.

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, TrainingArguments

# model_id = "sarvamai/OpenHathi-7B-Hi-v0.1-Base"
model_id = "openhathi-gptq-4bit"  # GPTQ 4-bit quantized model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# OpenHathi is Llama-based, so load its config with AutoConfig
configuration = AutoConfig.from_pretrained(model_id)
configuration.output_hidden_states = True

training_arguments = TrainingArguments(
    # output_dir="/content/drive/MyDrive/CB/LLM/Falcon-7b-MCQ-sample_dataset-model/finetuned_model/SFT_tuning_with_first_two_modules"
    output_dir="/content/drive/MyDrive/",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    evaluation_strategy="epoch",
    num_train_epochs=6,
    save_strategy="epoch",
    logging_steps=100,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    group_by_length=True,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
)

from trl import SFTTrainer
max_seq_length = 2048

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()