Facing OOM error when fine-tuning TrOCR on custom data

Hi,

I have recently started working with deep learning models and am trying to fine-tune Microsoft's TrOCR model on custom data. I am using 4 T4 GPUs with 16 GB of memory each (a g4dn.12xlarge AWS instance). With the default optimizer, training would not start at all due to memory constraints; switching to Adafactor resolved that. However, when the model evaluates on the validation dataset, it fails with a CUDA OOM error. I started with a batch size of 16 and came down to 4, but I still get the same error. I am using the following code snippets and configs:
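
For context, my train and eval datasets are simple Dataset wrappers around the image paths and label texts. This is a simplified sketch of what they look like (the actual class has a few more details, but the structure is the same):

import torch
from PIL import Image
from torch.utils.data import Dataset


class OCRDataset(Dataset):
    # simplified, illustrative version of my actual dataset class
    def __init__(self, df, processor, max_target_length=98):
        self.df = df  # dataframe with file_name and text columns
        self.processor = processor
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # image -> pixel values for the TrOCR encoder
        image = Image.open(self.df["file_name"][idx]).convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values
        # text -> token ids, padded to max_target_length
        labels = self.processor.tokenizer(
            self.df["text"][idx],
            padding="max_length",
            max_length=self.max_target_length,
        ).input_ids
        # replace pad token ids with -100 so they are ignored by the loss
        labels = [l if l != self.processor.tokenizer.pad_token_id else -100 for l in labels]
        return {"pixel_values": pixel_values.squeeze(), "labels": torch.tensor(labels)}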

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

print_gpu_utilization()
`GPU memory occupied: 258 MB.`

import torch

# allocating a small tensor just to initialize the CUDA context
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()
`GPU memory occupied: 363 MB.`

import gc
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

gc.collect()
torch.cuda.empty_cache()
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed").to("cuda")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
print_gpu_utilization()
`GPU memory occupied: 1701 MB.`

model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size

# set beam search parameters
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.max_length = 98
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4

import os

device = torch.device('cuda')
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, default_data_collator

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=False,
    evaluation_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=True,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    output_dir="./runs",
    logging_steps=5,
    save_steps=500,
    eval_steps=100,
    optim="adafactor",
    num_train_epochs=15,
)
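
For completeness, compute_metrics (passed to the trainer below) is roughly the standard CER computation, sketched here using the evaluate library's cer metric and the processor defined above (since predict_with_generate is False, the predictions are logits, so it takes the argmax first):

from evaluate import load

cer_metric = load("cer")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    if isinstance(pred_ids, tuple):
        pred_ids = pred_ids[0]
    # with predict_with_generate=False the predictions are logits, so take the argmax
    if pred_ids.ndim == 3:
        pred_ids = pred_ids.argmax(-1)
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    # put the pad token back in place of -100 before decoding the references
    labels_ids[labels_ids == -100] = processor.tokenizer.pad_token_id
    label_str = processor.batch_decode(labels_ids, skip_special_tokens=True)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    return {"cer": cer}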

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor.feature_extractor,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator
)
trainer.train()

Considering the model size and parameter count, I think the available memory should be enough, so this is probably an issue with my config or training arguments. What steps can I take to improve my training process? Thanks in advance.