Hi,
I recently started working with deep learning models and am trying to fine-tune Microsoft's TrOCR model. I am using four T4 GPUs with 16 GB of memory each (an AWS g4dn.12xlarge instance). Initially I used the default optimizer, but training would not start because of memory constraints. Switching to Adafactor fixed that. However, when the model is evaluated on the validation dataset, it fails with a CUDA OOM error. I started with a batch size of 16 and reduced it to 4, but I still get the same error. I am using the following code snippets and configs:
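from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def print_gpu_utilization():
    # Reports memory in use on GPU 0 only
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")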
print_gpu_utilization()
`GPU memory occupied: 258 MB.`
torch.ones((1, 1)).to("cuda")
print_gpu_utilization()
`GPU memory occupied: 363 MB.`
gc.collect()
torch.cuda.empty_cache()
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed").to("cuda")
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
print_gpu_utilization()
`GPU memory occupied: 1701 MB.`
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
# make sure vocab size is set correctly
model.config.vocab_size = model.config.decoder.vocab_size
# set beam search parameters
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.max_length = 98
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4
device = torch.device('cuda')
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=False,
    evaluation_strategy="steps",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=True,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    output_dir="./runs",
    logging_steps=5,
    save_steps=500,
    eval_steps=100,
    optim="adafactor",
    num_train_epochs=15
)
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor.feature_extractor,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator
)
trainer.train()
Given the model size and parameter count, I think the memory should be sufficient, so this may be an issue with my config or training parameters. What steps can I take to improve my training process? Thanks in advance.
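
A rough way to sanity-check the "memory should be enough" intuition is to estimate how large the evaluation outputs could get, separately from the model weights. With `predict_with_generate=False`, the trainer collects logits of shape (samples, sequence length, vocabulary size), and as far as I understand they stay on the GPU for the whole evaluation run unless `eval_accumulation_steps` is set. The sketch below is only back-of-the-envelope arithmetic: the function name is made up for illustration, and the sample count, the sequence length of 98 (borrowed from `max_length` above) and the vocabulary size of 50,265 are assumptions, not values measured from this setup.

```python
def estimate_logits_memory_gib(num_eval_samples, seq_len, vocab_size, bytes_per_value=4):
    """Rough size in GiB of the logits tensor if every eval output is kept in memory.

    bytes_per_value is an assumption: 4 for float32, 2 for float16.
    """
    total_values = num_eval_samples * seq_len * vocab_size
    return total_values * bytes_per_value / 1024**3


# Hypothetical numbers, not taken from the actual dataset:
# 1,000 validation samples, 98 tokens per label sequence, 50,265 vocabulary entries.
estimate = estimate_logits_memory_gib(1000, 98, 50265)
print(f"Estimated logits memory: {estimate:.2f} GiB")
```

If the real numbers are anywhere near these, the accumulated logits alone could exceed the 16 GB of a single T4, regardless of the eval batch size. In that case, setting `eval_accumulation_steps` in `Seq2SeqTrainingArguments`, so that outputs are moved to the CPU every few steps, may be worth trying.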