Errors with Distributed Fine-Tuning T5 for seq2seq on SageMaker

I am trying to fine-tune T5-large on a seq2seq task in a SageMaker notebook to rewrite old content to fit new standards and formatting. I was able to fine-tune t5-small, but I haven't been able to run the larger model. Even on larger EC2 instances, like the ml.g5.12xlarge I'm using, I run into this error:

CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 22.20 GiB total capacity; 20.78 GiB already allocated; 74.06 MiB free; 21.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
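As I understand it, the max_split_size_mb suggestion in the error means setting the allocator config through an environment variable before anything touches CUDA, roughly like this (128 is just an illustrative value, not something I've tuned):

import os

# Set before any CUDA work happens, e.g. in the first notebook cell.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"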

It seems like the Trainer isn't distributing the model across the instance's GPUs. How can I change that? Seq2SeqTrainingArguments doesn't have the nproc_per_node argument that I've seen commonly mentioned.
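For reference, my current understanding is that nproc_per_node isn't a training argument at all but a flag for the torchrun launcher, and that accelerate's notebook_launcher plays the same role inside a notebook. Something like the sketch below is what I think is being suggested (train_fn here is just a placeholder wrapping the trainer code shown further down); I'm not sure whether this is actually the right fix for the OOM, though:

from accelerate import notebook_launcher

def train_fn():
    # Placeholder: build the model, datasets, Seq2SeqTrainingArguments and
    # Seq2SeqTrainer here (the code shown below), then call trainer.train().
    # Trainer should pick up the distributed environment that
    # notebook_launcher sets up and run data-parallel across the GPUs.
    ...

# ml.g5.12xlarge has 4 GPUs, so num_processes=4 would correspond to
# nproc_per_node=4 with torchrun.
notebook_launcher(train_fn, args=(), num_processes=4)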

These are the training arguments:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

batch_size = 1

model_name = "T5ForConditionalGeneration_workday"
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-old_content-to-new_content",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    num_train_epochs=3,
    fp16=True,
    optim="adafactor",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
   model,
   args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)
trainer.train()