I am trying to fine-tune t5-large on a seq2seq task in a SageMaker notebook, rewriting old content to fit new standards and formatting. I was able to fine-tune t5-small, but I haven't been able to run the larger model. Even on larger EC2 instances, like the ml.g5.12xlarge I'm using, I run into this error:
CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 22.20 GiB total capacity; 20.78 GiB already allocated; 74.06 MiB free; 21.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
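Based on the suggestion in the error itself, I think the allocator tweak would look like the snippet below (the 128 value is a guess on my part), but with only ~74 MiB free I doubt fragmentation is the actual problem:

import os

# Per the error message; must be set before the first CUDA allocation,
# i.e. before torch touches the GPU. The 128 here is just a guess.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"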
It seems like Trainer isn't distributing the model across the instance's four GPUs; the error only ever mentions GPU 0. How can I change that? Seq2SeqTrainingArguments doesn't have the nproc_per_node argument I've seen commonly mentioned.
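As far as I can tell, nproc_per_node belongs to the torchrun launcher rather than to the training arguments, i.e. something like the line below run from a terminal, where train.py is a hypothetical script holding the code further down. I'm not sure how that applies inside a notebook:

torchrun --nproc_per_node=4 train.py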
Here are my training arguments and Trainer setup:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

batch_size = 1
model_name = "T5ForConditionalGeneration_workday"

args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-old_content-to-new_content",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    num_train_epochs=3,
    fp16=True,                    # mixed precision to reduce memory
    optim="adafactor",            # Adafactor keeps optimizer state small
    predict_with_generate=True,   # run generate() during evaluation
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
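For what it's worth, the closest thing I've found to actually splitting the model across GPUs is accelerate's device_map. This is a sketch of what I mean, not something I've confirmed works with Seq2SeqTrainer:

from transformers import T5ForConditionalGeneration

# Hypothetical: shard t5-large's layers across the g5.12xlarge's four GPUs
# instead of replicating the whole model; requires the `accelerate` package.
model = T5ForConditionalGeneration.from_pretrained(
    "t5-large",
    device_map="auto",  # lets accelerate place layers on available GPUs
)

Is that the right direction, or is there a Trainer or launcher setting I'm missing?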