Errors with Distributed Fine-Tuning T5 for seq2seq on SageMaker

I am trying to fine-tune T5-large on a seq2seq task in a SageMaker notebook to rewrite old content to fit new standards and formatting. I was able to fine-tune t5-small, but I haven't been able to run the larger model. Even on larger EC2 instances, like the ml.g5.12xlarge I'm using, I run into this error:

CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 22.20 GiB total capacity; 20.78 GiB already allocated; 74.06 MiB free; 21.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
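As I understand it, the max_split_size_mb suggestion in the error means setting the allocator config through an environment variable before anything touches CUDA, roughly like this (128 is just an illustrative value, not something I've tuned):

import os

# Set before any CUDA work happens, e.g. in the first notebook cell.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"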

It seems like the Trainer isn't distributing the model across the instance's GPUs. How can I change that? Seq2SeqTrainingArguments doesn't have the nproc_per_node argument that I've seen commonly mentioned.
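For reference, my current understanding is that nproc_per_node isn't a training argument at all but a flag for the torchrun launcher, and that accelerate's notebook_launcher plays the same role inside a notebook. Something like the sketch below is what I think is being suggested (train_fn here is just a placeholder wrapping the trainer code shown further down); I'm not sure whether this is actually the right fix for the OOM, though:

from accelerate import notebook_launcher

def train_fn():
    # Placeholder: build the model, datasets, Seq2SeqTrainingArguments and
    # Seq2SeqTrainer here (the code shown below), then call trainer.train().
    # Trainer should pick up the distributed environment that
    # notebook_launcher sets up and run data-parallel across the GPUs.
    ...

# ml.g5.12xlarge has 4 GPUs, so num_processes=4 would correspond to
# nproc_per_node=4 with torchrun.
notebook_launcher(train_fn, args=(), num_processes=4)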

These are the training arguments:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

batch_size = 1

model_name = "T5ForConditionalGeneration_workday"
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-old_content-to-new_content",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    num_train_epochs=3,
    fp16=True,
    optim="adafactor",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
   model,
   args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   data_collator=data_collator,
   tokenizer=tokenizer,
   compute_metrics=compute_metrics
)
trainer.train()