CUDA out of memory when training mt5-XL

I’m trying to train an LLM (mt5-XL) using the transformers library, but I keep getting the error:

torch.cuda.OutOfMemoryError: CUDA out of memory

Even though I have 80 GB of GPU memory, and this model should only need about 48 GB according to the Model Memory Utility (a Hugging Face Space by hf-accelerate).

This is what the code looks like at the moment:

from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments

tokenizer = T5Tokenizer.from_pretrained("google/mt5-xl", cache_dir='cache')
model = T5ForConditionalGeneration.from_pretrained("google/mt5-xl", cache_dir='cache')

training_args = TrainingArguments(run_name = "MSM-300k-b64-mt5-XL-DSI",
    output_dir = "models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate = 0.0005,
    warmup_steps = 10000,
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 8,
    evaluation_strategy = "steps",
    eval_steps = 1000,
    max_steps = 300000,
    save_strategy = "steps",
    dataloader_num_workers = 10,
    save_steps = 1000,
    save_total_limit = 10,
    gradient_accumulation_steps = 1,
    report_to = "none",
    logging_steps = 100,
    dataloader_drop_last = False,
    metric_for_best_model = "Hits@10",
    greater_is_better = True)

train_dataset = IndexingTrainDataset(path_to_data="path_to_train_data.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     tokenizer=tokenizer)

valid_dataset = IndexingTrainDataset(path_to_data="path_to_dev_data.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     remove_prompt=True,
                                     tokenizer=tokenizer)

trainer = DSITrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=IndexingCollator(
        tokenizer,
        padding='longest',
    ),
    compute_metrics=make_compute_metrics(tokenizer, train_dataset.valid_ids),
    restrict_decode_vocab=restrict_decode_vocab,
    id_max_length=256
)
trainer.train()

I don’t know why the program needs so much memory when 80 GB should be more than enough.

Not an expert opinion, but it isn’t just the model that has to fit in memory: the activations for each training batch take space too, and they grow with the batch size and sequence length. I’d suggest decreasing the batch size.
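
As a minimal sketch of that suggestion (the specific values 8 and 8 are assumptions, not a recommendation), you could lower per_device_train_batch_size and raise gradient_accumulation_steps so the effective batch size stays at 64, keeping all the other arguments from your original TrainingArguments:

# Hypothetical adjustment: smaller per-device batch, same effective batch size (8 * 8 = 64).
# Only the two batch-related arguments change; everything else is as in the original post.
training_args = TrainingArguments(
    run_name = "MSM-300k-b64-mt5-XL-DSI",
    output_dir = "models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate = 0.0005,
    warmup_steps = 10000,
    per_device_train_batch_size = 8,   # was 64; activation memory scales with this
    gradient_accumulation_steps = 8,   # was 1; 8 * 8 = 64 keeps the effective batch size
    per_device_eval_batch_size = 8,
    evaluation_strategy = "steps",
    eval_steps = 1000,
    max_steps = 300000,
    save_strategy = "steps",
    dataloader_num_workers = 10,
    save_steps = 1000,
    save_total_limit = 10,
    report_to = "none",
    logging_steps = 100,
    dataloader_drop_last = False,
    metric_for_best_model = "Hits@10",
    greater_is_better = True)

If it still runs out of memory at 8, keep halving the per-device batch size and doubling the accumulation steps; the optimizer updates stay equivalent, at the cost of more forward/backward passes per step.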