CUDA out of memory when training mT5-XL

I’m trying to train an LLM (mT5-XL) using the Transformers library, but I keep getting this error:

torch.cuda.OutOfMemoryError: CUDA out of memory

even though I have 80 GB of GPU memory and the model should only need about 48 GB according to the Model Memory Utility (a Hugging Face Space by hf-accelerate).

This is what the code looks like at the moment:

tokenizer = T5Tokenizer.from_pretrained("google/mt5-xl", cache_dir='cache')
model = T5ForConditionalGeneration.from_pretrained("google/mt5-xl", cache_dir='cache')

training_args = TrainingArguments(run_name = "MSM-300k-b64-mt5-XL-DSI",
    output_dir = "models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate = 0.0005,
    warmup_steps = 10000,
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 8,
    evaluation_strategy = "steps",
    eval_steps = 1000,
    max_steps = 300000,
    save_strategy = "steps",
    dataloader_num_workers = 10,
    save_steps = 1000,
    save_total_limit = 10,
    gradient_accumulation_steps = 1,
    report_to = "none",
    logging_steps = 100,
    dataloader_drop_last = False,
    metric_for_best_model = "Hits@10",
    greater_is_better = True)

train_dataset = IndexingTrainDataset(path_to_data="path_to_train_data.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     tokenizer=tokenizer)

valid_dataset = IndexingTrainDataset(path_to_data="path_to_dev_data.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     remove_prompt=True,
                                     tokenizer=tokenizer)

trainer = DSITrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=IndexingCollator(
        tokenizer,
        padding='longest',
    ),
    compute_metrics=make_compute_metrics(tokenizer, train_dataset.valid_ids),
    restrict_decode_vocab=restrict_decode_vocab,
    id_max_length=256
)
trainer.train()

I don’t know why the program needs so much memory when 80 GB should be more than enough.


Not an expert opinion, but it’s not just the model weights that have to fit on the GPU: the gradients, the optimizer states, and the activations for every sequence in the batch take space too, and the activations grow with the batch size and sequence length. I’d suggest decreasing per_device_train_batch_size (you can raise gradient_accumulation_steps to keep the same effective batch size).
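
For example, something along these lines keeps the effective batch at 64 while cutting the per-step activation memory. The 8 × 8 split, gradient_checkpointing and bf16 are just a starting point I’d try first, not values I’ve tested with your DSITrainer setup:

training_args = TrainingArguments(
    run_name="MSM-300k-b64-mt5-XL-DSI",
    output_dir="models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate=0.0005,
    warmup_steps=10000,
    per_device_train_batch_size=8,   # was 64
    gradient_accumulation_steps=8,   # 8 x 8 = same effective batch of 64
    per_device_eval_batch_size=8,
    gradient_checkpointing=True,     # recompute activations instead of storing them all
    bf16=True,                       # roughly halves activation memory if the GPU supports bfloat16
    evaluation_strategy="steps",
    eval_steps=1000,
    max_steps=300000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=10,
    dataloader_num_workers=10,
    dataloader_drop_last=False,
    metric_for_best_model="Hits@10",
    greater_is_better=True,
    report_to="none",
    logging_steps=100)

If it still runs out of memory at batch size 8, the next step would be sharding or offloading the optimizer states (for example via the Trainer’s DeepSpeed or FSDP integrations), but a smaller batch plus gradient checkpointing is usually enough on a single 80 GB card.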