CUDA out of memory when training mt5-XL

I’m trying to train an LLM (mt5-XL) using the transformers library, but I keep getting the error:

torch.cuda.OutOfMemoryError: CUDA out of memory

Even though I have 80 GB of GPU memory, and this model should only need about 48 GB according to the Model Memory Utility (a Hugging Face Space by hf-accelerate).

This is what the code looks like at the moment:

from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments

tokenizer = T5Tokenizer.from_pretrained("google/mt5-xl", cache_dir='cache')
model = T5ForConditionalGeneration.from_pretrained("google/mt5-xl", cache_dir='cache')

training_args = TrainingArguments(run_name = "MSM-300k-b64-mt5-XL-DSI",
    output_dir = "models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate = 0.0005,
    warmup_steps = 10000,
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 8,
    evaluation_strategy = "steps",
    eval_steps = 1000,
    max_steps = 300000,
    save_strategy = "steps",
    dataloader_num_workers = 10,
    save_steps = 1000,
    save_total_limit = 10,
    gradient_accumulation_steps = 1,
    report_to = "none",
    logging_steps = 100,
    dataloader_drop_last = False,
    metric_for_best_model = "Hits@10",
    greater_is_better = True)

train_dataset = IndexingTrainDataset(path_to_data="path_to_train_data.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     tokenizer=tokenizer)

valid_dataset = IndexingTrainDataset(path_to_data="path_to_dev_data.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     remove_prompt=True,
                                     tokenizer=tokenizer)

trainer = DSITrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=IndexingCollator(
        tokenizer,
        padding='longest',
    ),
    compute_metrics=make_compute_metrics(tokenizer, train_dataset.valid_ids),
    restrict_decode_vocab=restrict_decode_vocab,
    id_max_length=256
)
trainer.train()

I don’t know why the program needs so much memory when 80 GB should be more than enough.

Not an expert opinion, but it isn’t just the model that has to fit in memory: the activations for each training batch take space too, and they grow with the batch size and sequence length. I’d suggest decreasing the batch size.
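
As a minimal sketch of that suggestion (the specific values 8 and 8 are assumptions, not a recommendation), you could lower per_device_train_batch_size and raise gradient_accumulation_steps so the effective batch size stays at 64, keeping all the other arguments from your original TrainingArguments:

# Hypothetical adjustment: smaller per-device batch, same effective batch size (8 * 8 = 64).
# Only the two batch-related arguments change; everything else is as in the original post.
training_args = TrainingArguments(
    run_name = "MSM-300k-b64-mt5-XL-DSI",
    output_dir = "models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate = 0.0005,
    warmup_steps = 10000,
    per_device_train_batch_size = 8,   # was 64; activation memory scales with this
    gradient_accumulation_steps = 8,   # was 1; 8 * 8 = 64 keeps the effective batch size
    per_device_eval_batch_size = 8,
    evaluation_strategy = "steps",
    eval_steps = 1000,
    max_steps = 300000,
    save_strategy = "steps",
    dataloader_num_workers = 10,
    save_steps = 1000,
    save_total_limit = 10,
    report_to = "none",
    logging_steps = 100,
    dataloader_drop_last = False,
    metric_for_best_model = "Hits@10",
    greater_is_better = True)

If it still runs out of memory at 8, keep halving the per-device batch size and doubling the accumulation steps; the optimizer updates stay equivalent, at the cost of more forward/backward passes per step.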