I’m trying to train an LLM (mT5-XL) with the transformers library, but I keep getting the error:
torch.cuda.OutOfMemoryError: CUDA out of memory
even though I have 80 GB of RAM and, according to the Model Memory Utility (a Hugging Face Space by hf-accelerate), the model should only need about 48 GB.
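For context, this is roughly the back-of-envelope estimate I’m comparing against: weights, gradients, and Adam optimizer states in fp32, counted from the model config without actually loading the weights (a minimal sketch using accelerate’s init_empty_weights; it only covers the static footprint, not the per-batch activations):

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/mt5-xl")
with init_empty_weights():
    # Builds the model on the "meta" device, so no real memory is allocated.
    empty_model = AutoModelForSeq2SeqLM.from_config(config)

n_params = sum(p.numel() for p in empty_model.parameters())
weights_gib = n_params * 4 / 2**30  # fp32 weights, 4 bytes per parameter
print(f"parameters:        {n_params / 1e9:.2f} B")
print(f"weights (fp32):    {weights_gib:.1f} GiB")
print(f"gradients (fp32):  {weights_gib:.1f} GiB")
print(f"Adam m+v (fp32):   {2 * weights_gib:.1f} GiB")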
This is what the code looks like at the moment:
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments

# IndexingTrainDataset, DSITrainer, IndexingCollator, make_compute_metrics and
# restrict_decode_vocab come from my DSI training code.

tokenizer = T5Tokenizer.from_pretrained("google/mt5-xl", cache_dir='cache')
model = T5ForConditionalGeneration.from_pretrained("google/mt5-xl", cache_dir='cache')

training_args = TrainingArguments(
    run_name="MSM-300k-b64-mt5-XL-DSI",
    output_dir="models/MSM-300k-b64-mt5-XL-DSI",
    learning_rate=0.0005,
    warmup_steps=10000,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    eval_steps=1000,
    max_steps=300000,
    save_strategy="steps",
    dataloader_num_workers=10,
    save_steps=1000,
    save_total_limit=10,
    gradient_accumulation_steps=1,
    report_to="none",
    logging_steps=100,
    dataloader_drop_last=False,
    metric_for_best_model="Hits@10",
    greater_is_better=True,
)

train_dataset = IndexingTrainDataset(
    path_to_data="path_to_train_data.json",
    max_length=256,
    cache_dir='cache',
    tokenizer=tokenizer,
)

valid_dataset = IndexingTrainDataset(
    path_to_data="path_to_dev_data.json",
    max_length=256,
    cache_dir='cache',
    remove_prompt=True,
    tokenizer=tokenizer,
)

trainer = DSITrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=IndexingCollator(
        tokenizer,
        padding='longest',
    ),
    compute_metrics=make_compute_metrics(tokenizer, train_dataset.valid_ids),
    restrict_decode_vocab=restrict_decode_vocab,
    id_max_length=256,
)

trainer.train()
I don’t understand why the program needs so much memory when 80 GB should be more than enough.
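For what it’s worth, this is the minimal check I can add to see how much GPU memory is actually in use before and during training (a sketch, assuming a single CUDA device; since the Trainer moves the model to the GPU itself, it could also go inside a TrainerCallback):

import torch

# Run after the model is on the GPU to see where the memory is going.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.1f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))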