Hey, I am trying to use the built-in Trainer with 2 A100 80GB GPUs to fine-tune RedPajama-INCITE-7B-Base. When I used PEFT before, I could easily train the model on 1 A100 80GB, but without PEFT I can't train it at all and run out of memory every time. So now I allocate 2 GPUs: I can load the model onto one of them, but when I call trainer.train() I don't get any output and it seems I'm stuck in an endless loop.
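For context, my earlier PEFT run looked roughly like this, applied to the same model I load below (a minimal sketch from memory; the exact rank, alpha, and target_modules are assumptions, not the values I actually used):

from peft import LoraConfig, TaskType, get_peft_model

# rough reconstruction of the PEFT setup that fit on a single A100
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                                # assumed rank, not necessarily what I used
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX-style attention projection
)
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()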
This is my Trainer and training_args if that helps:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/home/kubiak/FullTune",
    logging_dir="/home/kubiak/FullTune",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=logging_steps,
    optim="adamw_torch",
    save_total_limit=1,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=24, id2label=num_to_labels, label2id=labels_to_num
).to(device)
model.config.pad_token_id = model.config.eos_token_id

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_data["train"],
    eval_dataset=processed_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

result = trainer.train()
print_summary(result)
For device I have tried "cuda", "cuda:0", and "cuda:1", and nothing seems to work.
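In other words, the model is moved onto the GPU with something like this (just a sketch of how device is set; the device_count print is only there to confirm both GPUs are visible to the process):

import torch

print(torch.cuda.device_count())  # should report 2 if both A100s are visible
device = torch.device("cuda")     # also tried "cuda:0" and "cuda:1"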
Thank you for your help!