Using Huggingface-Trainer with 2 GPUs (Endless Loop)

Hey, I am trying to use the built-in Trainer with 2 A100 80GB GPUs on RedPajama-INCITE-7B-Base. When I used PEFT before, I could easily train the model on a single A100 80GB, but without PEFT I can't train it at all; I run out of memory every time. So now I'm trying to allocate 2 GPUs: I can load the model onto one of them, but when I call trainer.train() I get no output and it seems I'm stuck in an endless loop.
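
(For reference, by PEFT I mean a LoRA-style setup through the peft library, roughly like the sketch below; the config values here are just illustrative placeholders, not my actual settings.)

from peft import LoraConfig, TaskType, get_peft_model

# Illustrative LoRA config only -- the r / alpha / dropout values are placeholders
peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(model, peft_config)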

This is my Trainer and training_args if that helps:

from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification

training_args = TrainingArguments(
    output_dir="/home/kubiak/FullTune",
    logging_dir="/home/kubiak/FullTune",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=logging_steps,
    optim="adamw_torch",
    save_total_limit=1,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=24, id2label=num_to_labels, label2id=labels_to_num
).to(device)
model.config.pad_token_id = model.config.eos_token_id

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_data["train"],
    eval_dataset=processed_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

result = trainer.train()

print_summary(result)

For the device, I have tried "cuda", "cuda:0", and "cuda:1", and nothing seems to work.
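
In case it helps to narrow things down, this is the kind of check I can run (just a sketch; device and training_args are the same objects as above):

import torch

device = torch.device("cuda")  # also tried "cuda:0" and "cuda:1"
print(torch.cuda.device_count())                  # how many GPUs the process sees
print(training_args.device, training_args.n_gpu)  # what TrainingArguments resolves to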

Thank you for your help!
