Hey, I'm trying to use the built-in Trainer with 2× A100 80GB on RedPajama-INCITE-7B-Base. When I trained the model with PEFT before, I could easily run it on a single A100 80GB, but without PEFT I can't train it at all; I run out of memory every time. So now that I allocate 2 GPUs, I can load the model onto one of them, but when I call trainer.train() I get no output and it seems I'm stuck in an endless loop.
This is my Trainer and training_args if that helps:
```python
training_args = TrainingArguments(
    output_dir="/home/kubiak/FullTune",
    logging_dir="/home/kubiak/FullTune",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=logging_steps,
    optim="adamw_torch",
    save_total_limit=1,
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=24,
    id2label=num_to_labels,
    label2id=labels_to_num,
).to(device)
model.config.pad_token_id = model.config.eos_token_id

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_data["train"],
    eval_dataset=processed_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

result = trainer.train()
print_summary(result)
```
For the device I have tried "cuda", "cuda:0", and "cuda:1", and nothing seems to work.
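In case it matters, this is roughly how I'm picking the device (a minimal sketch; the CPU fallback is just there so the snippet runs anywhere, my actual runs are on the GPUs):

```python
import torch

# I tried plain "cuda" as well as pinning to "cuda:0" or "cuda:1" here
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```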
Thank you for your help!