Imbalanced memory usage on multiple GPUs

Hi,

I am using the Trainer API to train a BART model.

from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='./models/bart',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    predict_with_generate=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,                       
    args=training_args,                  
    train_dataset=train_dataset,        
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer
)

I found that memory usage is imbalanced when training on multiple GPUs:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14760      C   python                          10513MiB |
|    1   N/A  N/A     14760      C   python                           4811MiB |
|    2   N/A  N/A     14760      C   python                           4811MiB |
|    3   N/A  N/A     14760      C   python                           4811MiB |
|    4   N/A  N/A     14760      C   python                           4811MiB |
|    5   N/A  N/A     14760      C   python                           4811MiB |
|    6   N/A  N/A     14760      C   python                           4811MiB |
|    7   N/A  N/A     14760      C   python                           4811MiB |
+-----------------------------------------------------------------------------+

Is there a way to balance the memory usage?

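The reason for this, as far as I know, is that GPU 0 acts as the main device. When a single process drives all the GPUs (every row in your table has the same PID), the Trainer wraps the model in torch.nn.DataParallel. On each step the model is replicated from GPU 0 to GPUs 1-7, the batch is split across the GPUs, and the outputs are gathered back on GPU 0. During the backward pass, the gradients computed on GPUs 1-7 are summed onto GPU 0, where the optimizer updates the parameters. The updated parameters are then copied to GPUs 1-7 again at the next forward pass. In short, the forward pass is distributed and the parameter update is synchronized through GPU 0.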

So one GPU has to hold the gathered outputs, the accumulated gradients, and the optimizer state on top of its own copy of the model. Currently, I am not aware of a method to reduce the memory usage on the main GPU.
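
To make the data flow described above more concrete, here is a minimal CPU-only sketch of one training step. It is a toy simulation, not the Trainer's or PyTorch's actual implementation: the "devices" are just list entries, the model is a single weight w fitted to y = 3x, and every function name (scatter, local_grad_sum, data_parallel_step, single_device_step) is made up for illustration. The batch of 16 examples is an assumption chosen to match 8 GPUs with per_device_train_batch_size=2.

```python
# Toy, CPU-only simulation of the single-process data-parallel flow.
# All names are hypothetical; no real GPU or PyTorch call is involved.

def scatter(batch, n_devices):
    """Split a batch into n_devices contiguous chunks (some may be empty)."""
    size = -(-len(batch) // n_devices)  # ceiling division
    return [batch[i * size:(i + 1) * size] for i in range(n_devices)]


def local_grad_sum(w, chunk):
    """Sum of d/dw (w*x - y)^2 over the examples held by one device."""
    return sum(2.0 * (w * x - y) * x for x, y in chunk)


def data_parallel_step(w_main, batch, n_devices, lr):
    # 1. Replicate: every device receives a copy of the main device's weight.
    replicas = [w_main] * n_devices
    # 2. Scatter: each device gets its own slice of the batch.
    chunks = scatter(batch, n_devices)
    # 3. Each device computes its partial gradient independently.
    grads = [local_grad_sum(w, c) for w, c in zip(replicas, chunks)]
    # 4. Reduce: partial gradients are summed on the main device (device 0).
    grad_main = sum(grads) / len(batch)
    # 5. Update: only the main device applies the optimizer step.
    return w_main - lr * grad_main


def single_device_step(w, batch, lr):
    """Reference: the same gradient-descent step computed on one device."""
    return w - lr * local_grad_sum(w, batch) / len(batch)


# 16 examples = 8 devices x 2 examples each; targets follow y = 3x.
batch = [(float(x), 3.0 * x) for x in range(1, 17)]
w_parallel = data_parallel_step(0.0, batch, n_devices=8, lr=0.001)
w_single = single_device_step(0.0, batch, lr=0.001)
print(w_parallel, w_single)
```

Both steps end at the same weight. Steps 4 and 5 happen only on the main device, which is why it carries the extra load in this kind of setup.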

Thanks for your reply!