CUDA memory usage with DeepSpeed on 4 GPUs is the same as on 1 GPU

Hi, I’m trying to fine-tune bart-large on 1080 Ti GPUs (about 11 GB each) with DeepSpeed ZeRO-3, but I hit OOM whether I use 1 or 4 GPUs (with batch size = 1). I don’t know whether bart-large is too big for my GPUs or whether I’m using DeepSpeed incorrectly.
So I tested the same code, changing only the model from “bart-large” to “bart-base”. GPU memory usage was the same (about 5000 MiB) whether I used 1 or 4 GPUs, so I guess my ZeRO-3 setup is incorrect. Please help me figure out why. Thank you very much!
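
To make my expectation concrete, here is a rough sketch of the per-GPU memory I would expect for model states alone (weights, gradients, Adam states) if ZeRO-3 were actually partitioning them. The 16 bytes per parameter figure is the usual estimate for mixed-precision Adam, the parameter counts (roughly 139M and 406M) are approximate, and `model_state_gib` is a hypothetical helper. Activations and temporary buffers are ignored.

```python
# Rough model-state memory estimate for mixed-precision Adam training.
# Assumption: 2 bytes fp16 weights + 2 bytes fp16 gradients
# + 12 bytes fp32 master weights, momentum, and variance = 16 bytes/param.
BYTES_PER_PARAM = 2 + 2 + 12

# Approximate parameter counts (assumed, not measured from my run).
APPROX_PARAMS = {"bart-base": 139e6, "bart-large": 406e6}


def model_state_gib(num_params, num_gpus, zero_stage3=True):
    """Return the estimated model-state memory per GPU in GiB."""
    total_bytes = num_params * BYTES_PER_PARAM
    if zero_stage3:
        # ZeRO-3 partitions weights, gradients, and optimizer states across GPUs.
        total_bytes /= num_gpus
    return total_bytes / 2**30


for name, n_params in APPROX_PARAMS.items():
    for gpus in (1, 4):
        est = model_state_gib(n_params, gpus)
        print(f"{name}: {gpus} GPU(s) -> {est:.2f} GiB per GPU")
```

If partitioning were active, the 4-GPU estimate should be about a quarter of the 1-GPU one, which is not what I observe.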


  1. I use the DeepSpeed config from transformers/tests/deepspeed/ds_config_zero3.json
  2. Here is my code:

         training_args = Seq2SeqTrainingArguments(
             output_dir='./Model/bartbase',  # output directory
             evaluation_strategy='steps',  # Evaluation is done (and logged) every eval_steps.
             per_device_train_batch_size=train_batch_size,  # batch size per device during training
             per_device_eval_batch_size=2*train_batch_size,  # batch size for evaluation
             gradient_accumulation_steps=update_freq,
             eval_accumulation_steps=1,
             max_grad_norm=0.1,  # Maximum gradient norm (for gradient clipping).
             num_train_epochs=20,
             logging_steps=10,
             save_steps=save_every,
             fp16=True,
             eval_steps=eval_every,
             disable_tqdm=False,
             # label_smoothing_factor=label_smoothing,
             # deepspeed=""
         )
         optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)
         scheduler = get_polynomial_decay_schedule_with_warmup(
             optimizer=optimizer,
             num_warmup_steps=warmup_updates,
             num_training_steps=total_num_update
         )
         trainer = Seq2SeqTrainer(
             model=model,  # the instantiated 🤗 Transformers model to be trained
             args=training_args,  # training arguments, defined above
             train_dataset=train_dataset,  # training dataset
             eval_dataset=valid_dataset,  # evaluation dataset
             data_collator=collator,
             tokenizer=tokenizer,
             optimizers=(optimizer, scheduler)
         )
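
For reference on item 1, this is a minimal sketch of the kind of ZeRO-3 config I understand can be passed through the `deepspeed` argument of the training arguments. It is an assumed, stripped-down example and not the contents of ds_config_zero3.json; the file name `ds_config_sketch.json` is hypothetical, and the "auto" values rely on the Trainer integration filling them in.

```python
import json

# Assumed minimal ZeRO-3 config; NOT a copy of ds_config_zero3.json.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 3},  # stage 3 partitions params, grads, optimizer states
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Hypothetical file name, used only for this illustration.
with open("ds_config_sketch.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The path would then be passed as deepspeed="ds_config_sketch.json"
# in Seq2SeqTrainingArguments, with the script started by a multi-GPU launcher.
```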