Cannot use multiple GPUs

Problem: a CUDA memory error that occurs only when using multiple GPUs

Background:
Custom training script and dataset; I have multiple A40 GPUs.
I'm training T5 with Seq2SeqTrainer, and accelerate is installed.

Here are the arguments and training portions of the script. Happy to share the preprocessing steps, but I don't think they're related.

args = Seq2SeqTrainingArguments(
        output_dir='./data/nomelt-model/model',
        # Trainer meta parameters
        log_level='debug',
        do_train=True,
        do_eval=True,
        evaluation_strategy=save_strat,
        eval_steps=steps_per_save,
        prediction_loss_only=True,
        save_strategy=save_strat,
        save_steps=steps_per_save,
        save_total_limit=20,
        logging_strategy='steps',
        logging_steps=1,
        predict_with_generate=True,
        generation_max_length=params['training']['max_length'],
        load_best_model_at_end=True,
        # training parameters
        num_train_epochs=params['training']['epochs'],
        # batches
        per_device_train_batch_size=params['training']['per_device_batch_size'],
        per_device_eval_batch_size=params['training']['per_device_batch_size'],
        gradient_accumulation_steps=params['training']['gradient_accumulation'],
        gradient_checkpointing=params['training']['gradient_checkpointing'],
        auto_find_batch_size=params['training']['auto_find_batch_size'],
        # optimizer
        learning_rate=params['training']['learning_rate'],
        lr_scheduler_type=params['training']['lr_scheduler_type'],
        warmup_ratio=params['training']['warmup_ratio'],
        optim=params['training']['optim'],
        optim_args=params['training']['optim_args'],
        label_smoothing_factor=params['training']['label_smoothing_factor'],
        # precision
        fp16=params['training']['fp16'],
    )

trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=dataset['train'],
        eval_dataset=dataset['eval_sample'] if 'eval_sample' in dataset else dataset['eval'],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
trainer.train()
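
For context, save_strat and steps_per_save above are derived from the saves_per_epoch param shown further down (null means save only at the end of training). A simplified sketch of that derivation, not the exact code from my script:

# Simplified sketch (not verbatim): derive the save/eval strategy from
# params['training']['saves_per_epoch'].
saves_per_epoch = params['training']['saves_per_epoch']
if saves_per_epoch is None:
    # null -> save/evaluate only at the end of each (here, the only) epoch
    save_strat = 'epoch'
    steps_per_save = None
else:
    save_strat = 'steps'
    # steps_per_epoch stands in for a value computed elsewhere from the
    # dataset size and the effective batch size
    steps_per_save = max(1, steps_per_epoch // saves_per_epoch)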

The params are retrieved from a separate file using DVC (I am using DVCLive):

training:
  keep_only_extremes: false # equivalent to data.keep_only_extremes, except the filtering happens before train time instead of in creating the saved dataset
  max_length: 250
  epochs: 1
  per_device_batch_size: 1
  auto_find_batch_size: false
  learning_rate: 1e-4
  gradient_accumulation: 1
  gradient_checkpointing: true
  saves_per_epoch: null # null means save only at the end of training
  lr_scheduler_type: 'linear'
  warmup_ratio: 0.1
  label_smoothing_factor: 0.0
  optim: "adafactor"
  optim_args: "scale_parameter=False,relative_step=False"
  fp16: true
  max_eval_examples: 500    # only during training
  dev_sample_data: 20
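
For context, the params dict used in the training arguments is just this file loaded into Python, roughly as below (illustrative; the real script resolves it through the DVC pipeline, and params.yaml is the DVC default filename, not necessarily mine):

import yaml

# Illustrative only: load the DVC params file so params['training'][...]
# matches the keys referenced in the training arguments above.
with open('params.yaml') as f:
    params = yaml.safe_load(f)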

The model itself takes up about 25% of GPU memory.
I'm running the script via python train.py.
When I train on a single GPU, I see about 40% memory usage.
If I change nothing except make multiple GPUs available to the job, memory usage explodes. I'm skeptical that communication overhead alone accounts for more than 50% of GPU memory.
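
For reference, the percentages above correspond to roughly this kind of per-device check (illustrative snippet, not part of the training script):

import torch

# Illustrative: snapshot how much of each GPU's memory is currently allocated.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i)
    total = torch.cuda.get_device_properties(i).total_memory
    print(f"cuda:{i}: {allocated / total:.1%} of {total / 1e9:.0f} GB allocated")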

I have tried accelerate launch (though I was under the impression that using transformers' Trainer class avoids the need for it) and got a weird, seemingly unrelated error from DVC.

Any help is appreciated.