Cannot use multiple GPUs

Problem: a CUDA memory error that occurs only when using multiple GPUs

Background:
Custom training script and dataset; I have multiple A40 GPUs.
I'm training T5 with Seq2SeqTrainer, and accelerate is installed.

Here are the arguments and training portions of the script. Happy to share the preprocessing steps, but I don't think they're related.

args = Seq2SeqTrainingArguments(
        output_dir='./data/nomelt-model/model',
        # Trainer meta parameters
        log_level='debug',
        do_train=True,
        do_eval=True,
        evaluation_strategy=save_strat,
        eval_steps=steps_per_save,
        prediction_loss_only=True,
        save_strategy=save_strat,
        save_steps=steps_per_save,
        save_total_limit=20,
        logging_strategy='steps',
        logging_steps=1,
        predict_with_generate=True,
        generation_max_length=params['training']['max_length'],
        load_best_model_at_end=True,
        # training parameters
        num_train_epochs=params['training']['epochs'],
        # batches
        per_device_train_batch_size=params['training']['per_device_batch_size'],
        per_device_eval_batch_size=params['training']['per_device_batch_size'],
        gradient_accumulation_steps=params['training']['gradient_accumulation'],
        gradient_checkpointing=params['training']['gradient_checkpointing'],
        auto_find_batch_size=params['training']['auto_find_batch_size'],
        # optimizer
        learning_rate=params['training']['learning_rate'],
        lr_scheduler_type=params['training']['lr_scheduler_type'],
        warmup_ratio=params['training']['warmup_ratio'],
        optim=params['training']['optim'],
        optim_args=params['training']['optim_args'],
        label_smoothing_factor=params['training']['label_smoothing_factor'],
        # precision
        fp16=params['training']['fp16'],
    )

trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=dataset['train'],
        eval_dataset=dataset['eval_sample'] if 'eval_sample' in dataset else dataset['eval'],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
trainer.train()
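
For context, save_strat and steps_per_save above are derived from the saves_per_epoch param shown further down (null means save only at the end of training). A simplified sketch of that derivation, not the exact code from my script:

# Simplified sketch (not verbatim): derive the save/eval strategy from
# params['training']['saves_per_epoch'].
saves_per_epoch = params['training']['saves_per_epoch']
if saves_per_epoch is None:
    # null -> save/evaluate only at the end of each (here, the only) epoch
    save_strat = 'epoch'
    steps_per_save = None
else:
    save_strat = 'steps'
    # steps_per_epoch stands in for a value computed elsewhere from the
    # dataset size and the effective batch size
    steps_per_save = max(1, steps_per_epoch // saves_per_epoch)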

The params are retrieved from a separate file using DVC (I am using DVCLive):

training:
  keep_only_extremes: false # equivalent to data.keep_only_extremes, except the filtering happens before train time instead of in creating the saved dataset
  max_length: 250
  epochs: 1
  per_device_batch_size: 1
  auto_find_batch_size: false
  learning_rate: 1e-4
  gradient_accumulation: 1
  gradient_checkpointing: true
  saves_per_epoch: null # null means save only at the end of training
  lr_scheduler_type: 'linear'
  warmup_ratio: 0.1
  label_smoothing_factor: 0.0
  optim: "adafactor"
  optim_args: "scale_parameter=False,relative_step=False"
  fp16: true
  max_eval_examples: 500    # only during training
  dev_sample_data: 20
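
For context, the params dict used in the training arguments is just this file loaded into Python, roughly as below (illustrative; the real script resolves it through the DVC pipeline, and params.yaml is the DVC default filename, not necessarily mine):

import yaml

# Illustrative only: load the DVC params file so params['training'][...]
# matches the keys referenced in the training arguments above.
with open('params.yaml') as f:
    params = yaml.safe_load(f)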

The model itself takes up about 25% of GPU memory.
I'm running the script via python train.py.
When I train on a single GPU, I see about 40% memory usage.
If I change nothing except make multiple GPUs available to the job, memory usage explodes. I'm skeptical that communication overhead alone accounts for more than 50% of GPU memory.
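
For reference, the percentages above correspond to roughly this kind of per-device check (illustrative snippet, not part of the training script):

import torch

# Illustrative: snapshot how much of each GPU's memory is currently allocated.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i)
    total = torch.cuda.get_device_properties(i).total_memory
    print(f"cuda:{i}: {allocated / total:.1%} of {total / 1e9:.0f} GB allocated")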

I have tried accelerate launch (though I was under the impression that using transformers' Trainer class avoids the need for it) and got a weird, seemingly unrelated error from DVC.

Any help is appreciated.