Problem: CUDA memory error EXCLUSIVELY when using multiple GPUs
Background:
Custom training script and dataset, multiple A40 GPUs, Seq2SeqTrainer training a T5 model; accelerate is installed.
Here are the arguments and training portions of the script. Happy to share the preprocessing steps too, but I don't think they're related.
args = Seq2SeqTrainingArguments(
    output_dir='./data/nomelt-model/model',
    # Trainer meta parameters
    log_level='debug',
    do_train=True,
    do_eval=True,
    evaluation_strategy=save_strat,
    eval_steps=steps_per_save,
    prediction_loss_only=True,
    save_strategy=save_strat,
    save_steps=steps_per_save,
    save_total_limit=20,
    logging_strategy='steps',
    logging_steps=1,
    predict_with_generate=True,
    generation_max_length=params['training']['max_length'],
    load_best_model_at_end=True,
    # training parameters
    num_train_epochs=params['training']['epochs'],
    # batches
    per_device_train_batch_size=params['training']['per_device_batch_size'],
    per_device_eval_batch_size=params['training']['per_device_batch_size'],
    gradient_accumulation_steps=params['training']['gradient_accumulation'],
    gradient_checkpointing=params['training']['gradient_checkpointing'],
    auto_find_batch_size=params['training']['auto_find_batch_size'],
    # optimizer
    learning_rate=params['training']['learning_rate'],
    lr_scheduler_type=params['training']['lr_scheduler_type'],
    warmup_ratio=params['training']['warmup_ratio'],
    optim=params['training']['optim'],
    optim_args=params['training']['optim_args'],
    label_smoothing_factor=params['training']['label_smoothing_factor'],
    # precision
    fp16=params['training']['fp16'],
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['eval_sample'] if 'eval_sample' in dataset else dataset['eval'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
The params are retrieved from a separate params file using DVC (I am using DVCLive); a rough sketch of how they get loaded is included after the YAML:
training:
  keep_only_extremes: false # equivalent to data.keep_only_extremes, except the filtering happens before train time instead of when creating the saved dataset
  max_length: 250
  epochs: 1
  per_device_batch_size: 1
  auto_find_batch_size: false
  learning_rate: 1e-4
  gradient_accumulation: 1
  gradient_checkpointing: true
  saves_per_epoch: null # null means save only at the end of training
  lr_scheduler_type: 'linear'
  warmup_ratio: 0.1
  label_smoothing_factor: 0.0
  optim: "adafactor"
  optim_args: "scale_parameter=False,relative_step=False"
  fp16: true
  max_eval_examples: 500 # only during training
  dev_sample_data: 20
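For context, the params just end up as a nested dict in the script. This is only a minimal sketch of the loading step (my real script goes through DVC rather than reading the YAML directly, and the file name here is an assumption), included so you can see how the training args above index into it:

import yaml

# Sketch: load the params file into a nested dict (actual loading goes through DVC).
with open('params.yaml') as f:
    params = yaml.safe_load(f)

# The training args then pull values like this:
print(params['training']['per_device_batch_size'])  # -> 1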
The model itself takes up about 25% of GPU memory.
I am running the script via python train.py.
When I run training on a single GPU, I see about 40% memory usage.
If I change nothing except make multiple GPUs available to the job, memory explodes and I hit the CUDA memory error. I am skeptical that communication overhead alone accounts for >50% of GPU memory.
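For reference, here is roughly how per-device usage can be checked from inside the process. This is just an illustrative sketch using standard torch.cuda calls, not necessarily how I measured the percentages above:

import torch

# Sketch: print the fraction of each GPU's total memory that this process's
# caching allocator has reserved/allocated.
for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory
    reserved = torch.cuda.memory_reserved(i)
    allocated = torch.cuda.memory_allocated(i)
    print(f"cuda:{i} reserved {reserved / total:.0%}, allocated {allocated / total:.0%}")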
I have also tried accelerate launch (though I was under the impression that using the transformers Trainer class avoids the need for it) and got a weird, seemingly unrelated error from DVC.
Any help is appreciated.