Hi everyone.
I am trying to use Accelerate to replicate the behaviour of a training run with the Trainer class. The training time is similar to the one with Trainer, but the performance is much worse: a solid 70% accuracy with Trainer versus around 35% with Accelerate.
I understand that the logic behind Trainer is much more complex than the one behind my Accelerate loop, so I expected the metrics to differ a little, but not this much.
I use the same optimizer, lr_scheduler and hyperparameters (and even the same seed) as I was using with Trainer, but the results are nowhere close.
The dataset is a custom one, the task is text classification, and I have 4 V100 GPUs.
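In case it matters for reproducing this, I seed everything the same way in both runs, roughly like this (a simplified sketch; config['seed'] is just an illustrative key name, not necessarily the exact one in my script):

from accelerate.utils import set_seed

# same seed as the Trainer run; set_seed seeds Python's random, NumPy and
# torch (CPU and CUDA) in every process
set_seed(config['seed'])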
My training loop is this one:
for epoch in range(config['num_train_epochs']):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        loss = loss / config['gradient_accumulation_steps']
        accelerator.backward(loss)
        if step % config['gradient_accumulation_steps'] == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
        if step % 10 == 0:
            accelerator.print(f'loss: {loss}')

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        metric.add_batch(
            predictions=accelerator.gather(predictions),
            references=accelerator.gather(batch['labels'])
        )

    eval_metric = metric.compute()
    accelerator.print(f'epoch {epoch}: {eval_metric}')
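For completeness, the objects used in the loop are created and wrapped beforehand, roughly like this (a simplified sketch; variable names are illustrative and the usual model/tokenizer/dataloader construction is omitted):

from accelerate import Accelerator
from tqdm.auto import tqdm

accelerator = Accelerator()  # fp16 and multi-GPU settings come from the accelerate config below

# wrap model, optimizer and dataloaders so Accelerate handles device placement and distribution
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# lr_scheduler is built from this optimizer; the progress bar only renders on the main process
num_update_steps = config['num_train_epochs'] * len(train_dataloader) // config['gradient_accumulation_steps']
progress_bar = tqdm(range(num_update_steps), disable=not accelerator.is_local_main_process)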
My config file is the following:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 4
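As a sanity check that the launcher actually picks this config up, I print the accelerator state at the start of the run (small sketch):

# printed once per process: distributed type, number of processes and device
accelerator.print(accelerator.state)
accelerator.print('device:', accelerator.device, '| processes:', accelerator.num_processes)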
I have also tried nn.DataParallel, which gives the same results as Trainer, but with much less stable training.
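The nn.DataParallel comparison was nothing fancy, just the standard single-process wrapping, roughly (a sketch, no Accelerate involved):

import torch
import torch.nn as nn

device = torch.device('cuda')
model = nn.DataParallel(model.to(device))  # replicates the model across the 4 GPUs at each forward pass

# then a plain PyTorch loop: loss.backward(), optimizer.step(), etc., with batches moved to device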
Thank you very much for your help.
UPDATE:
Adding accelerator.clip_grad_norm_(model.parameters(), max_norm=1) right after the accelerator.backward(loss) line gets me to 42% accuracy. It’s still far from the Trainer result, but it’s a small improvement.
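Concretely, the only change to the loop above is this (if I read the defaults right, Trainer applies the same clipping with max_grad_norm=1.0):

outputs = model(**batch)
loss = outputs.loss / config['gradient_accumulation_steps']
accelerator.backward(loss)
# added: clip gradients the same way Trainer does by default
accelerator.clip_grad_norm_(model.parameters(), max_norm=1)
# the existing optimizer.step() / lr_scheduler.step() / zero_grad() block runs unchanged after this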
UPDATE2:
I have tried running the script with python train.py
instead of accelerate launch run_glue_no_trainer.py
(meaning it only uses 1 GPU) and I got 70% accuracy (the same as with the Trainer class), but obviously it takes more time. So I guess the ‘problem’ is in the Accelerate library (it is not easy to generalize parallel distribution for every training setup with just a few lines). I’m going to try DistributedDataParallel to see whether I get the same results as with Accelerate or as with Trainer.
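For reference, the DistributedDataParallel test I have in mind is roughly this (a sketch I haven’t run yet, launched with torchrun --nproc_per_node=4 train.py; train_dataset and the batch-size key are illustrative names):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)

model.to(local_rank)
model = DDP(model, device_ids=[local_rank])

# each process gets its own shard of the data
train_sampler = DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, batch_size=config['per_device_train_batch_size'],
                              sampler=train_sampler)

# then the same loop as above, with loss.backward() instead of accelerator.backward(loss)
# (and train_sampler.set_epoch(epoch) at the start of every epoch)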
Any hints are still welcome.