Decreasing performance when using Accelerate

Hi everyone.

I am trying to use Accelerate to replicate a training run that I previously did with the Trainer class. Training time is similar to what I get with Trainer, but accuracy is much worse: a solid 70% with Trainer versus around 35% with Accelerate.

I understand that the logic behind Trainer is much more complex than the logic in my Accelerate loop, so I expected the metrics to differ a little, but not this much.

I use the same optimizer, lr_scheduler and hyperparameters (and even the same seed) as I was using with Trainer, but the results are nowhere close.

I use a custom dataset, the task is text classification, and I have 4 V100 GPUs.

My training loop is this one:

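for epoch in range(config['num_train_epochs']):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        # Scale the loss so the accumulated gradients average over the micro-batches.
        loss = loss / config['gradient_accumulation_steps']
        accelerator.backward(loss)
        if step % config['gradient_accumulation_steps'] == 0:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
        if step % 10 == 0:
            accelerator.print(f'loss: {loss}')

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        # Collect predictions and labels from every process before scoring.
        metric.add_batch(
            predictions=accelerator.gather(predictions),
            references=accelerator.gather(batch['labels']),
        )

    eval_metric = metric.compute()
    accelerator.print(f'epoch {epoch}: {eval_metric}')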
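
A side note on the accumulation condition in the loop above: the small sketch below lists the batch indices at which the optimizer would step for a given condition. The helper `update_steps` and the numbers (8 batches, 4 accumulation steps) are made up for illustration and are not part of the training script. It only shows that `step % n == 0` fires on the very first batch, while `(step + 1) % n == 0` waits for a full group of batches; whether this matters for the accuracy gap is not established here.

```python
def update_steps(num_batches, accumulation_steps, shift=0):
    """Return the batch indices at which the optimizer would step.

    shift=0 models `step % accumulation_steps == 0`,
    shift=1 models `(step + 1) % accumulation_steps == 0`.
    """
    return [
        step
        for step in range(num_batches)
        if (step + shift) % accumulation_steps == 0
    ]


# Hypothetical values: 8 batches per epoch, accumulate over 4 batches.
as_posted = update_steps(8, 4)         # condition used in the loop above
shifted = update_steps(8, 4, shift=1)  # common alternative

print(as_posted)  # [0, 4]
print(shifted)    # [3, 7]
```

With the first condition, the update at step 0 uses a single micro-batch, and the last three batches of the epoch are carried over into the next one.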

My config file is the following:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 4

I have also tried nn.DataParallel, which achieves the same results as Trainer, but with much less stable training.

Thank you very much for your help :slight_smile:


By adding accelerator.clip_grad_norm_(model.parameters(), max_norm=1) after the accelerator.backward(loss) line I get 42% accuracy. That is still far from the Trainer result, but it’s a small improvement.
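
For readers unfamiliar with what clipping by norm does, here is a minimal pure-Python sketch of the arithmetic. It mirrors the global-norm rule used by torch.nn.utils.clip_grad_norm_ (scale every gradient by max_norm / (total_norm + 1e-6), capped at 1), but it is an illustration only: the function `clip_by_global_norm` and the flat list of floats are assumptions made for the example, not Accelerate's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the rescaled values and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale never exceeds 1, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


# Hypothetical gradient with L2 norm 5, clipped to max_norm=1.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# A gradient that is already within the limit is returned unchanged.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

The direction of the gradient is preserved; only its length is reduced, which limits the size of any single update.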


I have tried running the script with python instead of accelerate launch (meaning it only uses 1 GPU) and I got 70% accuracy (the same as with the Trainer class), although it (obviously) took more time. So I guess the ‘problem’ is in the Accelerate library (it is not easy to generalize parallel distribution for every training with just a few lines :slight_smile: ). I’m going to try DistributedDataParallel to see whether I get the same results as with Accelerate or as with Trainer.
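
One difference between the 1-GPU and 4-GPU runs that may be worth ruling out is the effective batch size. With one process per GPU, the batch size given to the DataLoader applies per process, so the same script sees a larger effective batch and takes fewer optimizer steps per epoch. The sketch below only does that arithmetic; the helper `steps_per_epoch` and the numbers (8000 samples, batch size 16) are hypothetical, and it is not confirmed that this explains the gap.

```python
import math


def steps_per_epoch(num_samples, per_device_batch, num_processes, accumulation_steps=1):
    """Return (effective batch size, optimizer steps per epoch) for a data-parallel run."""
    effective_batch = per_device_batch * num_processes * accumulation_steps
    return effective_batch, math.ceil(num_samples / effective_batch)


# Hypothetical dataset of 8000 samples with a per-device batch size of 16.
single = steps_per_epoch(8000, 16, num_processes=1)
multi = steps_per_epoch(8000, 16, num_processes=4)

print(single)  # (16, 500)
print(multi)   # (64, 125)
```

If the learning rate and the scheduler length were tuned for the single-process case, four times fewer updates with a four times larger batch can behave quite differently.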

Any hints are still welcome :slight_smile:

Interesting. First things first: is your training loss roughly the same or not? This would tell us whether the problem is in the training itself or in the evaluation code.

Are you passing the eval_dataloader to accelerator.prepare? Could you try without passing it and just running a normal evaluation (it won’t be distributed but all GPUs will evaluate on the whole dataset) to see if the problem comes from there for some reason?
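
To make the two evaluation modes concrete, here is a simplified pure-Python model of what sharding an evaluation set across processes can look like. It is a sketch under assumptions: the helper `shard_indices` is hypothetical, it shards sample by sample with wrap-around padding, and the real sharding in Accelerate works on batches and may differ in detail. The point is only that a sharded evaluation can contain a few repeated samples once predictions are gathered, whereas an unprepared dataloader lets every process see each sample exactly once.

```python
def shard_indices(num_samples, num_processes):
    """Round-robin sharding that pads by wrapping around to the first samples,
    so that every process receives the same number of indices."""
    per_process = -(-num_samples // num_processes)  # ceiling division
    padded = [i % num_samples for i in range(per_process * num_processes)]
    return [padded[rank::num_processes] for rank in range(num_processes)]


# Hypothetical evaluation set of 10 samples on 4 processes.
shards = shard_indices(10, 4)
gathered = [i for shard in shards for i in shard]
duplicates = len(gathered) - len(set(gathered))

print(shards)      # [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
print(duplicates)  # 2
```

On a large evaluation set a couple of repeated samples barely move the metric, so a big difference between the two modes would point to something else in the evaluation code.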

Very interested in knowing more about that bug.