Worse performance using Accelerate

Hi there,

I am trying to use Accelerate to distribute my training across multiple GPUs, and I noticed that the training loss wasn't decreasing as fast as it should. To debug, I ran accelerate config again to create the most basic configuration possible, without distributed training. Here it is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '3'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
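
For reference, I launch the script with this config roughly like the following (the config and script filenames here are just placeholders):

accelerate launch --config_file default_config.yaml train.py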

I am still facing the same problem: the training loss decreases much more slowly than in a version of my script that uses only PyTorch. Here is the first version of the training script I used with Accelerate:

for epoch in range(1):

    model.train()
    dropout_modules = [module for module in model.lang_encoder.modules() if isinstance(module,torch.nn.Dropout)]
    [module.eval() for module in dropout_modules] # disable dropout
    accelerator.print(f"Disabled {len(dropout_modules)} Dropout modules")

    for i, batch in enumerate(tqdm(llavar_dataloader, disable=not accelerator.is_local_main_process)):

        with accelerator.accumulate(model):

            outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(device), labels=batch["labels"].to(device), device=device)
            loss = outputs.loss
            if accelerator.is_local_main_process:
                total_loss += float(loss.item())
            accelerator.backward(loss)

            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            model.zero_grad()

            if ((i+1) % accelerator.gradient_accumulation_steps == 0 or i+1 == len(llavar_dataloader)) and accelerator.is_local_main_process:
                with open("llavar_"+args.file, "a") as f:
                    f.write(f"Epoch: {epoch}, Loss: {total_loss/args.acc}\n")
                total_loss = 0 # Reset accumulated loss

Here is the problem: in the first iteration, the training loss of both versions (Accelerate and PyTorch) is exactly the same. After that, the loss decreases much faster in the pure PyTorch script. Note that I am seeding both runs, which is why the first iteration matches exactly…
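
For comparison, the gradient accumulation pattern from the Accelerate docs, as I understand it, looks roughly like the sketch below, assuming the Accelerator is created with gradient_accumulation_steps=args.acc and the optimizer comes out of accelerator.prepare (so that optimizer.step() is skipped on non-sync steps):

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=args.acc)
model, optimizer, llavar_dataloader = accelerator.prepare(model, optimizer, llavar_dataloader)

for batch in llavar_dataloader:
    with accelerator.accumulate(model):
        outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(device), labels=batch["labels"].to(device), device=device)
        # backward() is supposed to scale the loss by 1/gradient_accumulation_steps itself
        accelerator.backward(outputs.loss)
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()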

To debug, I decided to remove the accelerator.accumulate call and build my own gradient accumulation flow, like this:

for epoch in range(1):

    model.train()
    dropout_modules = [module for module in model.lang_encoder.modules() if isinstance(module,torch.nn.Dropout)]
    [module.eval() for module in dropout_modules] # disable dropout
    accelerator.print(f"Disabled {len(dropout_modules)} Dropout modules")

    for i, batch in enumerate(tqdm(llavar_dataloader, disable=not accelerator.is_local_main_process)):

        outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(device), labels=batch["labels"].to(device), device=device)
        loss = outputs.loss
        if accelerator.is_local_main_process:
            total_loss += float(loss.item())
        accelerator.backward(loss / args.acc)

        if ((i+1) % accelerator.gradient_accumulation_steps == 0 or i+1 == len(llavar_dataloader)) and accelerator.is_local_main_process:
            with open("llavar_"+args.file, "a") as f:
                f.write(f"Epoch: {epoch}, Loss: {total_loss/args.acc}\n")
            total_loss = 0 # Reset accumulated loss

            accelerator.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            model.zero_grad()

and the results were very similar to those from my PyTorch script with gradient clipping removed. I also tried commenting out the line accelerator.clip_grad_norm_(model.parameters(), 1.0), and the results were exactly the same. Note that adding gradient clipping in my PyTorch script changes the results a bit, and there I also apply it right before calling optimizer.step().

It looks like this issue has something to do with how gradients are accumulated/clipped, but I can’t seem to pinpoint the problem…
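
To try to narrow this down, my next step is to log the total gradient norm right before each optimizer.step() in both scripts and compare them step by step. A minimal helper for that (the function name is mine):

import torch

def total_grad_norm(model):
    # L2 norm over all parameter gradients, computed the same way in both scripts
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item()

# right before optimizer.step() in either script:
# print(f"step {i}: grad norm = {total_grad_norm(model):.6f}")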

UPDATE: switching accelerator.clip_grad_norm_ for the regular torch.nn.utils.clip_grad_norm_ does nothing to the training loss. With Accelerate, any form of gradient clipping (or none at all) yields exactly the same results, and those results match my pure PyTorch script without gradient clipping. Adding gradient clipping to the PyTorch script, however, further improves the training loss…
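
One thing I still want to verify is whether the clipping threshold is ever actually hit: torch.nn.utils.clip_grad_norm_ returns the total norm it computed before clipping, so it can be logged right before optimizer.step(), e.g.:

# clip_grad_norm_ returns the total gradient norm measured before clipping
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
accelerator.print(f"step {i}: total grad norm before clipping = {float(grad_norm):.4f}")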

Also, here is my pure PyTorch training loop/script:

for epoch in range(1):

    model.train()
    dropout_modules = [module for module in model.lang_encoder.modules() if isinstance(module,torch.nn.Dropout)]
    [module.eval() for module in dropout_modules] # disable dropout
    print(f"Disabled {len(dropout_modules)} Dropout modules")

    for i, batch in enumerate(tqdm(coco_dataloader)):
        # batch["vision_x"], batch["lang_x"] and batch["labels"] hold the image and text tensors
        outputs = model(vision_x=batch["vision_x"], lang_x=batch["lang_x"].to(f"cuda:{args.cuda}"), labels=batch["labels"].to(f"cuda:{args.cuda}"), device=f"cuda:{args.cuda}")
        loss = outputs.loss / num_accumulation_steps
        loss.backward()

        if (i+1) % num_accumulation_steps != 0:
            total_loss = loss.item() + total_loss
        # done once every num_accumulation_steps (5) steps
        elif (i+1) % num_accumulation_steps == 0:
            # Normalize loss
            total_loss = (total_loss + loss.item())
            # parameters updated
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            optimizer.step()
            # Reset gradient tensors
            model.zero_grad()
        
            with open("llavar_"+args.file, "a") as f:
                f.write(f"Epoch: {epoch}, Loss: {total_loss}\n")

            total_loss=0
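
One more sanity check I still plan to add on the Accelerate side, since the logging divides by args.acc while the stepping logic uses accelerator.gradient_accumulation_steps, is to make sure the two values actually agree:

# make sure the value used for loss averaging matches what Accelerate accumulates over
assert args.acc == accelerator.gradient_accumulation_steps, (
    f"args.acc={args.acc}, accelerator.gradient_accumulation_steps="
    f"{accelerator.gradient_accumulation_steps}"
)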