Question/bug about accelerator.gather (how to use accelerator.gather for contrastive learning)

This discussion has also been posted at https://github.com/huggingface/accelerate/issues/1154

Hi there.

I am new to accelerate and I’ve found that it really improves my development productivity. Thanks for your great work.

However, I have run into a problem when using accelerator.gather.

I trained a simple ResNet-18 classifier on the CIFAR-10 dataset. The training loop is:

for idx, (inputs, targets) in enumerate(train_loader):
    outputs = net(inputs)

    # ********************** loss plan 1 **********************
    loss = criterion(outputs, targets)
    # ********************** loss plan 1 **********************

    # ********************** loss plan 2 **********************
    # out_gather = accelerator.gather(outputs)
    # tar_gather = accelerator.gather(targets)
    # loss = criterion(out_gather, tar_gather)
    # ********************** loss plan 2 **********************

    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()

The code above works well and the training accuracy reaches about 70% after 10 epochs.

But there is a problem when I train as follows:

for idx, (inputs, targets) in enumerate(train_loader):
    outputs = net(inputs)

    # ********************** loss plan 1 **********************
    # loss = criterion(outputs, targets)
    # ********************** loss plan 1 **********************

    # ********************** loss plan 2 **********************
    out_gather = accelerator.gather(outputs)
    tar_gather = accelerator.gather(targets)
    loss = criterion(out_gather, tar_gather)
    # ********************** loss plan 2 **********************

    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()

The training loss barely changes, and the training accuracy stays at about 10%, which is equivalent to random guessing on CIFAR-10 (10 classes).

The code above may look odd, but I don't see why it should be wrong; yet, empirically, it is.
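In case it helps with diagnosis, here is a small sanity check I would run right after accelerator.backward(loss) (my own sketch, using the net and loss from the loop above): if loss plan 2 leaves every parameter with a (near-)zero gradient while plan 1 does not, that would suggest the gathered tensors are detached from the autograd graph.

# Sanity check: inspect the total gradient norm right after
# accelerator.backward(loss). Near-zero norms under loss plan 2 would
# mean no gradient is flowing back through the gathered outputs.
grad_sq = 0.0
for name, p in net.named_parameters():
    if p.grad is not None:
        grad_sq += p.grad.detach().norm().item() ** 2
print(f"total grad norm: {grad_sq ** 0.5:.6f}")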

(The reason I'm doing this is that I want to use accelerate for contrastive learning. In contrastive learning, the larger the batch size the better, because each sample in the batch uses every other sample in the batch as a negative example when computing the loss. For example, when training on four GPUs with a per-GPU batch size of 64, I want each sample to be compared against 64 * 4 - 1 = 255 negatives rather than 64 - 1 = 63. That is why I need accelerator.gather; a sketch of the kind of loss I mean follows.)
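To make that concrete, here is a rough sketch of the in-batch-negatives (InfoNCE-style) loss I have in mind. The names info_nce, emb_a, and emb_b are made up for illustration and are not part of my actual code:

import torch
import torch.nn.functional as F

def info_nce(emb_a, emb_b, temperature=0.1):
    # emb_a, emb_b: (global_batch, dim) embeddings of two augmented views,
    # e.g. obtained by gathering the per-GPU embeddings across processes.
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    # Row i's positive is column i; every other column in the row acts as
    # a negative, so a larger global batch means more negatives per sample.
    logits = emb_a @ emb_b.t() / temperature
    labels = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(logits, labels)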

The full code is as follows (it works for loss plan 1 but not for loss plan 2):

# main.py
# CUDA_VISIBLE_DEVICES="0,1,2,3" accelerate launch --multi_gpu main.py

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from accelerate import Accelerator

accelerator = Accelerator()

BATCH_SIZE = 256
EPOCHS = 10

if __name__ == "__main__":

    device = accelerator.device

    net = torchvision.models.resnet18(pretrained=False, num_classes=10)

    trainset = torchvision.datasets.CIFAR10(
        root="./data",
        train=True,
        download=True,
        transform=transforms.Compose(
            [
                transforms.RandomCrop(32, padding=4),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize(
                    (0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)
                ),
            ]
        ),
    )

    train_loader = torch.utils.data.DataLoader(
        trainset,
        batch_size=BATCH_SIZE,
        num_workers=4,
        pin_memory=True,
        shuffle=True
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        net.parameters(),
        lr=0.01 * 2,
        momentum=0.9,
        weight_decay=0.0001,
        nesterov=True,
    )

    net, optimizer, train_loader = accelerator.prepare(net, optimizer, train_loader)

    net.train()
    for ep in range(1, EPOCHS + 1):
        train_loss = correct = total = 0

        for idx, (inputs, targets) in enumerate(train_loader):
            outputs = net(inputs)

            # ********************** loss plan 1 **********************
            # loss = criterion(outputs, targets)
            # ********************** loss plan 1 **********************

            # ********************** loss plan 2 **********************
            out_gather = accelerator.gather(outputs)
            tar_gather = accelerator.gather(targets)
            loss = criterion(out_gather, tar_gather)
            # ********************** loss plan 2 **********************

            optimizer.zero_grad()
            accelerator.backward(loss)
            optimizer.step()

            train_loss += loss.item()
            total += targets.size(0)
            correct += torch.eq(outputs.argmax(dim=1), targets).sum().item()

            print(
                "   == step: [{:3}/{}] [{}/{}] | loss: {:.3f} | acc: {:6.3f}%".format(
                    idx + 1,
                    len(train_loader),
                    ep,
                    EPOCHS,
                    train_loss / (idx + 1),
                    100.0 * correct / total,
                )
            )

I’m wondering where I’m going wrong with my code, or how I should use accelerator.gather correctly.
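One direction I have been wondering about, though this is only an assumption on my part: if accelerator.gather behaves like a plain all_gather and returns tensors that are detached from the autograd graph, then an autograd-aware gather such as torch.distributed.nn.functional.all_gather might be needed for the outputs. A sketch of what I mean (untested):

import torch
import torch.distributed.nn.functional as dist_fn

# Hypothetical variant of loss plan 2: gather the outputs with the
# autograd-aware all_gather so gradients can flow back to the model;
# the integer labels carry no gradient, so accelerator.gather is fine.
out_gather = torch.cat(dist_fn.all_gather(outputs), dim=0)
tar_gather = accelerator.gather(targets)
loss = criterion(out_gather, tar_gather)

But I am not sure whether this is the intended approach, so any guidance would be appreciated.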

Thanks a lot.

I am also facing the same problem! Waiting for a reply.
