How to collect the accuracy when running multi GPU model with accelerate?

Dear all,

When using accelerate to speed up my training, I have the problem that the calculated accuracy is far away from what I get when running the same model on one graphics card.

network, optimizer, trainData, validationData, scheduler = accelerator.prepare(
    network, optimizer, trainData, validationData, None
)

for epoch in range(args["epochs"]):

    # Set the model in training mode
    network.train()

    trainRunningLoss = 0.0
    trainRunningCorrect = 0

    for i, (images, labels) in enumerate(trainData):

        optimizer.zero_grad()

        # Forward pass
        outputs = network(images)

        # Loss calculation
        loss = criterion(outputs, labels)
        trainRunningLoss += loss.item()

        # Calculate the accuracy
        _, preds = torch.max(outputs.data, 1)
        trainRunningCorrect += (preds == labels).sum().item()

        # Backpropagation
        accelerator.backward(loss)

        # Updating the optimizer parameters
        optimizer.step()

    # Loss and accuracy for the complete epoch
    history['train loss'].append(trainRunningLoss)
    history['train acc'].append(100 * trainRunningCorrect / len(trainData.dataset.samples))

Running the code above I get four (4) values for the accuracy (one per GPU), each of which is ~ 24% (times 4 ~ 96% accuracy, which is very close to what I get when running the same model on one GPU).

For instance:

[INFO] Epoch: 2/10, Loss: 172.9853, Accuracy: 22.6389, time elapsed: 24.15 …
[INFO] Epoch: 2/10, Loss: 303.3730, Accuracy: 22.5694, time elapsed: 24.23 …
[INFO] Epoch: 2/10, Loss: 325.4448, Accuracy: 22.2222, time elapsed: 24.22 …
[INFO] Epoch: 2/10, Loss: 429.7866, Accuracy: 22.2222, time elapsed: 24.22 …
…

How do I get rid of this problem?

I’d recommend reading the tutorial guide on metrics. You need to gather via gather_for_metrics 🙂 See the “Calculating Metrics” section of this: Learning how to incorporate 🤗 Accelerate features quickly!

For everybody with a similar problem: here is the link to a useful tutorial (navigating Hugging Face’s website can be misleading sometimes). But yes:

gather_for_metrics

is what is needed.


I forgot: in case you want to print the result after every epoch and you are getting “n” identical outputs, where “n” is the number of your GPUs, there is a print method on the accelerator itself:

accelerator.print(f"Epoch: {epoch}, accuracy: {<your metrics>}")

which prints only on the main process, so you get a single output per epoch.
