Dear all,
When using accelerate to speed up my training, I have the problem that the calculated accuracy is far from what I get when running the same model on a single GPU.
```python
network, optimizer, trainData, validationData, scheduler = accelerator.prepare(
    network, optimizer, trainData, validationData, None
)

for epoch in range(args["epochs"]):
    # Set the model in training mode
    network.train()
    trainRunningLoss = 0.0
    trainRunningCorrect = 0
    for i, (images, labels) in enumerate(trainData):
        optimizer.zero_grad()
        # Forward pass
        outputs = network(images)
        # Loss calculation
        loss = criterion(outputs, labels)
        # Calculate the accuracy
        trainRunningLoss += loss.item()
        _, preds = torch.max(outputs.data, 1)
        trainRunningCorrect += (preds == labels).sum().item()
        # Backpropagation
        accelerator.backward(loss)
        # Updating the optimizer parameters
        optimizer.step()
    # Loss and accuracy for the complete epoch
    history['train loss'].append(trainRunningLoss)
    history['train acc'].append(100 * trainRunningCorrect / len(trainData.dataset.samples))
```
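If I understand correctly, `accelerator.prepare` shards the dataloader, so on 4 GPUs each process iterates over roughly a quarter of the batches while `trainData.dataset` is still the full dataset. This can be checked with something like:

```python
# Each process should report ~1/4 of the single-GPU batch count on 4 GPUs,
# while the underlying dataset length stays the same on every process.
print(accelerator.process_index, len(trainData), len(trainData.dataset))
```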
Running the training loop above, I get four values for the accuracy (one per GPU), each around 24% (times 4 ≈ 96%, which is very close to what I get when running the same model on one GPU).
For instance:
[INFO] Epoch: 2/10, Loss: 172.9853, Accuracy: 22.6389, time elapsed: 24.15 …
[INFO] Epoch: 2/10, Loss: 303.3730, Accuracy: 22.5694, time elapsed: 24.23 …
[INFO] Epoch: 2/10, Loss: 325.4448, Accuracy: 22.2222, time elapsed: 24.22 …
[INFO] Epoch: 2/10, Loss: 429.7866, Accuracy: 22.2222, time elapsed: 24.22 …
…
How do I get rid of this problem?
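My guess is that each process only counts the correct predictions of its own shard but divides by the full dataset length, which would explain the ~24% per process. Would gathering the counts across processes be the right fix? A minimal sketch of what I have in mind (untested; `trainRunningTotal` is a new counter I would initialise to 0 next to `trainRunningCorrect`):

```python
# Inside the batch loop, replacing the local bookkeeping (sketch):
_, preds = torch.max(outputs.data, 1)

# Local counts as 1-element tensors so they can be gathered.
correct = (preds == labels).sum().unsqueeze(0)
total = torch.tensor([labels.size(0)], device=labels.device)

# accelerator.gather() concatenates the tensors from all processes;
# summing them gives global counts, so every process logs the same accuracy.
trainRunningCorrect += accelerator.gather(correct).sum().item()
trainRunningTotal += accelerator.gather(total).sum().item()

# After the epoch:
history['train acc'].append(100 * trainRunningCorrect / trainRunningTotal)
```

I have also seen `accelerator.gather_for_metrics` mentioned for newer accelerate versions, which as far as I understand additionally drops the samples that the sharded dataloader duplicates to fill the last batch. Would that be the more idiomatic option here?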