Question about calculating the training loss in multi-GPU training with Accelerate

I am currently training a model on Kaggle with Accelerate (2 T4 GPUs), and I'm confused about how to correctly calculate and log the training loss. According to some discussions
(https://github.com/huggingface/accelerate/issues/2109, https://github.com/huggingface/accelerate/issues/226)
and some official examples (complete_nlp_example), it appears that I may only need to log the training loss on the main process.
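
For reference, this is roughly what I understand the "main process only" pattern to look like. It is a minimal sketch of my own, not copied from complete_nlp_example, and dummy_loss is just a placeholder for the real training loss:

from accelerate import Accelerator
import torch

# Minimal sketch of the "log only on the main process" pattern as I understand
# it from the linked issues/examples. `dummy_loss` stands in for a real
# training loss computed on this process's shard of the data.
accelerator = Accelerator()
dummy_loss = torch.tensor(0.5, device=accelerator.device)

if accelerator.is_main_process:
    # Only rank 0 logs, using the loss value it computed locally.
    print(f'train loss (main process only): {dummy_loss.item()}')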

I ran a small test in Kaggle with the following code:

from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed
import torch

# Tiny synthetic dataset: 9 scalar "logits" and 9 random targets.
data = torch.arange(0, 9).float()
data.requires_grad = True
label = torch.randint(0, 10, (9,)).float()
dataset = list(zip(data, label))

def func():
    set_seed(1234)
    accelerator = Accelerator()
    criterion = torch.nn.L1Loss(reduction='none')  # keep per-sample losses
    test_dl = torch.utils.data.DataLoader(dataset, batch_size=6, shuffle=True)
    test_dl, criterion = accelerator.prepare(test_dl, criterion)

    for batch in test_dl:
        logits, labels = batch
        loss = criterion(logits, labels)  # per-sample loss on this process's shard
        print(f'{accelerator.device} Loss = {loss}\n')

        accelerator.backward(loss.mean())
        # Compare gathering the already-reduced (per-process mean) loss
        # vs. gathering the unreduced per-sample losses.
        print(f'{accelerator.device} Gather reduced loss: {accelerator.gather_for_metrics(loss.mean())}\n')
        print(f'{accelerator.device} Gather loss with no reduction: {accelerator.gather_for_metrics(loss)}\n')

notebook_launcher(func, num_processes=2)

and here are the results:

According to the results above, the loss does differ across processes. So, in my opinion, computing the training loss only on the main process may be slightly incorrect, because each process sees a different portion of the dataset. To compute the loss over the entire dataset, I would need to gather the unreduced per-sample losses from all processes and then take their mean.
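
In code, the alternative I have in mind would look roughly like this (just a sketch; per_sample_loss is a placeholder for criterion(logits, labels) with reduction='none'):

from accelerate import Accelerator
import torch

# Sketch of the gather-then-average approach: collect the unreduced per-sample
# losses from every process with gather_for_metrics (which concatenates the
# per-process tensors and, when used with a prepared dataloader, also handles
# the samples duplicated to even out the last batch), then take the mean.
accelerator = Accelerator()
per_sample_loss = torch.rand(6, device=accelerator.device)  # placeholder for criterion(logits, labels)

all_losses = accelerator.gather_for_metrics(per_sample_loss)
if accelerator.is_main_process:
    print(f'mean loss over all processes: {all_losses.mean().item()}')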

So which approach is correct, or have I made a mistake somewhere?