Question about calculating the training loss in multi-GPU training with Accelerate

I am currently training a model on Kaggle with Accelerate (2 T4 GPUs), and I'm confused about how to correctly calculate and log the training loss. According to some discussions
(https://github.com/huggingface/accelerate/issues/2109, https://github.com/huggingface/accelerate/issues/226)
and some official examples (complete_nlp_example), it appears that I may only need to log the training loss on the main process.
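
For reference, this is roughly what I understand the "main process only" pattern to look like. It is a minimal sketch of my own, not copied from complete_nlp_example, and dummy_loss is just a placeholder for the real training loss:

from accelerate import Accelerator
import torch

# Minimal sketch of the "log only on the main process" pattern as I understand
# it from the linked issues/examples. `dummy_loss` stands in for a real
# training loss computed on this process's shard of the data.
accelerator = Accelerator()
dummy_loss = torch.tensor(0.5, device=accelerator.device)

if accelerator.is_main_process:
    # Only rank 0 logs, using the loss value it computed locally.
    print(f'train loss (main process only): {dummy_loss.item()}')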

I ran a small test in Kaggle with the following code:

from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed
import torch

# Tiny synthetic dataset: 9 scalar "logits" and 9 random targets.
data = torch.arange(0, 9).float()
data.requires_grad = True
label = torch.randint(0, 10, (9,)).float()
dataset = list(zip(data, label))

def func():
    set_seed(1234)
    accelerator = Accelerator()
    criterion = torch.nn.L1Loss(reduction='none')  # keep per-sample losses
    test_dl = torch.utils.data.DataLoader(dataset, batch_size=6, shuffle=True)
    test_dl, criterion = accelerator.prepare(test_dl, criterion)

    for batch in test_dl:
        logits, labels = batch
        loss = criterion(logits, labels)  # per-sample loss on this process's shard
        print(f'{accelerator.device} Loss = {loss}\n')

        accelerator.backward(loss.mean())
        # Compare gathering the already-reduced (per-process mean) loss
        # vs. gathering the unreduced per-sample losses.
        print(f'{accelerator.device} Gather reduced loss: {accelerator.gather_for_metrics(loss.mean())}\n')
        print(f'{accelerator.device} Gather loss with no reduction: {accelerator.gather_for_metrics(loss)}\n')

notebook_launcher(func, num_processes=2)

and here are the results:

According to the results above, the loss does differ across processes. So, in my opinion, computing the training loss only on the main process may be slightly incorrect, because each process sees a different portion of the dataset. To compute the loss over the entire dataset, I would need to gather the unreduced per-sample losses from all processes and then take their mean.
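
In code, the alternative I have in mind would look roughly like this (just a sketch; per_sample_loss is a placeholder for criterion(logits, labels) with reduction='none'):

from accelerate import Accelerator
import torch

# Sketch of the gather-then-average approach: collect the unreduced per-sample
# losses from every process with gather_for_metrics (which concatenates the
# per-process tensors and, when used with a prepared dataloader, also handles
# the samples duplicated to even out the last batch), then take the mean.
accelerator = Accelerator()
per_sample_loss = torch.rand(6, device=accelerator.device)  # placeholder for criterion(logits, labels)

all_losses = accelerator.gather_for_metrics(per_sample_loss)
if accelerator.is_main_process:
    print(f'mean loss over all processes: {all_losses.mean().item()}')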

So which approach is correct, or have I made a mistake somewhere?