Model Randomness Introduced by DDP

I am using the Hugging Face Trainer with DDP.

I am trying to debug why performance gets worse the more GPUs I add to training.

I noticed this weird behaviour: I train on a single batch of 8 examples and print out the loss for each one, comparing two configurations (sketched in code after the list):

i) gradient accumulation 8 with 1 GPU

ii) gradient accumulation 1 with 8 GPUs
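
Both runs have an effective batch size of 8. Roughly, the two configurations look like this (a sketch, assuming a per-device batch size of 1 and the usual `torchrun` launcher; `output_dir` and all other arguments are placeholders):

```python
from transformers import TrainingArguments

# (i) 1 GPU, launched as a single process: 8 micro-batches are accumulated
args_single_gpu = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

# (ii) 8 GPUs, launched with torchrun --nproc_per_node=8: each rank sees 1 example
args_ddp = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
)
```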

Notice how the only dataset value on which the two runs agree on the loss is the one processed on GPU 0. It's as if there is some additional randomness associated with the GPU rank.

The “dataset” value is a hash of the input text in each case.
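
Roughly how the per-example losses are obtained and printed, for context — this is a simplified sketch rather than my exact code; the causal-LM loss computation and the environment-variable rank lookup are assumptions, and the "dataset" hash is computed from the raw text upstream and carried alongside the batch (not shown here):

```python
import os
import torch
from transformers import Trainer


class LossLoggingTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        labels = inputs["labels"]

        # Unreduced causal-LM loss: shift logits/labels, then average over
        # tokens so we get one loss value per example in the micro-batch.
        shift_logits = outputs.logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        token_loss = torch.nn.functional.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            reduction="none",
            ignore_index=-100,
        ).view(shift_labels.size(0), -1)
        mask = (shift_labels != -100).float()
        per_example = (token_loss * mask).sum(dim=1) / mask.sum(dim=1)

        print({
            "loss": per_example.mean().item(),
            "losses": per_example,
            "rank": os.environ.get("RANK", "0"),
            "local_rank": os.environ.get("LOCAL_RANK", "0"),
        })
        loss = per_example.mean()
        return (loss, outputs) if return_outputs else loss
```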

8 GPUs (gradient accumulation 1):

{'loss': 0.7977094054222107, 'losses': tensor([[0.7977]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1277834], 'rank': '0', 'local_rank': '0'}
{'loss': 0.012322795577347279, 'losses': tensor([[0.0123]], device='cuda:1', grad_fn=<NegBackward0>), 'dataset': [1273165], 'rank': '1', 'local_rank': '1'}
{'loss': 1.1729881763458252, 'losses': tensor([[1.1730]], device='cuda:2', grad_fn=<NegBackward0>), 'dataset': [1406156], 'rank': '2', 'local_rank': '2'}
{'loss': 0.4811226427555084, 'losses': tensor([[0.4811]], device='cuda:3', grad_fn=<NegBackward0>), 'dataset': [1847032], 'rank': '3', 'local_rank': '3'}
{'loss': 7.391242027282715, 'losses': tensor([[7.3912]], device='cuda:4', grad_fn=<NegBackward0>), 'dataset': [1082392], 'rank': '4', 'local_rank': '4'}
{'loss': 4.4031829833984375, 'losses': tensor([[4.4032]], device='cuda:5', grad_fn=<NegBackward0>), 'dataset': [900771], 'rank': '5', 'local_rank': '5'}
{'loss': 0.451646625995636, 'losses': tensor([[0.4516]], device='cuda:6', grad_fn=<NegBackward0>), 'dataset': [1196583], 'rank': '6', 'local_rank': '6'}
{'loss': 1.757645606994629, 'losses': tensor([[1.7576]], device='cuda:7', grad_fn=<NegBackward0>), 'dataset': [986218], 'rank': '7', 'local_rank': '7'}


1 GPU (gradient accumulation 8):

Same loss as the 8-GPU run:
{'loss': 0.7977094054222107, 'losses': tensor([[0.7977]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1277834], 'rank': '0', 'local_rank': '0'}

Different losses:
{'loss': 0.019877666607499123, 'losses': tensor([[0.0199]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1273165], 'rank': '0', 'local_rank': '0'}
{'loss': 0.514133095741272, 'losses': tensor([[0.5141]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1406156], 'rank': '0', 'local_rank': '0'}
{'loss': 0.147588849067688, 'losses': tensor([[0.1476]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1847032], 'rank': '0', 'local_rank': '0'}
{'loss': 4.904898643493652, 'losses': tensor([[4.9049]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1082392], 'rank': '0', 'local_rank': '0'}
{'loss': 0.6921711564064026, 'losses': tensor([[0.6922]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [900771], 'rank': '0', 'local_rank': '0'}
{'loss': 0.2580835819244385, 'losses': tensor([[0.2581]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [1196583], 'rank': '0', 'local_rank': '0'}
{'loss': 0.19682921469211578, 'losses': tensor([[0.1968]], device='cuda:0', grad_fn=<NegBackward0>), 'dataset': [986218], 'rank': '0', 'local_rank': '0'}
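
My current suspicion is that each rank consumes torch's RNG differently (e.g. for dropout masks), so identical inputs and weights still give different losses. Below is a sketch of the kind of check I'm thinking of running right before the forward pass; the function name is illustrative, and it assumes the process group has already been initialised by the launcher:

```python
import torch
import torch.distributed as dist


def report_rng_state(tag: str = "") -> None:
    # Cheap fingerprints of the CPU and CUDA RNG states on this rank.
    # If these match across ranks (and across micro-batches in the
    # single-GPU run), the next dropout mask drawn would be identical too;
    # if they differ, that would explain the extra per-rank randomness.
    rank = dist.get_rank() if dist.is_initialized() else 0
    cpu_fp = int(torch.get_rng_state().sum())
    cuda_fp = int(torch.cuda.get_rng_state().sum())
    print({"tag": tag, "rank": rank, "cpu_rng": cpu_fp, "cuda_rng": cuda_fp})
```

A quicker sanity check along the same lines: if I rerun the comparison with dropout effectively disabled (dropout probabilities set to 0 in the model config), the per-example losses should match across the two runs if RNG state really is the culprit.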
