Clarification on training metrics

In the example scripts (e.g. accelerate/ at main · huggingface/accelerate · GitHub), a variable total_loss is used to compute the average loss over the training datapoints, which is then logged via accelerator.log.

Is the resulting metric process-specific, or is the loss somehow aggregated across processes?

In the former case (i.e. if the metric is the average loss for a single process), is there a recommended way to compute the metric across all processes during training? I assume gather_for_metrics could be used, but would this incur any additional cost?
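To make the distinction concrete, here is a minimal sketch in plain Python (simulating two processes with hypothetical loss values, no accelerate involved) of how a per-process average can differ from the average over all datapoints that a cross-process gather would produce:

```python
# Simulate loss accumulation on 2 "processes" (hypothetical values).
# Each process sees a different shard of the data, so its local
# average loss generally differs from the others.
process_losses = [
    [0.9, 0.7, 0.5],  # losses seen by process 0
    [1.3, 1.1, 0.9],  # losses seen by process 1
]

# What a total_loss accumulator computes on each rank: a process-local average.
local_averages = [sum(losses) / len(losses) for losses in process_losses]

# What a cross-process metric would be: the average over ALL datapoints,
# as obtained after gathering the losses from every rank.
all_losses = [loss for losses in process_losses for loss in losses]
global_average = sum(all_losses) / len(all_losses)

print(local_averages)  # each rank would log a different number
print(global_average)  # one shared number across ranks
```

If each rank logs only its local average, the logged curves can diverge between ranks even though training is synchronized; gathering (or reducing) first yields a single consistent metric.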