In the example scripts (e.g. accelerate/complete_cv_example.py at main · huggingface/accelerate · GitHub), a variable total_loss is used to compute the average loss over the training datapoints, which is then logged via accelerator.log.
Is the resulting metric process-specific, or is the loss somehow aggregated across processes?
If it's the former (i.e. the metric is the average loss for a single process), is there a recommended way to compute the metric across all processes during training? I assume gather_for_metrics could be used (roughly as in the sketch below), but would that add any extra cost?
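In case it helps, here's roughly what I had in mind; a minimal sketch only, assuming accelerator.reduce / gather_for_metrics behave as documented. The dataloader and loss below are dummy stand-ins, not the example script's code:

```python
# Rough sketch of aggregating the per-process average loss before logging.
# The dataset and loss are placeholders just so the loop runs.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # trackers omitted; accelerator.log would need log_with=...

dataset = TensorDataset(torch.randn(32, 4))
train_dataloader = accelerator.prepare(DataLoader(dataset, batch_size=8))

for epoch in range(2):
    total_loss = torch.tensor(0.0, device=accelerator.device)
    for (x,) in train_dataloader:
        loss = (x ** 2).mean()  # stand-in for the real forward pass
        total_loss += loss.detach().float()

    per_process_avg = total_loss / len(train_dataloader)

    # One all-reduce on a scalar: averages the metric across all processes.
    global_avg = accelerator.reduce(per_process_avg, reduction="mean")

    # Alternative: gather each process's value and average locally.
    # global_avg = accelerator.gather_for_metrics(per_process_avg).mean()

    accelerator.print(f"epoch {epoch}: train_loss {global_avg.item():.4f}")
    # With trackers configured this would instead be:
    # accelerator.log({"train_loss": global_avg.item()}, step=epoch)
```

My guess is that reducing a single scalar per epoch is cheap compared to the training step itself, but I'd like to confirm that.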
Thanks!