How does loss/metric reporting work with DeepSpeed and transformers.Trainer?

I am training my model with DeepSpeed using the Hugging Face Trainer API. I was wondering how Trainer/Accelerate handles the logging of loss and compute_metrics. My observation is that DeepSpeed spawns a separate process per GPU for training, and each process has its own Trainer instance. Based on this observation, I have the following hypothesis:

Accelerate will gather the losses from all processes and average them in one process before logging them to TensorBoard.
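Concretely, my hypothesis is that something equivalent to the sketch below happens before a loss value reaches TensorBoard. This is purely illustrative of what I assume, not actual Accelerate/DeepSpeed internals:

```python
def averaged_loss_for_logging(per_process_losses):
    """What I assume happens under the hood: each GPU process computes
    its own loss, the scalar values are gathered across processes
    (e.g. via an all-reduce), and only the mean is written to
    TensorBoard by the main process."""
    return sum(per_process_losses) / len(per_process_losses)

# e.g. with 4 GPU processes reporting their local losses:
logged = averaged_loss_for_logging([0.9, 1.1, 1.0, 1.2])
```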

So my questions are:

  1. Is this hypothesis correct, or do I need additional handling for reporting when using DeepSpeed?
  2. Does this hypothesis hold during the evaluation phase as well?
  3. Does this hypothesis hold for the compute_metrics function as well? If not, how do I write a process-safe compute_metrics function that reports global statistics?
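For context on question 3, here is the shape of the compute_metrics function I have in mind (the function body and metric are illustrative; EvalPrediction unpacks to (predictions, label_ids)):

```python
import numpy as np

# Illustrative compute_metrics, as passed to Trainer(compute_metrics=...).
# Whether eval_pred.predictions / eval_pred.label_ids are already gathered
# across all DeepSpeed processes when this is called is exactly what I am
# asking about.
def compute_metrics(eval_pred):
    logits, labels = eval_pred  # predictions, label_ids
    preds = np.argmax(logits, axis=-1)
    accuracy = float((preds == labels).mean())
    return {"accuracy": accuracy}
```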

Thanks.