Do the Trainer and Callback get created multiple times in a distributed setup?

Hi Everyone,

I am fine-tuning an LLM on multiple GPUs with Accelerate. I also need to add a callback to the Trainer to sample and log predictions using Weights & Biases.

from transformers.integrations import WandbCallback

class WandbLLMSampleCallback(WandbCallback):
    def __init__(
        self,
        trainer,
        test_dataset,
        num_samples=10,
        max_new_tokens=256,
        log_model="checkpoint",
    ):
        super().__init__()
        ...

    def on_evaluate(self, args, state, control, **kwargs):
        super().on_evaluate(args, state, control, **kwargs)
        # log the sample table only once, from the main process
        if state.is_world_process_zero:
            records_table = self._samples_table(self.sample_dataset)
            self._wandb.log({"sample_predictions": records_table})
...
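
For context, this is roughly how I wire it up (a simplified sketch: model, train_dataset, and test_dataset stand in for my actual objects, the argument values are illustrative, and the script is started with accelerate launch train.py):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",  # `evaluation_strategy` on older transformers versions
    eval_steps=100,         # on_evaluate fires at every evaluation
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# attach the sampling callback after the Trainer exists, since it needs a reference to it
trainer.add_callback(
    WandbLLMSampleCallback(trainer, test_dataset, num_samples=10, max_new_tokens=256)
)
trainer.train()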

I observed that the on_evaluate method is called multiple times, equal to the number of processes.
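
For reference, a minimal probe callback along these lines (ProcessProbeCallback is just an illustrative name; PartialState comes from Accelerate) makes it easy to see which process each on_evaluate call comes from:

import os

from accelerate import PartialState
from transformers import TrainerCallback

class ProcessProbeCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        # PartialState reflects the distributed state set up by accelerate launch
        dist_state = PartialState()
        print(
            f"pid={os.getpid()} "
            f"process_index={dist_state.process_index}/{dist_state.num_processes} "
            f"callback_id={id(self)} "
            f"is_world_process_zero={state.is_world_process_zero}"
        )

trainer.add_callback(ProcessProbeCallback())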

My questions are:
In a distributed setup, does Accelerate create multiple Trainer objects, each with its own TrainerState, or one Trainer object with multiple TrainerStates?

When we do trainer.add_callback(WandbInputLoggerCallback(tokenizer)), does each Trainer get its own instance of the callback? If we already need to put a check like if state.is_world_process_zero in the callback, does it even make sense to create the redundant callback instances and add them to the other trainers? Or should we do:

if trainer.accelerator.is_main_process:
    trainer.add_callback(WandbInputLoggerCallback(tokenizer))

I'd appreciate your advice on this.

Thanks
Anindya