Multiple wandb outputs

I noticed that when I train a model with the Accelerate library, the number of syncable runs written to wandb matches the number of GPUs I configure Accelerate to use. If I have 4 GPUs available and configure Accelerate to use 2 of them, two output directories appear in the /path/to/my/project/wandb directory, each of which I can sync to view the various plots.

These appear to be different directories, but I was wondering whether it’s possible to have the wandb runs aggregated somehow, so as to get a single big-picture view of what took place during training?

Make sure that when you initialize your trackers it’s under an if accelerator.is_main_process check, as in the sketch below (docs need to be updated, will do so today):
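
A minimal sketch of that pattern (the project name here is just a placeholder):

from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")

# Initialize the tracker on the main process only, so a single wandb
# run is created no matter how many processes accelerate launches.
if accelerator.is_main_process:
    accelerator.init_trackers("my_project")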

@muellerzr Thank you very much for your response. Below is what I originally had:

from accelerate import Accelerator
import wandb
import torch

class MyRun:
    def __init__(self) -> None:
        
        self.tracker_project_name = 'my_project'
        self.tracker_run_name = 'my_run'
        self.is_accelerate = True
        self.tracker_name = 'wandb'
        
        if self.is_accelerate:
            self.accelerator = Accelerator(log_with=[self.tracker_name])
            self.accelerator.init_trackers(self.tracker_project_name)
            self.accelerator.trackers[0].run.name = self.tracker_run_name
        
        # if self.is_accelerate:
        #     self.accelerator = Accelerator(log_with=[self.tracker_name])

        # if self.is_accelerate and self.accelerator.is_main_process:
        #     self.accelerator.init_trackers(self.tracker_project_name)
        #     self.accelerator.trackers[0].run.name = self.tracker_run_name
            
        else:
            wandb.init(project=self.tracker_project_name)
            wandb.run.name = self.tracker_run_name
            self.device = torch.device("cpu")

myrun = MyRun()

The output I get is:

accelerate launch tmp.py
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb:                                                                                
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100211-1nh4853d
wandb: Find logs at: ./wandb/offline-run-20220809_100211-1nh4853d/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100211-3sq46azq
wandb: Find logs at: ./wandb/offline-run-20220809_100211-3sq46azq/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100211-apk9120l
wandb: Find logs at: ./wandb/offline-run-20220809_100211-apk9120l/logs

You can see that three offline-run directories are created. If I change the code to:

from accelerate import Accelerator
import wandb
import torch

class MyRun:
    def __init__(self) -> None:
        
        self.tracker_project_name = 'my_project'
        self.tracker_run_name = 'my_run'
        self.is_accelerate = True
        self.tracker_name = 'wandb'
        
        # if self.is_accelerate:
        #     self.accelerator = Accelerator(log_with=[self.tracker_name])
        #     self.accelerator.init_trackers(self.tracker_project_name)
        #     self.accelerator.trackers[0].run.name = self.tracker_run_name
        
        if self.is_accelerate:
            self.accelerator = Accelerator(log_with=[self.tracker_name])

        if self.is_accelerate and self.accelerator.is_main_process:
            self.accelerator.init_trackers(self.tracker_project_name)
            self.accelerator.trackers[0].run.name = self.tracker_run_name
            
        else:
            wandb.init(project=self.tracker_project_name)
            wandb.run.name = self.tracker_run_name
            self.device = torch.device("cpu")

myrun = MyRun()

I get the following output:

accelerate launch tmp.py
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb:                                                                                
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100510-14o40pkk
wandb: Find logs at: ./wandb/offline-run-20220809_100510-14o40pkk/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100510-17jusbof
wandb: Find logs at: ./wandb/offline-run-20220809_100510-17jusbof/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100510-ydjk3ywn
wandb: Find logs at: ./wandb/offline-run-20220809_100510-ydjk3ywn/logs

I see that there are still three offline-run directories created, which corresponds to having configured Accelerate to use 3 of the 4 GPUs available to me.

What I presume would happen if I wrote out a full training loop is that each GPU would see a different part of the training data. As such, each call to something like wandb.log(my_var) would log a value of my_var computed on whichever portion of the data that GPU received (and that value would, in theory, differ from GPU to GPU).

Syncing these different offline-run directories to wandb would then show three different trends of my_var, one for each portion of the data placed on a GPU. I’m more interested in the aggregate of my_var, to get a complete overall picture of the training session. If my_var were something like the training classification accuracy of my model, I’d want to answer the question “what is the training accuracy?” from the aggregate, not from the 3 individual training accuracies logged in the separate offline-run directories.
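
For illustration, what I have in mind is something like gathering the metric across processes and logging only the aggregate. A hypothetical sketch I have not run (per_gpu_accuracy is a placeholder for a value computed on each GPU’s shard of the data):

import torch

# Gather the per-process metric onto every process, then log the
# mean from the main process only.
acc = torch.tensor([per_gpu_accuracy], device=accelerator.device)
all_acc = accelerator.gather(acc)  # one entry per process
if accelerator.is_main_process:
    accelerator.log({"train_accuracy": all_acc.mean().item()})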

Is it possible to do something like this? I appreciate all your help and I apologize if my questions are naive. I’m a new user of Accelerate.

Hey @aclifton314, you can use “grouping” in wandb to achieve what you’re trying to do, I think.

One way is to pass the group argument to wandb.init with your own name or id for the experiment, e.g. “my_experiment_1”. Then you will be able to group the runs from the same experiment into one in the UI, e.g. your plots can show the average loss across the 3 GPUs used in “my_experiment_1” instead of 3 separate loss curves.
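
With plain wandb, that looks roughly like this (the project and group names are placeholders):

import wandb

# Processes that pass the same group name are grouped together in the
# W&B UI, so per-GPU runs roll up into one experiment.
wandb.init(project="my_project", group="my_experiment_1")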

To pass arguments through to wandb.init when using Accelerate, you can use the init_kwargs keyword of init_trackers and pass it a nested dictionary. In the case below, this passes a value to the group argument of wandb.init:

accelerator.init_trackers("my_project", config=hps, init_kwargs={"wandb":{"group":"my_experiment_1"}})

Also, for distributed training, make sure you are using wandb 0.13 (released last week), as it improves support for distributed training.

@morgan I’m using Accelerate v0.9.0 and get the following error:

Traceback (most recent call last):
  File "/home/aclifton/rf_fp/run_training.py", line 544, in <module>
    run_training_pipeline(config_files_dict_list)
  File "/home/aclifton/rf_fp/run_training.py", line 32, in run_training_pipeline
    rffp_run = rffprun.RFFPRun(run_config_file_path)
  File "/home/aclifton/rf_fp/rffprun.py", line 56, in __init__
    self.accelerator.init_trackers(self.tracker_project_name, init_kwargs={self.tracker_name:{'group': self.tracker_run_name}})
TypeError: init_trackers() got an unexpected keyword argument 'init_kwargs'

Do I need to upgrade accelerate or is there another way to initialize those keyword arguments for wandb?

You should update your Accelerate version to 0.12.0; the init_kwargs argument was added in a recent update.
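
For example:

pip install --upgrade accelerate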

@muellerzr @morgan done and done! works great! Thank you both!
