@muellerzr Thank you very much for your response. Below is what I originally had:
from accelerate import Accelerator
import wandb
import torch

class MyRun:
    def __init__(self) -> None:
        self.tracker_project_name = 'my_project'
        self.tracker_run_name = 'my_run'
        self.is_accelerate = True
        self.tracker_name = 'wandb'
        if self.is_accelerate:
            self.accelerator = Accelerator(log_with=[self.tracker_name])
            self.accelerator.init_trackers(self.tracker_project_name)
            self.accelerator.trackers[0].run.name = self.tracker_run_name
        # if self.is_accelerate:
        #     self.accelerator = Accelerator(log_with=[self.tracker_name])
        # if self.is_accelerate and self.accelerator.is_main_process:
        #     self.accelerator.init_trackers(self.tracker_project_name)
        #     self.accelerator.trackers[0].run.name = self.tracker_run_name
        else:
            wandb.init(project=self.tracker_project_name)
            wandb.run.name = self.tracker_run_name
        self.device = torch.device("cpu")

myrun = MyRun()
The output I get is:
accelerate launch tmp.py
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb:
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100211-1nh4853d
wandb: Find logs at: ./wandb/offline-run-20220809_100211-1nh4853d/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100211-3sq46azq
wandb: Find logs at: ./wandb/offline-run-20220809_100211-3sq46azq/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100211-apk9120l
wandb: Find logs at: ./wandb/offline-run-20220809_100211-apk9120l/logs
You can see that three offline-run directories are created. If I change the code to:
from accelerate import Accelerator
import wandb
import torch

class MyRun:
    def __init__(self) -> None:
        self.tracker_project_name = 'my_project'
        self.tracker_run_name = 'my_run'
        self.is_accelerate = True
        self.tracker_name = 'wandb'
        # if self.is_accelerate:
        #     self.accelerator = Accelerator(log_with=[self.tracker_name])
        #     self.accelerator.init_trackers(self.tracker_project_name)
        #     self.accelerator.trackers[0].run.name = self.tracker_run_name
        if self.is_accelerate:
            self.accelerator = Accelerator(log_with=[self.tracker_name])
        if self.is_accelerate and self.accelerator.is_main_process:
            self.accelerator.init_trackers(self.tracker_project_name)
            self.accelerator.trackers[0].run.name = self.tracker_run_name
        else:
            wandb.init(project=self.tracker_project_name)
            wandb.run.name = self.tracker_run_name
        self.device = torch.device("cpu")

myrun = MyRun()
I get the following output:
accelerate launch tmp.py
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Tracking run with wandb version 0.12.17
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb:
wandb:
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100510-14o40pkk
wandb: Find logs at: ./wandb/offline-run-20220809_100510-14o40pkk/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100510-17jusbof
wandb: Find logs at: ./wandb/offline-run-20220809_100510-17jusbof/logs
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/aclifton/rf_fp/wandb/offline-run-20220809_100510-ydjk3ywn
wandb: Find logs at: ./wandb/offline-run-20220809_100510-ydjk3ywn/logs
I see that three offline-run directories are still created, which matches having configured accelerate to use 3 of my 4 available GPUs.
What I presume would happen if I wrote out a full training loop is that each GPU sees a different part of the training data. As such, each call to something like wandb.log(my_var) would log the value of my_var computed from whichever portion of the data landed on that GPU (which would, in theory, differ across GPUs). Syncing these different offline-run directories to wandb would then show three separate trends of my_var, one per data shard. I'm more interested in the aggregate of my_var, to get an overall complete picture of the training session. If my_var were something like the training classification accuracy of my model, I'd want to answer the question "what is the training accuracy?" from the aggregate, not from the 3 individual training accuracies logged in the separate offline-run directories.
Is it possible to do something like this? I appreciate all your help, and I apologize if my questions are naive. I'm a new user of Accelerate.