Weights & Biases sweep with multi-GPU accelerate launch

Hi, I am trying to use Accelerate with multiple GPUs on a single machine together with a Weights & Biases sweep, but I could not find any documentation specifically about this topic.

I tried to accomplish this with the following approach, but I am getting errors:

In the main function, the accelerator is initialized as follows, and the model parameters are taken from the wandb config. The wandb run is then set up and initialized as follows:

from accelerate import Accelerator
import wandb

def main() -> None:
    # Create accelerator for distributed training and logging
    accelerator = Accelerator(
        split_batches=True,
        mixed_precision="fp16",
        log_with="wandb",
    )

    # Initialize wandb run
    if accelerator.is_main_process:
        accelerator.init_trackers(
            project_name=WANDB_PROJECT,
            init_kwargs={
                "wandb": {
                    "entity": WANDB_ENTITY,
                    "dir": WANDB_EXPERIMENT_DIR,
                }
            },
        )

    # Log configuration from wandb tracker
    learning_rate = wandb.config.learning_rate
    base_model_name = wandb.config.base_model_name
    resize_resolution = wandb.config.resize_resolution
    model_type = MODEL_TYPES[base_model_name]

if __name__ == "__main__":

    # Define sweep config
    SWEEP_CONFIG = {
        "method": "random",
        "early_terminate": {
            "type": "hyperband",
            "min_iter": 3,
        },
        "name": "sweep",
        "metric": {"goal": "maximize", "name": "Valid Acc"},
        "parameters": {
            # Discrete learning-rate choices (a log_uniform_values distribution could be used instead)
            "learning_rate": {
                "values": [
                    1e-4,
                    1e-3,
                    1e-2,
                ]
            },
            "base_model_name": {
                "values": [
                    "facebook/convnext-tiny-224",
                    "facebook/convnext-small-224",
                    "microsoft/swinv2-tiny-patch4-window8-256",
                ]
            },
            "resize_resolution": {"values": ["model_base", "sd"]},
        },
    }

    # Initialize sweep by passing in config
    sweep_id = wandb.sweep(
        sweep=SWEEP_CONFIG,
        project=WANDB_PROJECT,
        entity=WANDB_ENTITY,
    )

    # Start sweep job.
    wandb.agent(sweep_id, function=main, count=1)

Later, from the CLI, I configure the Accelerate parameters with "accelerate config" and then run the script with "accelerate launch".
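For reference, the two steps look roughly like this (train.py is a placeholder for the actual script name):

accelerate config            # answer the prompts: single machine, multi-GPU, mixed precision, ...
accelerate launch train.py   # train.py is a placeholder; this spawns one process per GPU, each running the whole script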

I am getting the following output and error:

Create sweep with ID: z18ok842
Sweep URL:
Create sweep with ID: 350at2y9
Sweep URL:
Create sweep with ID: sca3d4qa
Sweep URL:
wandb: Agent Starting Run: 1oaggrmg with config:
wandb: base_model_name: facebook/convnext-tiny-224
wandb: learning_rate: 0.01
wandb: resize_resolution: sd
wandb: Agent Starting Run: y3o3i8tn with config:
wandb: base_model_name: facebook/convnext-small-224
wandb: learning_rate: 0.001
wandb: resize_resolution: sd
wandb: Agent Starting Run: vdcqv0d4 with config:
wandb: base_model_name: facebook/convnext-small-224
wandb: learning_rate: 0.01
wandb: resize_resolution: model_base
Create sweep with ID: vp6bdpbb
Sweep URL:
wandb: Agent Starting Run: xw1qzfpw with config:
wandb: base_model_name: facebook/convnext-tiny-224
wandb: learning_rate: 0.01
wandb: resize_resolution: model_base
Run xw1qzfpw errored: Error('You must call wandb.init() before wandb.config.learning_rate')
wandb: ERROR Run xw1qzfpw errored: Error('You must call wandb.init() before wandb.config.learning_rate')
Run y3o3i8tn errored: Error('You must call wandb.init() before wandb.config.learning_rate')
wandb: ERROR Run y3o3i8tn errored: Error('You must call wandb.init() before wandb.config.learning_rate')
Run vdcqv0d4 errored: Error('You must call wandb.init() before wandb.config.learning_rate')
wandb: ERROR Run vdcqv0d4 errored: Error('You must call wandb.init() before wandb.config.learning_rate')

Do you have a recommendation on how to integrate Accelerate with a wandb sweep?

Thank you.


Hey @berkin, I think this can be fixed by moving the following code under the same main-process check you used when initialising wandb. Since accelerate launch runs the whole script once per process, every process creates its own sweep and tries to read wandb.config, but only the main process has actually called wandb.init() (via init_trackers):

if accelerator.is_main_process:
    # Log configuration from wandb tracker
    learning_rate = wandb.config.learning_rate
    base_model_name = wandb.config.base_model_name
    resize_resolution = wandb.config.resize_resolution
    model_type = MODEL_TYPES[base_model_name]

Thanks for your reply. I made the modification you recommended, but I still had to save the config to a JSON file so that all processes can access it:

import json

if accelerator.is_main_process:
    shared_config = {
        "learning_rate": wandb.config.learning_rate,
        "base_model_name": wandb.config.base_model_name,
        "resize_resolution": wandb.config.resize_resolution,
    }

    # Share wandb.config from the main process by writing it to disk
    with open(SHARED_CONFIG_FILE, "w") as f:
        json.dump(shared_config, f)

accelerator.wait_for_everyone()

# Every process reads the configuration back from disk
with open(SHARED_CONFIG_FILE) as f:
    shared_config = json.load(f)
learning_rate = shared_config["learning_rate"]
base_model_name = shared_config["base_model_name"]
resize_resolution = shared_config["resize_resolution"]
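A file-free alternative might be Accelerate's broadcast_object_list utility; a minimal sketch, assuming the Accelerate version in use exposes it:

from accelerate.utils import broadcast_object_list

# Only the main process can read wandb.config; the others start with a placeholder
if accelerator.is_main_process:
    payload = [{
        "learning_rate": wandb.config.learning_rate,
        "base_model_name": wandb.config.base_model_name,
        "resize_resolution": wandb.config.resize_resolution,
    }]
else:
    payload = [None]

# Broadcast the dict from process 0 to all processes (the list is updated in place)
broadcast_object_list(payload, from_process=0)
shared_config = payload[0]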

Hi,

I’m a bit confused about where to initialise the Accelerator:

  • If I initialise it inside the main function (the function that is passed to wandb.agent) then:
    • The first run in the sweep works, but for the second one I get the error: AcceleratorState has already been initialized and cannot be changed, restart your runtime completely and pass mixed_precision='bf16' to Accelerate().
      • Presumably this is due to calling accelerator = Accelerator(log_with="wandb", mixed_precision="bf16") again when the main function is run for a second time? Although @berkin this seems to be what you’re doing - did you see this error?
  • If I initialise it outside of the main function:
    • Again, the first run in the sweep works, but for the second one I get the error: You can't use same Accelerator() instance with multiple models when using DeepSpeed.
      • This occurs when calling accelerator.prepare(...) and seems to suggest that I should actually be creating a new Accelerator instance (i.e. by initialising again inside the main function). Both variants are sketched below.
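
For concreteness, a minimal sketch of the two placements (main_inside and main_outside are hypothetical names, and the training bodies are elided):

from accelerate import Accelerator

# Variant 1: initialise inside the function passed to wandb.agent; the second
# sweep run fails with "AcceleratorState has already been initialized ...".
def main_inside() -> None:
    accelerator = Accelerator(log_with="wandb", mixed_precision="bf16")
    ...  # accelerator.prepare(...), training loop

# Variant 2: initialise once at module level and reuse it across runs; the
# second run's accelerator.prepare(...) fails with "You can't use same
# Accelerator() instance with multiple models when using DeepSpeed".
accelerator = Accelerator(log_with="wandb", mixed_precision="bf16")

def main_outside() -> None:
    ...  # training code using the module-level accelerator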

@berkin @morgan I’d appreciate any suggestions; thanks!