Alternating Parameters in Accelerate

Hi Guys,

My question and context might be a bit long, but I think it will help a lot of people who are trying to train large models with limited GPU memory under a DDP setup. After trying a lot of things (flash-attention, fp16, etc.), I am now trying to exploit the fact that setting requires_grad=False saves GPU memory.
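
As a toy illustration of that property: a parameter with requires_grad=False never gets a .grad tensor during backward, so no memory is spent on its gradient. This is a minimal sketch only; TinyNet, its layer sizes, and the random input are placeholders assumed for illustration, not the real model.

```python
import torch
import torch.nn as nn

# Hypothetical toy model; the class name and layer sizes are placeholders.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 1)

    def forward(self, x):
        # Return a scalar so it can be used directly as a loss.
        return self.layer2(torch.relu(self.layer1(x))).mean()

toy = TinyNet()

# Freeze layer2 so autograd does not compute or store its gradients.
for p in toy.layer2.parameters():
    p.requires_grad = False

toy(torch.randn(4, 8)).backward()

# Frozen parameters keep .grad as None; trainable ones get a gradient tensor.
frozen_has_grad = any(p.grad is not None for p in toy.layer2.parameters())
trainable_has_grad = all(p.grad is not None for p in toy.layer1.parameters())
print(frozen_has_grad, trainable_has_grad)
```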

The most intuitive description of what I want is this:

  • I have a neural network with two layers layer1 and layer2.
  • I use two separate optimizers: opt1 for layer1 and opt2 for layer2.
  • During training, I alternate between the two optimizers and set requires_grad=False on the layer that is not being optimized.

The following code works perfectly well in the single-GPU case and squeezes training in just below the GPU memory limit:

import torch

# initialization
net = ...
opt1, opt2 = ..., ...

# training
for step, data in enumerate(data_loader):
    # alternating the parameter sets
    if step % 2 == 0:
        for param in net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer2.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt1
    else:
        for param in net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2
    
    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
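
Since the two branches are mirror images, the toggle could be folded into a small helper. This is only a sketch: set_trainable is a made-up name, the two nn.Linear layers are stand-ins, and it assumes that assigning param.grad = None releases the gradient in the same way as del param.grad.

```python
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradients for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad = trainable
        # Allocate a zero gradient when enabling (as in the loop above),
        # and drop the gradient tensor when freezing.
        param.grad = torch.zeros_like(param) if trainable else None

# Stand-in layers for illustration only; the sizes are placeholders.
layer1, layer2 = nn.Linear(4, 4), nn.Linear(4, 2)

for step in range(2):
    train_first = step % 2 == 0
    set_trainable(layer1, train_first)
    set_trainable(layer2, not train_first)
```

After the second iteration, layer1 is frozen with no gradients and layer2 is trainable with zero-initialized gradients.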

Then I tried to switch to multi-GPU training with Accelerate as shown below, adding the pieces that Accelerate needs. Please note that I re-wrap the model with accelerator.prepare after each alternation of the parameter set so that PyTorch's DDP can correctly register and reduce the gradients.

import torch
from accelerate import Accelerator

# initialization
accelerator = Accelerator()
net = ...
net = accelerator.prepare(net)
opt1, opt2 = ..., ...

for step, data in enumerate(data_loader):
    unwrap_net = accelerator.unwrap_model(net)  # remove the Accelerate (DDP) wrapper
    # alternating the parameter sets
    if step % 2 == 0:
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt1
    else:
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2
    
    new_net = accelerator.prepare(unwrap_net)
    del net
    del accelerator._models[0] 
    net = new_net
    torch.cuda.empty_cache()
    
    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
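
For anyone who wants to reproduce the gap, peak memory for the two setups could be compared roughly like this. It is a sketch under assumptions: peak_memory_mib and the run_step callable are hypothetical names, and on a machine without CUDA the helper just runs the step and reports 0.0.

```python
import torch

def peak_memory_mib(run_step, device=None) -> float:
    """Run `run_step()` once and return the peak CUDA memory allocated, in MiB."""
    if not torch.cuda.is_available():
        # No GPU to measure; still run the step so behavior is consistent.
        run_step()
        return 0.0
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)  # start from a clean peak counter
    run_step()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 2**20

# Usage with a dummy step; in practice run_step would wrap one training iteration.
calls = []
peak = peak_memory_mib(lambda: calls.append(1))
print(f"peak memory: {peak:.1f} MiB")
```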

Here's the question: I found that alternating the parameter set to optimize uses more GPU memory under Accelerate than training only layer1 or only layer2. Although the difference is just 1-2 GB, I can no longer squeeze the large model onto my GPU. So I am wondering:

  • Is there anything I am doing wrong that causes this issue?
  • Are there better approaches to alternating the parameter sets under a multi-GPU setting?

If you have read this far, my friend, I am truly grateful for your patience and help! May the force of optimization be with you!

Best,

Ziqi