Hi Guys,
My question and its context might be a bit long, but I think it will help a lot of people trying to train large models with limited GPU memory under a DDP setup. After trying a lot of things (flash-attention, fp16, etc.), I am now trying to leverage the fact that setting `requires_grad=False` saves GPU memory.
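To make that premise concrete, here is a minimal, self-contained sketch (the toy layer sizes are made up for illustration) showing that frozen parameters never get a `.grad` buffer allocated during backward:

```python
import torch
import torch.nn as nn

# a toy two-layer net; the sizes are made up for illustration
toy_net = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

# freeze the second layer
for p in toy_net[1].parameters():
    p.requires_grad = False

loss = toy_net(torch.randn(8, 1024)).sum()
loss.backward()

print(toy_net[0].weight.grad is not None)  # True: a gradient buffer was allocated
print(toy_net[1].weight.grad is None)      # True: no gradient memory for the frozen layer
```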
The most intuitive description of what I want is:
- I have a neural network with two layers, `layer1` and `layer2`.
- I use two separate optimizers: `opt1` for `layer1` and `opt2` for `layer2`.
- During training, I alternate between the two optimizers and set the inactive layer's parameters to `requires_grad=False`.
The following code works perfectly well in the single-GPU case and squeezes the training just under the GPU memory limit:
```python
# initialization
net = ...
opt1, opt2 = ..., ...

# training
for step, data in enumerate(data_loader):
    # alternate which parameter set is trainable
    if step % 2 == 0:
        for param in net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer2.parameters():
            param.requires_grad = False
            del param.grad  # release the frozen layer's gradient tensor
        opt = opt1
    else:
        for param in net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2

    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
```
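For reference, this is how I verified the saving on a single GPU. A minimal sketch: `report` is a hypothetical little helper of mine, and `net` is the model from the snippet above, assumed to already live on the GPU:

```python
import torch

def report(tag: str) -> None:
    # live tensor memory on the current CUDA device, in MB
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**20:.1f} MB")

report("before freeing layer2 grads")
for param in net.layer2.parameters():
    param.requires_grad = False
    param.grad = None  # same effect as `del param.grad`: the gradient tensor is released
report("after freeing layer2 grads")
```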
Then I try to switch to multi-GPU training with `accelerate` as below, adding the pieces needed to support it. Please note that I re-wrap the model with `accelerator.prepare` after each alternation of the parameter set so that PyTorch's DDP can correctly register and reduce the gradients of the currently trainable parameters.
```python
# initialization
net = ...
net = accelerator.prepare(net)
opt1, opt2 = ..., ...

for step, data in enumerate(data_loader):
    unwrap_net = unwrap(net)  # remove the accelerate (DDP) wrapper
    # alternate which parameter set is trainable
    if step % 2 == 0:
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt1
    else:
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2

    # re-wrap so DDP registers reduction hooks for the new trainable set
    new_net = accelerator.prepare(unwrap_net)
    del net
    del accelerator._models[0]  # drop accelerate's reference to the stale wrapped model
    net = new_net
    torch.cuda.empty_cache()

    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
```
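To narrow down where the extra memory lives, I bracket the re-wrap step with allocator statistics. A minimal sketch: `log_memory` is my own hypothetical helper, and `accelerator` and `unwrap_net` are the objects from the loop above:

```python
import torch

def log_memory(tag: str) -> None:
    # memory_allocated: live tensors; memory_reserved: what the caching allocator holds
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**30:.2f} GB, "
          f"reserved={torch.cuda.memory_reserved() / 2**30:.2f} GB")

log_memory("before re-wrap")
new_net = accelerator.prepare(unwrap_net)  # same call as in the loop above
log_memory("after re-wrap")  # a jump here would point at the DDP re-wrap itself
```

`torch.cuda.memory_summary()` gives a more detailed breakdown if the coarse numbers are not conclusive.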
Here’s the question: I found that alternating the parameter set to optimize occupies more GPU memory than training only `layer1` or `layer2` alone with Accelerate. Although the overhead is just 1-2 GB, I can no longer squeeze the large model onto my GPU. So I am wondering:
- Is there anything I am doing wrong that causes this issue?
- Are there better approaches to alternating parameter sets in a multi-GPU setting?
If you have read this far, my friend, I am truly grateful for your patience and help! May the force of optimization be with you!
Best,
Ziqi