Hi Guys,
My question and its context might be a bit long, but I think it will help a lot of people trying to train large models with limited GPU memory under a DDP setup. After trying a lot of things (flash-attention, fp16, etc.), I am now trying to leverage the fact that setting `requires_grad=False` saves GPU memory.
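To make that premise concrete, here is a minimal, self-contained sketch (the toy layer sizes are made up for illustration) showing that frozen parameters never get a `.grad` buffer allocated during backward:

```python
import torch
import torch.nn as nn

# a toy two-layer net; the sizes are made up for illustration
toy_net = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

# freeze the second layer
for p in toy_net[1].parameters():
    p.requires_grad = False

loss = toy_net(torch.randn(8, 1024)).sum()
loss.backward()

print(toy_net[0].weight.grad is not None)  # True: a gradient buffer was allocated
print(toy_net[1].weight.grad is None)      # True: no gradient memory for the frozen layer
```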
The most intuitive description of what I want is:
- I have a neural network with two layers, `layer1` and `layer2`.
- I use two separate optimizers: `opt1` for `layer1` and `opt2` for `layer2`.
- During training, I alternate between the two optimizers and set the inactive layer's parameters to `requires_grad=False`.
The following code works perfectly well in the single-GPU case and squeezes the training just under the GPU memory limit:
```python
# initialization
net = ...
opt1, opt2 = ..., ...

# training
for step, data in enumerate(data_loader):
    # alternate which parameter set is trainable
    if step % 2 == 0:
        for param in net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer2.parameters():
            param.requires_grad = False
            del param.grad  # release the frozen layer's gradient tensor
        opt = opt1
    else:
        for param in net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2

    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
```
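For reference, this is how I verified the saving on a single GPU. A minimal sketch: `report` is a hypothetical little helper of mine, and `net` is the model from the snippet above, assumed to already live on the GPU:

```python
import torch

def report(tag: str) -> None:
    # live tensor memory on the current CUDA device, in MB
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**20:.1f} MB")

report("before freeing layer2 grads")
for param in net.layer2.parameters():
    param.requires_grad = False
    param.grad = None  # same effect as `del param.grad`: the gradient tensor is released
report("after freeing layer2 grads")
```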
Then I try to switch to multi-GPU training with `accelerate` as below, adding the pieces needed to support it. Please note that I re-wrap the model with `accelerator.prepare` after each alternation of the parameter set so that PyTorch's DDP can correctly register and reduce the gradients of the currently trainable parameters.
```python
# initialization
net = ...
net = accelerator.prepare(net)
opt1, opt2 = ..., ...

for step, data in enumerate(data_loader):
    unwrap_net = unwrap(net)  # remove the accelerate (DDP) wrapper
    # alternate which parameter set is trainable
    if step % 2 == 0:
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt1
    else:
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2

    # re-wrap so DDP registers reduction hooks for the new trainable set
    new_net = accelerator.prepare(unwrap_net)
    del net
    del accelerator._models[0]  # drop accelerate's reference to the stale wrapped model
    net = new_net
    torch.cuda.empty_cache()

    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
```
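To narrow down where the extra memory lives, I bracket the re-wrap step with allocator statistics. A minimal sketch: `log_memory` is my own hypothetical helper, and `accelerator` and `unwrap_net` are the objects from the loop above:

```python
import torch

def log_memory(tag: str) -> None:
    # memory_allocated: live tensors; memory_reserved: what the caching allocator holds
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**30:.2f} GB, "
          f"reserved={torch.cuda.memory_reserved() / 2**30:.2f} GB")

log_memory("before re-wrap")
new_net = accelerator.prepare(unwrap_net)  # same call as in the loop above
log_memory("after re-wrap")  # a jump here would point at the DDP re-wrap itself
```

`torch.cuda.memory_summary()` gives a more detailed breakdown if the coarse numbers are not conclusive.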
Here’s the question: I found that alternating the parameter set to optimize occupies more GPU memory than training only `layer1` or `layer2` alone with Accelerate. Although the overhead is just 1-2 GB, I can no longer squeeze the large model onto my GPU. So I am wondering:
- Is there anything I am doing wrong that causes this issue?
- Are there better approaches to alternating parameter sets in a multi-GPU setting?
If you have read this far, my friend, I am truly grateful for your patience and help! May the force of optimization be with you!
Best,
Ziqi