Hello everyone,
I am currently facing a challenge in my work:
I need to obtain the gradients with respect to different data shards separately and then adjust them before the gradient descent step.
However, running autograd on each loss one after another (as in the toy example below) is very slow.
So I would like to compute the loss shards on different GPUs in parallel and obtain their individual gradients.
My current (serial) code is:
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.output_layer(x)
        return x

input_size, hidden_size, output_size = 10, 5, 1
model = SimpleNN(input_size, hidden_size, output_size)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

batch_size = 5
input_data = torch.rand(batch_size, input_size)
losses = model(input_data)

# The pattern I want to accelerate
grads_list = []
for i, loss in enumerate(losses):
    pseudo_grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
    grads_list.append(pseudo_grads)

custom_grads = func(grads_list)  # custom accumulation
for p, grad in zip(model.parameters(), custom_grads):
    p.grad = grad
optimizer.step()
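For reference, the per-sample gradients in the loop above can apparently also be computed in one batched pass on a single device with `torch.func`. The following is only a minimal sketch, assuming PyTorch 2.0 or newer (where `torch.func` is available). It uses an `nn.Sequential` stand-in with the same layers as `SimpleNN`, and `sample_loss` is a hypothetical helper name. It leaves out the custom accumulation step and does not address the multi-GPU part of the question.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, grad, vmap

torch.manual_seed(0)

# Stand-in with the same layers as SimpleNN (assumption: written with nn.Sequential for brevity).
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
input_data = torch.rand(5, 10)

# Detached copies of the parameters, keyed by name, for the functional API.
params = {name: p.detach() for name, p in model.named_parameters()}

def sample_loss(params, x):
    # Hypothetical helper: treats the scalar model output for one sample as its "loss",
    # mirroring `losses = model(input_data)` in the toy example.
    out = functional_call(model, params, (x.unsqueeze(0),))
    return out.squeeze()

# grad differentiates w.r.t. the first argument (params);
# vmap maps over dim 0 of input_data and shares params across samples.
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, 0))(params, input_data)

# per_sample_grads is a dict: parameter name -> tensor of shape (batch_size, *param.shape)
for name, g in per_sample_grads.items():
    print(name, tuple(g.shape))
```

Each entry of `per_sample_grads` stacks the gradients of all samples along dimension 0, so a custom accumulation could operate on these stacked tensors directly instead of on a Python list of tuples.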
I have spent a fair amount of time researching this issue but have not found a workable approach.
Your insights and suggestions would be greatly appreciated.