How to obtain gradients on different GPUs to do custom accumulations

Hello everyone

I am currently facing a challenge in my work:
I need to obtain the gradients with respect to different data shards and adjust them before performing gradient descent.

However, running autograd on the different losses serially (as in the toy example below) is very slow.
So I want to compute the loss shards on different GPUs in parallel and obtain their individual gradients.

The current (serial) code is:

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.output_layer(x)
        return x


input_size, hidden_size, output_size = 10, 5, 1
model = SimpleNN(input_size, hidden_size, output_size)
criterion = nn.MSELoss(reduction="none")  # keep one loss per sample instead of averaging
optimizer = optim.SGD(model.parameters(), lr=0.01)

batch_size = 5
input_data = torch.rand(batch_size, input_size)
target_data = torch.rand(batch_size, output_size)
losses = criterion(model(input_data), target_data).flatten()  # one scalar loss per sample

# The pattern I want to accelerate
grads_list = []
for loss in losses:
    pseudo_grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
    grads_list.append(pseudo_grads)

custom_grads = func(grads_list)  # func: my custom accumulation (placeholder)
for p, grad in zip(model.parameters(), custom_grads):
    p.grad = grad

optimizer.step()
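
What I have in mind is roughly the sketch below: replicate the model onto each visible GPU, send one data shard to each replica, and collect the per-shard gradients for the custom accumulation. It reuses the toy model, criterion, and data from the code above and assumes at least two GPUs are visible; I am not sure whether the shards actually run in parallel this way, which is part of my question.

import copy

# One replica of the toy model per visible GPU, all starting from the same weights.
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
replicas = [copy.deepcopy(model).to(d) for d in devices]

# Split the batch into one shard per device.
input_shards = torch.chunk(input_data, len(devices))
target_shards = torch.chunk(target_data, len(devices))

# Launch one forward + backward per device; the CUDA kernels are queued
# asynchronously, so in principle the devices could work concurrently.
per_shard_grads = []
for replica, d, x, y in zip(replicas, devices, input_shards, target_shards):
    shard_loss = criterion(replica(x.to(d)), y.to(d)).mean()
    per_shard_grads.append(torch.autograd.grad(shard_loss, replica.parameters()))

# Move everything to one device for the custom accumulation
# (these copies are what finally synchronize the devices).
grads_list = [[g.to(devices[0]) for g in shard] for shard in per_shard_grads]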

I have spent a considerable amount of time researching this issue but have not found a workable scheme.
Your insights and suggestions would be greatly appreciated.