I’m new to “accelerate” and am trying to port some (working) code to support multi-GPU training. The full code is too lengthy to include, but I believe this is the relevant excerpt:
    from accelerate import Accelerator

    device = Accelerator.device
    accelerator = Accelerator()

    # build model, choose optimizer and scheduler, build dataloaders, etc etc
    # ...

    # Specify a tensor needed by the loss function
    class_weights = torch.tensor(np.array([.1, .2, .3]), dtype=torch.float)

    # Put everything onto appropriate GPU (?)
    (class_weights, model, optimizer, scheduler,
     dataloaders["train"], dataloaders["test"], dataloaders["valid"] ) \
        = accelerator.prepare(class_weights, model, optimizer, scheduler,
                              dataloaders["train"], dataloaders["test"], dataloaders["valid"])

    # Define training loop
    def train(model, optimizer, scheduler, weight):
        criterion = nn.CrossEntropyLoss(weight=weight)
        for epoch in range(10):
            model.train()
            with torch.set_grad_enabled(True):
                for bi, (inputs, labels) in enumerate(dataloaders["train"]):
                    optimizer.zero_grad()
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)

    # Actually train
    train(model, optimizer, scheduler, class_weights)
The code raises an exception when it gets to the loss = criterion... line:
    ...
    File ~/Mirabolic/fezzik/raceblind/venv/lib/python3.10/site-packages/torch/nn/functional.py:3029, in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
       3027 if size_average is not None or reduce is not None:
       3028     reduction = _Reduction.legacy_get_string(size_average, reduce)
    -> 3029 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument weight in method wrapper_CUDA_nll_loss_forward)
I initially assumed that either outputs or labels was somehow still on the CPU, but if I examine the variables right before the call to criterion, they both seem to be on the (same) GPU:
    In : labels.is_cuda
    Out: True
    In : labels.get_device()
    Out: 0
    In : outputs.is_cuda
    Out: True
    In : outputs.get_device()
    Out: 0
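The check above only covers outputs and labels, while the error message names the weight argument. A minimal sketch of a helper that reports the device of every tensor involved in the loss call might look like the following. The function names devices_by_name and find_device_mismatch are hypothetical (they are not part of PyTorch or accelerate), and the helper only assumes that each object exposes a .device attribute, as PyTorch tensors do.

```python
def devices_by_name(**tensors):
    """Map each keyword name to the string form of that object's device.

    Hypothetical helper: anything without a .device attribute is
    reported as "n/a" instead of raising.
    """
    return {name: str(getattr(t, "device", "n/a")) for name, t in tensors.items()}


def find_device_mismatch(**tensors):
    """Return the per-name devices if they differ, or an empty dict if all agree."""
    found = devices_by_name(**tensors)
    return found if len(set(found.values())) > 1 else {}
```

Called right before the failing line as find_device_mismatch(outputs=outputs, labels=labels, weight=criterion.weight), this would list any tensor that sits on a different device from the rest. If the weight tensor turns out to be on the CPU, moving it explicitly with class_weights.to(accelerator.device) before building the criterion is one thing to try; whether that is the right fix for this setup is an assumption, not something confirmed here.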
I’m not sure how to proceed and would be very grateful for any suggestions.
FWIW, I’m using PyTorch 2.0.1+cu117 and accelerate 0.23.0; the system has two V100 GPUs.