RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

I am working in a Google Colab session with a Hugging Face DistilBERT model which I have fine-tuned against some data.

I am getting the following error when I try to evaluate a restored copy of my model:-

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

I run the following piece of code TWICE: once just after fitting the model, and then once after saving and restoring the model.

metric = load_metric("accuracy")
model.eval()
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

If I run the evaluation straight after training there is no problem:-

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  # Remove the CWD from sys.path while we load stuff.
{'accuracy': 0.6692307692307692}

If I run the above code after saving and restoring the model then I get the error quoted above, the full traceback for which is:-

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  # Remove the CWD from sys.path while we load stuff.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-102-33bf1579632a> in <module>()
      4     batch = {k: v.to(device) for k, v in batch.items()}
      5     with torch.no_grad():
----> 6         outputs = model(**batch)
      7 
      8     logits = outputs.logits

8 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

The steps I take for saving and restoring are as follows:-

  1. Write the model to the colab session’s local disc
  2. Write from local disc (of the colab session) to Google Drive
  3. Write back from Google Drive to the colab session’s local disc
  4. Use the copy on the local drive to load the model

The code for step 1 has been adapted from that at run_glue.py and is as follows:-

# Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()

output_dir = './a_local_copy/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# logger.info("Saving model checkpoint to %s", args.output_dir)
print("Saving model checkpoint to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))

Step 4 is straightforward:-

model = AutoModelForSequenceClassification.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

I am happy to load further code if you could give me some guidance as to what would be useful.


I think after you load the model, it is no longer on the GPU. Try:
model = AutoModelForSequenceClassification.from_pretrained(output_dir).to(device)
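`from_pretrained()` loads the weights onto the CPU by default, so without the `.to(device)` call the model and the CUDA input batches end up on different devices. You can check where the weights live with something like this (a quick sketch using a tiny stand-in module; the same check works on the real model):

```python
import torch

# A freshly loaded/constructed model starts on the CPU
model = torch.nn.Linear(4, 2)
print(next(model.parameters()).device)   # cpu

# Move every parameter to the chosen device, then check again
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
print(next(model.parameters()).device)   # cuda:0 on a GPU runtime, cpu otherwise
```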


Perfect - that fixed it - thank you Eyup


Hello, I am new here because I get the same message after installing Stable Diffusion 1.5. I have two GPUs, one from Intel and my NVIDIA card. Apparently the installation does not recognize the correct card. Where can I paste your code above? I’m not a PC professional, nor do I have any programming skills. Thanks for the help; I’ve been working on this for two days…