RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

I am working in a Google Colab session with a HuggingFace DistilBERT model which I have fine-tuned on some data.

I am getting the following error when I try to evaluate a restored copy of my model:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

I run the following piece of code twice: once just after fitting the model, and once after saving and restoring the model.

import torch
from datasets import load_metric

metric = load_metric("accuracy")
model.eval()
for batch in test_dataloader:
    # `device` is the torch.device the model was moved to during training
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

If I run the evaluation straight after training, there is no problem:

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  # Remove the CWD from sys.path while we load stuff.
{'accuracy': 0.6692307692307692}

If I run the above code after saving and restoring the model, I get the error quoted above. The full traceback is:

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:10: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  # Remove the CWD from sys.path while we load stuff.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-102-33bf1579632a> in <module>()
      4     batch = {k: v.to(device) for k, v in batch.items()}
      5     with torch.no_grad():
----> 6         outputs = model(**batch)
      7 
      8     logits = outputs.logits

(8 intermediate stack frames omitted)
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

The steps I take for saving and restoring are as follows (a rough sketch of steps 2 and 3 appears after the step 1 code below):

  1. Write the model to the Colab session’s local disc
  2. Copy from the local disc of the Colab session to Google Drive
  3. Copy back from Google Drive to the Colab session’s local disc
  4. Use the copy on the local disc to load the model

The code for step 1 has been adapted from run_glue.py and is as follows:

import os

# Saving best practices: if you use default names for the model,
# you can reload it using from_pretrained()

output_dir = './a_local_copy/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# logger.info("Saving model checkpoint to %s", args.output_dir)
print("Saving model checkpoint to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))
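
Exactly how the Drive copy is done isn’t shown above; roughly, steps 2 and 3 are plain file copies along these lines (the Drive path is illustrative):

from google.colab import drive
import shutil

# Step 2: mount Google Drive and copy the saved model folder onto it
drive.mount('/content/gdrive')
shutil.copytree('./a_local_copy/', '/content/gdrive/MyDrive/a_drive_copy/')

# Step 3 (typically in a later, fresh session): copy the folder back to local disc
shutil.copytree('/content/gdrive/MyDrive/a_drive_copy/', './a_local_copy/')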

Step 4 is straightforward:

model = AutoModelForSequenceClassification.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

I am happy to post further code if you could give me some guidance as to what would be useful.


I think after you load the model, it is no longer on the GPU. Try:
model = AutoModelForSequenceClassification.from_pretrained(output_dir).to(device)
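
In other words, step 4 becomes roughly this (assuming device is the same torch.device used during training):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# from_pretrained() loads the weights onto the CPU, so move the model
# back to the GPU before evaluating
model = AutoModelForSequenceClassification.from_pretrained(output_dir).to(device)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model.eval()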


Perfect - that fixed it - thank you, Eyup!


Hello, I am new here because I get the same message after installing Stable Diffusion 1.5. I have two GPUs, one from Intel and my NVIDIA card, and apparently the installation does not recognize the correct card. Where can I paste your code above? I’m not a PC professional, nor do I have any programming skills. Thanks for the help; I’ve been working on this for two days…

Hi @ehalit, I am also facing a similar issue when training on 2 or more GPUs. Do I need to change my DemoDataset(Dataset) class? RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Kindly help, please!

Hi, I have met a similar problem when trying to run the image-to-text generation demo of BLIP-2 (blip-2 demo). Because of network problems when downloading, I use an offline copy of the model.

Here is my code:

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# setup device to use
device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
# load sample image
raw_image = Image.open("./demo.jpg")

# loads BLIP-2 pre-trained model
vis_processors = Blip2Processor.from_pretrained("xxx/models/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("xxx/models/blip2-flan-t5-xxl", device_map="auto")

raw_question = "Question: Which city is this?"
inputs = vis_processors(raw_image, raw_question, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(vis_processors.decode(out[0], skip_special_tokens=True))

The error message is:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0!

I have tried the solutions from the related GitHub issues, but I still get the same error.

The hardware I am using is a cluster of eight 3090 GPUs.

Thanks a lot in advance.
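
One thing worth checking (an untested sketch, not a confirmed fix): with device_map="auto" the model is sharded across the GPUs, so the hard-coded .to("cuda") (which means cuda:0) may not match the device that holds the model’s first layers. Sending the inputs to model.device instead targets the shard holding the embedding layer:

# Untested sketch: with a sharded model, send inputs to the device of the
# model's first parameters instead of hard-coding "cuda" (i.e. cuda:0)
inputs = vis_processors(raw_image, raw_question, return_tensors="pt").to(model.device)
out = model.generate(**inputs)
print(vis_processors.decode(out[0], skip_special_tokens=True))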