Trainer.train throws RuntimeError: Expected all tensors to be on the same device

I have moved the model to 'cuda' and confirmed that the TrainingArguments object does pick up 'cuda' as its device, but when I try to train, it throws this error. Here is the code:

Set device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.cuda.current_device(), device

OUTPUT → (0, device(type='cuda'))

Create model and push to cuda

model = AutoModelForSequenceClassification.from_pretrained(MODEL_CKPT, num_labels=6).to(device)

Instantiate a TrainingArguments object

batch_size = 64
logging_steps = len(emotion_encoded['train']) // batch_size
epochs = 2
learning_rate = 2e-5
output_dir = MODEL_CKPT + '-finetuned-emotion-sssingh'
args = TrainingArguments(output_dir=output_dir,
                         per_device_train_batch_size=batch_size,
                         per_device_eval_batch_size=batch_size,
                         learning_rate=learning_rate,
                         weight_decay=0.01,
                         num_train_epochs=epochs,
                         evaluation_strategy='epoch',
                         disable_tqdm=False,
                         logging_steps=logging_steps,
                         log_level='error')

args.device
OUTPUT → device(type='cuda', index=0)

Instantiate a Trainer object and train model end-to-end

trainer = Trainer(model=model,
                  tokenizer=tokenizer,
                  args=args,
                  compute_metrics=performance_metric,
                  train_dataset=emotion_encoded['train'],
                  eval_dataset=emotion_encoded['validation'])

Train

trainer.train()

This throws the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

TrainingArguments automatically sets the device to a GPU (cuda:0) if it is available.

You should not manually move your model to the GPU, so remove all the .to() calls. After initializing TrainingArguments, you can verify that the GPU is being used by checking args.device.
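If it helps to see where things actually live before calling trainer.train(), a small diagnostic along these lines may be useful. This is only a sketch: parameter_devices and batch_devices are hypothetical helper names (not part of transformers), and the toy model and batch below are stand-ins for the real ones in the question.

```python
import torch
from torch import nn


def parameter_devices(model: nn.Module) -> set:
    """Return the set of device names holding the model's parameters and buffers."""
    tensors = list(model.parameters()) + list(model.buffers())
    return {str(t.device) for t in tensors}


def batch_devices(batch: dict) -> dict:
    """Map each tensor-valued key of a batch to the name of the device it lives on."""
    return {key: str(value.device)
            for key, value in batch.items()
            if isinstance(value, torch.Tensor)}


# Toy stand-ins (assumptions for illustration only); nothing is moved to a GPU here.
toy_model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 2))
toy_batch = {"input_ids": torch.tensor([[1, 2, 3]]), "labels": torch.tensor([0])}

print(parameter_devices(toy_model))
print(batch_devices(toy_batch))
```

With a real Trainer, parameter_devices(trainer.model) reporting more than one device would mean the model itself is split across devices. Batches taken straight from the dataloader are normally still on the CPU, because the Trainer moves them to args.device itself during the training step.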

If that is correct, and you are still experiencing an issue, it is possible that your custom function performance_metric does something with mixed tensors. In that case, please post the full error trace and the custom function.
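For reference, the Trainer hands compute_metrics an object whose predictions and label_ids are NumPy arrays, so a metric function can stay entirely in NumPy and never touch a device. The original performance_metric was not posted, so the version below is only a hypothetical stand-in that computes accuracy; the fake_pred object is likewise an assumption used to exercise it without a Trainer.

```python
from types import SimpleNamespace

import numpy as np


def performance_metric(pred):
    """Compute accuracy from logits and integer labels using NumPy only."""
    labels = np.asarray(pred.label_ids)
    # Predicted class = index of the largest logit along the last axis.
    preds = np.asarray(pred.predictions).argmax(axis=-1)
    accuracy = float((preds == labels).mean())
    return {"accuracy": accuracy}


# Hypothetical stand-in for what the Trainer would pass in: 3 examples, 2 classes.
fake_pred = SimpleNamespace(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
    label_ids=np.array([1, 0, 0]),
)

print(performance_metric(fake_pred))
```

A function written this way cannot be the source of a cuda:0 versus cpu mismatch, which makes it a convenient baseline to swap in while narrowing down the problem.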


Hi Bram,

I get the same error message from trainer.train() when I run the code below in Colab with GPU as the runtime type. However, when I use CPU as the runtime type, it runs successfully without any errors. Do you have any advice on how I can run it on a GPU without getting the "Expected all tensors to be on the same device" error?

Colab Code

This may be the same issue as discussed here.

Thank you very much. The solution proposed in the link you provided worked perfectly.
