Environment info
- adapter-transformers version: 3.2.1
- transformers version: 4.26.1
- Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.5
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Details
Hello! I have trained a model with adapter-hub and saved the checkpoints. However, when I try to resume training, I get the following error:
resume_from_checkpoint='./checkpoint-5500/'
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
I checked, and both the inputs and the model are correctly on cuda.
The traceback is the following:
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/transformers/trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/transformers/trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/rodelc/TemBERTure_Tm_regression/CONFIG4:WEIGHT_DECAY_0.2/code/train.py", line 31, in compute_loss
    outputs = model(**inputs)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/transformers/adapters/models/bert/adapter_model.py", line 85, in forward
    head_outputs = self.forward_head(
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/transformers/adapters/heads/base.py", line 833, in forward_head
    return_output = head_module(all_outputs, cls_output, attention_mask, return_dict, **kwargs)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/transformers/adapters/heads/base.py", line 143, in forward
    logits = super().forward(cls_output)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/torch/nn/modules/container.py", line 20…, in forward
    input = module(input)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/rodelc/anaconda3/envs/temBERTure_datavis/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
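For reference, here is a small sketch of a fuller device check on the loaded model (plain PyTorch introspection, not part of my actual training script, so treat it as illustrative):

from collections import defaultdict

def tensors_by_device(model):
    # Group every parameter and buffer of the model by the device it lives on,
    # to spot anything that stayed on the CPU after loading the checkpoint.
    devices = defaultdict(list)
    for name, param in model.named_parameters():
        devices[str(param.device)].append(name)
    for name, buf in model.named_buffers():
        devices[str(buf.device)].append(name)
    return devices

for device, names in tensors_by_device(model).items():
    print(device, len(names), "tensors, e.g.", names[:3])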
and my code is this:
import torch
from transformers import TrainingArguments, EarlyStoppingCallback, AdapterTrainer


class RegressionTrainer(AdapterTrainer):
    '''
    Custom AdapterTrainer whose compute_loss:
    1. Extracts the "labels" from the inputs dictionary using pop, i.e. each batch contains a "labels" key with the ground-truth values for the regression task.
    2. Passes the remaining inputs to the model to obtain the model's outputs.
    3. Extracts the logits, taking the first element of each sample ([:, 0]).
    4. Computes the mean squared error (MSE) between the logits and the labels with torch.nn.functional.mse_loss.
    The method returns either the computed loss alone (loss) or a tuple (loss, outputs), depending on the return_outputs flag.
    '''

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Sanity checks: both report being on the GPU.
        inputs_on_gpu = all(tensor.is_cuda for tensor in inputs.values())
        print("Are inputs on GPU?", inputs_on_gpu)
        model_on_gpu = next(model.parameters()).is_cuda
        print("Is model on GPU?", model_on_gpu)
        outputs = model(**inputs)
        logits = outputs[0][:, 0]
        loss = torch.nn.functional.mse_loss(logits, labels)
        return (loss, outputs) if return_outputs else loss


training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,  # + '/' + 'weight' + str(WEIGHT_DECAY) + '_lr' + str(LEARNING_RATE),
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    metric_for_best_model="loss",
    load_best_model_at_end=True,
    weight_decay=WEIGHT_DECAY,
    # eval_accumulation_steps=50,
    fp16=True,
    report_to='wandb',
    save_on_each_node=True,
    greater_is_better=False,
    seed=42,
    # callback = [EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=(-3))]
)

trainer = RegressionTrainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    compute_metrics=compute_metrics_for_regression,
)
# Register early stopping on the trainer.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=-3))

model.train_adapter(['TemBERTure_adapter'])

# Keep only the dict part of the default collator's output.
old_collator = trainer.data_collator
trainer.data_collator = lambda data: dict(old_collator(data))

trainer.train(resume_from_checkpoint=True)
I have tried everything I can think of. Any ideas on how to resume the training successfully? Thanks a lot!