Fine tune "meta-llama/Llama-2-7b-hf" Bug:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)

Hello, I have a problem when I fine-tune "meta-llama/Llama-2-7b-hf" for a classification task.

from datasets import load_dataset
from transformers import DataCollatorWithPadding
from transformers import LlamaTokenizer, LlamaForSequenceClassification
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        print(labels.device)
        print(model.device)
        print(labels)
        print(inputs)
        outputs = model(**inputs)
        print(outputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss


def compute_metrics(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
 return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}


MAX_LEN = 512
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

dataset = load_dataset("mehdiiraqui/twitter_disaster")
data = dataset["train"].train_test_split(train_size=0.8, seed=42)
data["val"] = data.pop("test")
data["test"] = dataset["test"]

col_to_delete = ["id", "keyword", "location", "text"]  # Remove the undesired columns
pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])

llama_tokenizer = LlamaTokenizer.from_pretrained(llama_checkpoint)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token


def llama_preprocess_function(examples):
    return llama_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LEN)


llama_tokenized_datasets = data.map(llama_preprocess_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

llama_model = LlamaForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=llama_checkpoint,
  num_labels=2,
  device_map="auto",
  offload_folder="offload",
  trust_remote_code=True,
  torch_dtype=torch.float16,
)

llama_model.config.pad_token_id = llama_model.config.eos_token_id

llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=4, lora_alpha=16, lora_dropout=0.5, bias="none",
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()


llama_model.cuda()

lr = 1e-4
batch_size = 2
num_epochs = 3
training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
    fp16=True,
    gradient_checkpointing=True,
)

llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics
)


llama_trainer.train()

I got the RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward). Can anyone give me some hints to solve the problem?


device_map="auto",

I think you could fix this by changing that setting, but then you'd run out of VRAM…
Maybe you could do something like manually setting up DDP, or quantizing the model so it fits on one GPU (a rough sketch below).
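
If you go the quantization route, something like this might work (just a rough sketch, untested on your setup; it assumes bitsandbytes is installed):

from transformers import BitsAndBytesConfig, LlamaForSequenceClassification
import torch

# Sketch: load the model 4-bit quantized so it fits on a single 3090,
# avoiding the cross-GPU split that device_map="auto" creates.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

llama_model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,
    quantization_config=bnb_config,
    device_map={"": 0},  # keep every module on GPU 0
)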

I want to use multi-GPU for the fine-tuning; I have two 3090s. It seems device_map="auto" already loads the model across different GPUs, but I still cannot figure out why I get the bug. Usually, we need to put the data and the model on the same device. But when we call the Trainer here, does it automatically put the data on a different GPU?


Thanks. I think device_map="auto" may be the key point. I am trying to solve the problem and will let you know if I figure it out.


I am totally confused. I followed the code from here. I notice that their hardware is a single A6000 with 48 GB of VRAM; I want to run the same code in parallel on two 3090s.

My updated code is here:

from datasets import load_dataset
from transformers import DataCollatorWithPadding
from transformers import LlamaTokenizer, LlamaForSequenceClassification
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        print(labels.device)
        print(model.device)
        print(labels)
        print(inputs)
        outputs = model(**inputs)
        print(outputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss


def compute_metrics(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}


MAX_LEN = 512
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

dataset = load_dataset("mehdiiraqui/twitter_disaster")
data = dataset["train"].train_test_split(train_size=0.8, seed=42)
data["val"] = data.pop("test")
data["test"] = dataset["test"]

col_to_delete = ["id", "keyword", "location", "text"]  # Remove the undesired columns
pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])

llama_tokenizer = LlamaTokenizer.from_pretrained(llama_checkpoint)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token


def llama_preprocess_function(examples):
    return llama_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LEN)


llama_tokenized_datasets = data.map(llama_preprocess_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

llama_model = LlamaForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=llama_checkpoint,
  num_labels=2,
  device_map="auto",
  offload_folder="offload",
  trust_remote_code=True,
  torch_dtype=torch.float16,
)

llama_model.config.pad_token_id = llama_model.config.eos_token_id

llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=4, lora_alpha=16, lora_dropout=0.5, bias="none",
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()


llama_model.cuda()

lr = 1e-4
batch_size = 2
num_epochs = 3
training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
    fp16=True,
    gradient_checkpointing=True,
)

llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics
)


llama_trainer.train()

Specifically, I print out the devices of the model, the inputs, and the labels.


The code stops at outputs = model(**inputs). In the first batch, all the tensors are on cuda:0, i.e., on the same device. Can anyone give me some ideas to solve the problem? I also found that even if I use only one GPU, CUDA runs out of memory.


For now, we should not trust the sample code; depending on the library version, it often doesn't work.
If you allocate the GPUs manually, there is a good chance it will work. I'll look for a way to do it.
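
Something in this direction, maybe (a rough sketch, untested; the 16/16 layer split and the module names are my assumptions):

from transformers import LlamaForSequenceClassification
import torch

# Sketch: spell out the placement yourself instead of device_map="auto",
# keeping the final norm and the "score" classification head together on GPU 1.
device_map = {"model.embed_tokens": 0, "model.norm": 1, "score": 1}
device_map.update({f"model.layers.{i}": 0 if i < 16 else 1 for i in range(32)})

llama_model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,
    device_map=device_map,  # explicit placement instead of "auto"
    torch_dtype=torch.float16,
)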


I tried to manually split the model into two parts and load it onto the two GPUs like this:


But it doesn't work; the error is slightly different: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!


Try this. Use accelerate without "auto".

Edit:
Also, I suspect the data to be processed is not being loaded onto the corresponding GPU. Outside the Trainer you would place it yourself, but since this is the Trainer, there is no place to specify that in the first place?

Do I need to define my own trainer and dataloader instead of using the Trainer provided by Hugging Face? That way, I could ensure the data is loaded onto the right GPU.


I think there is a high possibility that it can be fixed by changing the accelerate (device_map=) settings; it just doesn't seem to work with "auto". I think it's a bug in the broad sense, but it's too vague to raise an issue…
If it's really impossible, you can assign devices manually using torch's DDP (rough sketch below).
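
A rough sketch of the DDP route (my assumption is that the fp16 + LoRA model fits on a single 3090 per process; launch the script with torchrun --nproc_per_node=2):

import os
import torch
from transformers import LlamaForSequenceClassification

# Sketch: one full model replica per process, pinned to that process's GPU.
# Trainer wraps it in DistributedDataParallel automatically when launched with torchrun.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

llama_model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,
    torch_dtype=torch.float16,
    device_map={"": local_rank},  # no cross-GPU sharding
)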


GOOOOOOD NEWS. I have already figured it out and solved this problem.

The problem is in this class:

It is LlamaForSequenceClassification in modeling_llama.py; you need to find the file in your local transformers installation.


I found that the output logits are somehow moved to a different CUDA device from the labels, which makes the loss computation fail. So I added labels = labels.to(logits.device) to put the labels and logits on the same device. Now everything works fine.
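
(If you'd rather not edit the library file, the same one-line fix can also live in the custom trainer; a sketch based on my compute_loss above:)

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        labels = labels.to(logits.device)  # move labels to whatever device the logits ended up on
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss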


I'm glad it's been solved!
But what made the tensor move between the CUDA devices…? Is it this?
Or is there something lurking in the Trainer library?

llama_model.print_trainable_parameters()


llama_model.cuda() # this?

lr = 1e-4

I guess the problem might be the output of the model. Since the model is distributed across two GPUs, some layers are on cuda:0 and others on cuda:1. Assuming the labels are on cuda:0, if the output layers of Llama-2 are on cuda:1, the logits end up on a different device from the labels, which leads to this issue.
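
If you want to see the split that device_map="auto" actually chose, accelerate records it on the model (the example output below is only illustrative):

# Inspect where each module was placed by device_map="auto"
print(llama_model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.norm': 1, 'score': 1}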


I see, that seems likely. In this scenario, the workaround of moving the relatively small labels makes sense.
Is this a general weakness in how the model classes handle multi-GPU…?
I may raise an issue on GitHub after checking whether this is limited to Llama or not.

Currently, I am fine-tuning Llama-2 for a classification task on two 3090s. I think using multi-GPU allows a larger batch size and a larger LoRA rank, which should lead to more stable and precise results. I can run some tests over the next few days.

