Fine tune "meta-llama/Llama-2-7b-hf" Bug:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)

Hello, I have a problem when I fine-tune "meta-llama/Llama-2-7b-hf" for a classification task.

from datasets import load_dataset
from transformers import DataCollatorWithPadding
from transformers import LlamaTokenizer, LlamaForSequenceClassification
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        print(labels.device)
        print(model.device)
        print(labels)
        print(inputs)
        outputs = model(**inputs)
        print(outputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss


def compute_metrics(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
 return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}


MAX_LEN = 512
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

dataset = load_dataset("mehdiiraqui/twitter_disaster")
data = dataset["train"].train_test_split(train_size=0.8, seed=42)
data["val"] = data.pop("test")
data["test"] = dataset["test"]

col_to_delete = ["id", "keyword", "location", "text"]  # Remove the undesired columns
pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])

llama_tokenizer = LlamaTokenizer.from_pretrained(llama_checkpoint)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token


def llama_preprocess_function(examples):
    return llama_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LEN)


llama_tokenized_datasets = data.map(llama_preprocess_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

llama_model = LlamaForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=llama_checkpoint,
  num_labels=2,
  device_map="auto",
  offload_folder="offload",
  trust_remote_code=True,
  torch_dtype=torch.float16,
)

llama_model.config.pad_token_id = llama_model.config.eos_token_id

llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=4, lora_alpha=16, lora_dropout=0.5, bias="none",
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()


llama_model.cuda()

lr = 1e-4
batch_size = 2
num_epochs = 3
training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
    fp16=True,
    gradient_checkpointing=True,
)

llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics
)


llama_trainer.train()

I got the RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward). Can anyone give me some hints to solve the problem?


device_map="auto",

I think you could fix this by changing that setting, but then you'd run out of VRAM…
Maybe you could do something like manually setting up DDP, or quantizing the model so it fits on one GPU (a rough sketch below).
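
If you go the quantization route, something like this might work (just a rough sketch, untested on your setup; it assumes bitsandbytes is installed):

from transformers import BitsAndBytesConfig, LlamaForSequenceClassification
import torch

# Sketch: load the model 4-bit quantized so it fits on a single 3090,
# avoiding the cross-GPU split that device_map="auto" creates.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

llama_model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,
    quantization_config=bnb_config,
    device_map={"": 0},  # keep every module on GPU 0
)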

I want to use multi-GPU for the fine-tuning; I have two 3090s. It seems device_map="auto" already loads the model across different GPUs, but I still cannot figure out why I get the bug. Usually, we need to put the data and the model on the same device. But when we call the Trainer here, does it automatically put the data on a different GPU?


Thanks. I think device_map="auto" may be the key point. I am trying to solve the problem and will let you know if I figure it out.


I am totally confused. I followed the code from here. I notice that their hardware is a single A6000 with 48 GB of VRAM; I want to run the same code in parallel on two 3090s.

My updated code is here:

from datasets import load_dataset
from transformers import DataCollatorWithPadding
from transformers import LlamaTokenizer, LlamaForSequenceClassification
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import Trainer, TrainingArguments
import evaluate
import numpy as np

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        print(labels.device)
        print(model.device)
        print(labels)
        print(inputs)
        outputs = model(**inputs)
        print(outputs)
        logits = outputs.get("logits")
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss


def compute_metrics(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}


MAX_LEN = 512
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

dataset = load_dataset("mehdiiraqui/twitter_disaster")
data = dataset["train"].train_test_split(train_size=0.8, seed=42)
data["val"] = data.pop("test")
data["test"] = dataset["test"]

col_to_delete = ["id", "keyword", "location", "text"]  # Remove the undesired columns
pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])

llama_tokenizer = LlamaTokenizer.from_pretrained(llama_checkpoint)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token


def llama_preprocess_function(examples):
    return llama_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LEN)


llama_tokenized_datasets = data.map(llama_preprocess_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

llama_model = LlamaForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=llama_checkpoint,
  num_labels=2,
  device_map="auto",
  offload_folder="offload",
  trust_remote_code=True,
  torch_dtype=torch.float16,
)

llama_model.config.pad_token_id = llama_model.config.eos_token_id

llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=4, lora_alpha=16, lora_dropout=0.5, bias="none",
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()


llama_model.cuda()

lr = 1e-4
batch_size = 2
num_epochs = 3
training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
    fp16=True,
    gradient_checkpointing=True,
)

llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics
)


llama_trainer.train()

Specifically, I print out the devices of the model, the inputs, and the labels.


The code stops at outputs = model(**inputs). In the first batch, all the tensors are on cuda:0, i.e., on the same device. Can anyone give me some ideas to solve the problem? I also found that even if I use only one GPU, CUDA runs out of memory.


For now, we should not trust the sample code; depending on the library version, it often doesn't work.
If you allocate the GPUs manually, there is a good chance it will work. I'll look for a way to do it.
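
Something in this direction, maybe (a rough sketch, untested; the 16/16 layer split and the module names are my assumptions):

from transformers import LlamaForSequenceClassification
import torch

# Sketch: spell out the placement yourself instead of device_map="auto",
# keeping the final norm and the "score" classification head together on GPU 1.
device_map = {"model.embed_tokens": 0, "model.norm": 1, "score": 1}
device_map.update({f"model.layers.{i}": 0 if i < 16 else 1 for i in range(32)})

llama_model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,
    device_map=device_map,  # explicit placement instead of "auto"
    torch_dtype=torch.float16,
)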


I tried to manually split the model into two parts and load it onto the two GPUs like this:


But it doesn't work; the error is slightly different: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!


Try this. Use accelerate without "auto".

Edit:
Also, I suspect the data to be processed is not being loaded onto the corresponding GPU. Outside the Trainer you would place it yourself, but since this is the Trainer, there is no place to specify that in the first place?

Do I need to define my own trainer and dataloader instead of using the Trainer provided by Hugging Face? That way, I could ensure the data is loaded onto the right GPU.


I think there is a high possibility that it can be fixed by changing the accelerate (device_map=) settings; it just doesn't seem to work with "auto". I think it's a bug in the broad sense, but it's too vague to raise an issue…
If it's really impossible, you can assign devices manually using torch's DDP (rough sketch below).
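
A rough sketch of the DDP route (my assumption is that the fp16 + LoRA model fits on a single 3090 per process; launch the script with torchrun --nproc_per_node=2):

import os
import torch
from transformers import LlamaForSequenceClassification

# Sketch: one full model replica per process, pinned to that process's GPU.
# Trainer wraps it in DistributedDataParallel automatically when launched with torchrun.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

llama_model = LlamaForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=2,
    torch_dtype=torch.float16,
    device_map={"": local_rank},  # no cross-GPU sharding
)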


GOOOOOOD NEWS. I have already figured it out and solved this problem.

The problem is in this class:

It is LlamaForSequenceClassification in modeling_llama.py; you need to find the file in your local transformers installation.


I found that the output logits are somehow moved to a different CUDA device from the labels, which makes the loss computation fail. So I added labels = labels.to(logits.device) to put the labels and logits on the same device. Now everything works fine.
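
(If you'd rather not edit the library file, the same one-line fix can also live in the custom trainer; a sketch based on my compute_loss above:)

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        labels = labels.to(logits.device)  # move labels to whatever device the logits ended up on
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss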


I'm glad it's been solved!
But what made the tensor move between the CUDA devices…? Is it this?
Or is there something lurking in the Trainer library?

llama_model.print_trainable_parameters()


llama_model.cuda() # this?

lr = 1e-4

I guess the problem might be the output of the model. Since the model is distributed across two GPUs, some layers are on cuda:0 and others on cuda:1. Assuming the labels are on cuda:0, if the output layers of Llama-2 are on cuda:1, the logits end up on a different device from the labels, which leads to this issue.
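
If you want to see the split that device_map="auto" actually chose, accelerate records it on the model (the example output below is only illustrative):

# Inspect where each module was placed by device_map="auto"
print(llama_model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.norm': 1, 'score': 1}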


I see, that seems likely. In this scenario, the workaround of moving the relatively small labels makes sense.
Is this a general weakness in how the model classes handle multi-GPU…?
I may raise an issue on GitHub after checking whether this is limited to Llama or not.

Currently, I am fine-tuning Llama-2 for a classification task on two 3090s. I think using multi-GPU allows a larger batch size and a larger LoRA rank, which should lead to more stable and precise results. I can run some tests over the next few days.

