RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

The code is below. It runs on 1 GPU but fails on 2 or more GPUs.

from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer, AutoModelForCausalLM
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
from torch.utils.data import TensorDataset, DataLoader, Dataset
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("dolly-v2-3b")
model = AutoModelForCausalLM.from_pretrained("dolly-v2-3b",load_in_8bit=True,device_map='auto') 

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=50,
    prompt_tuning_init_text="Answer the question as truthfully as possible",
    tokenizer_name_or_path="dolly-v2-3b"
)

model = get_peft_model(model, peft_config)
model.to(device)

train_data = [
    {
        "context": "How to Link Credit Card to ICICI Bank Account Step 1: ",
        "question": "How to add card?",
        "answer": "Relevant. To add your card you can follow these steps: Step 1: "
    },
    {
        "context": "The python programming language is ",
        "question": "What is Python used for?",
        "answer": "Relevant. Python is used in many different fields in"
    }
]

def preprocess_function(examples):  
    tokenized_examples = tokenizer(
        examples["context"],
        examples["question"],
        truncation=True,
        max_length=2048,
        padding="max_length"
    )
    tokenized_examples['labels']=tokenizer(
        examples["answer"],
        truncation=True,
        max_length=2048,
        padding="max_length",
        return_tensors="pt")['input_ids'][0]
    
    return tokenized_examples

tokenized_train_data = [preprocess_function(example) for example in train_data]

class DemoDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        sample = self.data[idx]      
        item = {k: torch.tensor(v) for k, v in sample.items()}
        return item

dataset = DemoDataset(tokenized_train_data)

training_args = TrainingArguments(
        output_dir="results2",
        learning_rate=1e-5,
        per_device_train_batch_size=2,
        num_train_epochs=2,
        weight_decay=0.01,
        logging_steps=1000,
        save_strategy="epoch",
        logging_dir="logs2"
    )
trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=dataset,
      # data_collator=data_collator,
      tokenizer=tokenizer
  )
trainer.train()

ERROR

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.8/site-packages/peft/peft_model.py", line 723, in forward
    inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
(when checking argument for argument tensors in method wrapper_CUDA_cat)

@ehalit

1 Like

I believe adding .to("cuda") should work, like this:

class DemoDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        sample = self.data[idx]      
        item = {k: torch.tensor(v).to("cuda") for k, v in sample.items()}
        return item

You need not specify which device you put the tensors on before runtime.
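
For example, a quick check that the samples indeed come out on a GPU (a small sketch using the DemoDataset above; expect cuda:0 for every tensor):

dataset = DemoDataset(tokenized_train_data)
sample = dataset[0]
# Each value should now report a CUDA device rather than the CPU.
print({k: v.device for k, v in sample.items()})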

2 Likes

Hi @ehalit, still getting the error.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.8/site-packages/peft/peft_model.py", line 723, in forward
    inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
(when checking argument for argument tensors in method wrapper_CUDA_cat)

Multi-GPU setups are tricky :sweat_smile: I couldn't get it to work yet, but I think I spotted the problem.

First, I did a manual placement of layers on the GPUs with this:

from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

The names of the layers can be found by printing the model:

>>> print(model)
PeftModelForCausalLM(
  (base_model): GPTNeoXForCausalLM(
    (gpt_neox): GPTNeoXModel(
      (embed_in): Embedding(50280, 2560)
      (layers): ModuleList(
        (0-31): 32 x GPTNeoXLayer(
          (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (attention): GPTNeoXAttention(
            (rotary_emb): RotaryEmbedding()
            (query_key_value): Linear8bitLt(in_features=2560, out_features=7680, bias=True)
            (dense): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
          )
          (mlp): GPTNeoXMLP(
            (dense_h_to_4h): Linear8bitLt(in_features=2560, out_features=10240, bias=True)
            (dense_4h_to_h): Linear8bitLt(in_features=10240, out_features=2560, bias=True)
            (act): GELUActivation()
          )
        )
      )
      (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
    )
    (embed_out): Linear(in_features=2560, out_features=50280, bias=False)
  )
  (word_embeddings): Embedding(50280, 2560)
  (prompt_encoder): PromptEmbedding(
    (embedding): Embedding(50, 2560)
  )
)

Then, when I inspect the devices of the parameters with,

for i in model.named_parameters():
    print(f"{i[0]} -> {i[1].device}")

I get the following device placement (on a 4-GPU setup):

...
base_model.gpt_neox.layers.3.mlp.dense_h_to_4h.weight -> cuda:0
base_model.gpt_neox.layers.3.mlp.dense_h_to_4h.bias -> cuda:0
base_model.gpt_neox.layers.3.mlp.dense_4h_to_h.weight -> cuda:0
base_model.gpt_neox.layers.3.mlp.dense_4h_to_h.bias -> cuda:0
base_model.gpt_neox.layers.4.input_layernorm.weight -> cuda:1
base_model.gpt_neox.layers.4.input_layernorm.bias -> cuda:1
base_model.gpt_neox.layers.4.post_attention_layernorm.weight -> cuda:1
base_model.gpt_neox.layers.4.post_attention_layernorm.bias -> cuda:1
...

Notice how the input layer normalization of the next block is on a different device. The error message I receive also points to the layer normalization:

/home/sysadmin/miniconda3/envs/nlp/lib/python3.8/site-packages/torch/nn/functional.py:2515 in layer_norm

  2512         return handle_torch_function(
  2513             layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, b
  2514         )
❱ 2515     return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.c
  2516
  2517
  2518 def group_norm(

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0
and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

I think the source of the problem is that the layer normalization and the MLP layers are not on the same device.
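
As a quick diagnostic, grouping the parameters by device makes such splits easier to spot (just a sketch; the parameter names follow the printout above):

from collections import defaultdict

# Collect parameter names per device so that modules split across GPUs stand out.
params_per_device = defaultdict(list)
for name, param in model.named_parameters():
    params_per_device[str(param.device)].append(name)

for device, names in sorted(params_per_device.items()):
    print(f"{device}: {len(names)} parameters, e.g. {names[:2]}")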

To be clear, I am not an expert in multi-GPU training; I just got some models to work in my setup, but I don't fully understand what is happening under the hood.

5 Likes

Thank you so much @ehalit. It is wonderful to know that we can deep dive in this way. :star_struck: Any pointer to someone who could help on this forum? Could you tag them, please?

@pchhapolika - I am using an 8-GPU cluster and I am encountering the same issue. Did you happen to find a solution yet? Thanks!

@pchhapolika I am using a 4-GPU cluster and I am encountering the same issue. Did you happen to find a solution yet? Thanks!

1 Like

I think I know why, and I solved it.

I encountered the same problem when loading the recently popularized tiiuae/falcon-7b-instruct model. I get the same error if I load the model with

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, load_in_8bit=True, device_map='auto', torch_dtype=torch.bfloat16)

However, if I do not specify load_in_8bit=True the problem goes away and it works properly:

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, device_map='auto', torch_dtype=torch.bfloat16)

I think it is because the repository does not have a dedicated 8-bit model.

So, I revised your script as

from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer, AutoModelForCausalLM
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
from torch.utils.data import TensorDataset, DataLoader,Dataset
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b",device_map='auto') 

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=50,
    prompt_tuning_init_text="Answer the question as truthfully as possible",
    tokenizer_name_or_path="databricks/dolly-v2-3b"
)

model = get_peft_model(model, peft_config)

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

model(**tokenizer("Hello World", return_tensors="pt"))

and it works fine on my 4-GPU machine. The memory usage on each GPU is around 4 GB.
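
If you want to verify the per-GPU usage yourself, something like this should do (a small sketch; the numbers will vary by setup):

import torch

# Print the memory PyTorch has currently allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    print(f"cuda:{i}: {allocated:.2f} GiB allocated")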

2 Likes

Thank you so much @ehalit

The code works great. Sorry, maybe this is a stupid question, but how can I use the output of model(**tokenizer("Hello World", return_tensors="pt")) to get some text out of it?

Well, this line generates probability estimates (logits) for each token, which need to be decoded to produce text. So, for the purpose of demonstration,

print(tokenizer.batch_decode(model(**tokenizer("Hello World", return_tensors="pt")).logits.argmax(-1))[0])

should work. But this is greedy decoding: it predicts the most probable token at each step without any randomization. Ideally, if you want to generate text for inference (not training), you should use the model.generate() function, like so:

print(tokenizer.decode(model.generate(**tokenizer("Hello World", return_tensors="pt"))[0]))

@ehalit would you mind sharing your script for loading falcon-7b-instruct directly? I am still getting this same error when trying to load falcon-7b-instruct. Many thanks!

Sure thing!

from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory
from torch.cuda.amp import autocast
import torch

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, device_map='auto', torch_dtype=torch.bfloat16)

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["DecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["DecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

generation_kwargs = {
    "min_length": -1,
    "top_k": 0,
    "top_p": 0.85,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "min_new_tokens": 10,
    "max_new_tokens": 50,
    "eos_token_id": tokenizer.eos_token_id,
}


with autocast():
    print(tokenizer.decode(model.generate(tokenizer.encode("Hello World!", return_tensors="pt").to("cuda:0"), **generation_kwargs)[0]))

@ehalit
Can you show how to set up multi-GPU support for the GPT-J model?
I have been facing the same problem lately and I am stuck on it. Your help would be a great lead for me.

Here is a snapshot of my code:

import torch
import random
import pandas as pd
import config as cfg
from datasets import Dataset
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_int8_training
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

def load_tokenize_and_split_datasets(tokenizer, max_length, train_ratio):
    df = pd.read_csv(cfg.DATASET_PATH)
    dataset = Dataset.from_pandas(df)
    tokenized_datasets = dataset.map(lambda examples: tokenizer(examples["text"],
                                                                padding="max_length",
                                                                truncation=True,
                                                                max_length=max_length),
                                     batched=True)

    num_samples = len(tokenized_datasets)
    num_train_samples = int(train_ratio * num_samples)
    # Randomly select samples for the training set
    train_indices = random.sample(range(num_samples), num_train_samples)
    train_dataset = tokenized_datasets.select(train_indices)
    # Select the remaining samples for the testing set
    eval_indices = list(set(range(num_samples)) - set(train_indices))
    eval_dataset = tokenized_datasets.select(eval_indices)
    return train_dataset, eval_dataset

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def train():
    tokenizer = AutoTokenizer.from_pretrained(cfg.MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(cfg.MODEL_NAME,
                                                 torch_dtype=torch.float16)
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, config)
    print_trainable_parameters(model)

    training_args = TrainingArguments(
        output_dir=cfg.OUTPUT_DIR,
        learning_rate=cfg.LEARNING_RATE,
        num_train_epochs=cfg.NUM_TRAIN_EPOCHS,
        per_device_train_batch_size=cfg.PER_DEVICE_TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=cfg.PER_DEVICE_EVAL_BATCH_SIZE,
        optim=cfg.OPTIM,
        fp16=True,
        gradient_accumulation_steps=cfg.GRADIENT_ACCUMULATION_STEPS,
        warmup_ratio=cfg.WARMUP_RATIO,
        lr_scheduler_type=cfg.LR_SCHEDULER_TYPE,
        eval_steps=cfg.EVAL_STEPS,
        save_strategy=cfg.SAVE_STRATEGY,
        logging_dir=cfg.LOGGING_DIR,
        logging_steps=cfg.LOGGING_STEPS,
    )

    # Create data collator
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    train_datasets, eval_datasets = load_tokenize_and_split_datasets(tokenizer=tokenizer,
                                                                     max_length=cfg.MAX_LENGTH,
                                                                     train_ratio=cfg.TRAIN_RATIO)

    # Create Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_datasets,
        eval_dataset=eval_datasets,
        data_collator=data_collator,
    )
    # Start training
    trainer.train()
    # Save the trained model
    trainer.save_model(cfg.OUTPUT_DIR)

if __name__ == "__main__":
    train()

I got the same issue and solved it.

Actually, I only have two GPUs, so I created a device map manually and then dispatched the model with it. That works.
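
For reference, a rough sketch of what such a map can look like, assuming two GPUs and the dolly-v2-3b (GPTNeoX) layout printed earlier in this thread; the split point is illustrative:

from accelerate import dispatch_model

# Assign whole blocks to devices by hand; the keys follow the module names from print(model).
device_map = {
    "word_embeddings": 0,
    "prompt_encoder": 0,
    "base_model.gpt_neox.embed_in": 0,
    **{f"base_model.gpt_neox.layers.{i}": 0 for i in range(16)},
    **{f"base_model.gpt_neox.layers.{i}": 1 for i in range(16, 32)},
    "base_model.gpt_neox.final_layer_norm": 1,
    "base_model.embed_out": 1,
}

model = dispatch_model(model, device_map=device_map)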

1 Like

Thanks for your deep dive. I got it working after specifying the blocks that contain the residual connections as the "non-split" ones.
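
In practice that means passing the block classes to infer_auto_device_map, for example (a sketch; the class name depends on the architecture, e.g. GPTNeoXLayer for dolly-v2 or LlamaDecoderLayer for Llama models):

from accelerate import dispatch_model, infer_auto_device_map

# Keep each decoder block (which carries the residual connection) on a single GPU.
device_map = infer_auto_device_map(
    model,
    no_split_module_classes=["LlamaDecoderLayer"],  # adjust to your model's block class
    dtype="float16",
)
model = dispatch_model(model, device_map=device_map)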

1 Like

Could you please provide some hints? Thanks!

Same issue for me when I try to fine-tune Llama 2 on a g5.24xlarge instance using SFTTrainer, but the error did not come back after I restarted the notebook and reran the training code. I haven't figured out the reason yet. :joy:

@ehalit Can you do it for Llama 2 7B?

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)