RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

The code is below. It runs on 1 GPU but fails on 2 or more GPUs.

from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer, AutoModelForCausalLM
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
from torch.utils.data import TensorDataset, DataLoader, Dataset
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("dolly-v2-3b")
model = AutoModelForCausalLM.from_pretrained("dolly-v2-3b",load_in_8bit=True,device_map='auto') 

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=50,
    prompt_tuning_init_text="Answer the question as truthfully as possible",
    tokenizer_name_or_path="dolly-v2-3b"
)

model = get_peft_model(model, peft_config)
model.to(device)

train_data = [
    {
        "context": "How to Link Credit Card to ICICI Bank Account Step 1: ",
        "question": "How to add card?",
        "answer": "Relevant. To add your card you can follow these steps: Step 1: "
    },
    {
        "context": "The python programming language is ",
        "question": "What is Python used for?",
        "answer": "Relevant. Python is used in many different fields in"
    }
]

def preprocess_function(examples):  
    tokenized_examples = tokenizer(
        examples["context"],
        examples["question"],
        truncation=True,
        max_length=2048,
        padding="max_length"
    )
    tokenized_examples['labels']=tokenizer(
        examples["answer"],
        truncation=True,
        max_length=2048,
        padding="max_length",
        return_tensors="pt")['input_ids'][0]
    
    return tokenized_examples

tokenized_train_data = [preprocess_function(example) for example in train_data]

class DemoDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        sample = self.data[idx]      
        item = {k: torch.tensor(v) for k, v in sample.items()}
        return item

dataset = DemoDataset(tokenized_train_data)

training_args = TrainingArguments(
        output_dir="results2",
        learning_rate=1e-5,
        per_device_train_batch_size=2,
        num_train_epochs=2,
        weight_decay=0.01,
        logging_steps=1000,
        save_strategy="epoch",
        logging_dir="logs2"
    )
trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=dataset,
      # data_collator=data_collator,
      tokenizer=tokenizer
  )
trainer.train()

ERROR

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.8/site-packages/peft/peft_model.py", line 723, in forward
    inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
(when checking argument for argument tensors in method wrapper_CUDA_cat)

@ehalit

1 Like

I believe adding .to("cuda") should work, like this:

class DemoDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        sample = self.data[idx]      
        item = {k: torch.tensor(v).to("cuda") for k, v in sample.items()}
        return item

You need not specify which device you put the tensors on before runtime.
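
For example, a quick check that the samples indeed come out on a GPU (a small sketch using the DemoDataset above; expect cuda:0 for every tensor):

dataset = DemoDataset(tokenized_train_data)
sample = dataset[0]
# Each value should now report a CUDA device rather than the CPU.
print({k: v.device for k, v in sample.items()})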

2 Likes

Hi @ehalit, still getting the error.

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.8/site-packages/peft/peft_model.py", line 723, in forward
    inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
(when checking argument for argument tensors in method wrapper_CUDA_cat)

Multi-GPU setups are tricky :sweat_smile: I couldn't get it to work yet, but I think I spotted the problem.

First, I did a manual placement of layers on the GPUs with this:

from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

The names of the layers can be found by printing the model:

>>> print(model)
PeftModelForCausalLM(
  (base_model): GPTNeoXForCausalLM(
    (gpt_neox): GPTNeoXModel(
      (embed_in): Embedding(50280, 2560)
      (layers): ModuleList(
        (0-31): 32 x GPTNeoXLayer(
          (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
          (attention): GPTNeoXAttention(
            (rotary_emb): RotaryEmbedding()
            (query_key_value): Linear8bitLt(in_features=2560, out_features=7680, bias=True)
            (dense): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
          )
          (mlp): GPTNeoXMLP(
            (dense_h_to_4h): Linear8bitLt(in_features=2560, out_features=10240, bias=True)
            (dense_4h_to_h): Linear8bitLt(in_features=10240, out_features=2560, bias=True)
            (act): GELUActivation()
          )
        )
      )
      (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
    )
    (embed_out): Linear(in_features=2560, out_features=50280, bias=False)
  )
  (word_embeddings): Embedding(50280, 2560)
  (prompt_encoder): PromptEmbedding(
    (embedding): Embedding(50, 2560)
  )
)

Then, when I inspect the devices of the parameters with,

for i in model.named_parameters():
    print(f"{i[0]} -> {i[1].device}")

I get the following device placement (on a 4-GPU setup):

...
base_model.gpt_neox.layers.3.mlp.dense_h_to_4h.weight -> cuda:0
base_model.gpt_neox.layers.3.mlp.dense_h_to_4h.bias -> cuda:0
base_model.gpt_neox.layers.3.mlp.dense_4h_to_h.weight -> cuda:0
base_model.gpt_neox.layers.3.mlp.dense_4h_to_h.bias -> cuda:0
base_model.gpt_neox.layers.4.input_layernorm.weight -> cuda:1
base_model.gpt_neox.layers.4.input_layernorm.bias -> cuda:1
base_model.gpt_neox.layers.4.post_attention_layernorm.weight -> cuda:1
base_model.gpt_neox.layers.4.post_attention_layernorm.bias -> cuda:1
...

Notice how the input layer normalization of the next block is on a different device. The error message I receive also points to the layer normalization:

/home/sysadmin/miniconda3/envs/nlp/lib/python3.8/site-packages/torch/nn/functional.py:2515 in layer_norm

  2512         return handle_torch_function(
  2513             layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, b
  2514         )
❱ 2515     return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.c
  2516
  2517
  2518 def group_norm(

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0
and cuda:1! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

I think the source of the problem is that the layer normalization and the MLP layers are not on the same device.
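
As a quick diagnostic, grouping the parameters by device makes such splits easier to spot (just a sketch; the parameter names follow the printout above):

from collections import defaultdict

# Collect parameter names per device so that modules split across GPUs stand out.
params_per_device = defaultdict(list)
for name, param in model.named_parameters():
    params_per_device[str(param.device)].append(name)

for device, names in sorted(params_per_device.items()):
    print(f"{device}: {len(names)} parameters, e.g. {names[:2]}")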

To be clear, I am not an expert in multi-GPU training; I just got some models to work in my setup, but I don't fully understand what is happening under the hood.

5 Likes

Thank you so much @ehalit. It is wonderful to know that we can deep dive in this way. :star_struck: Any pointer to someone who could help on this forum? Could you tag them, please?

@pchhapolika - I am using an 8-GPU cluster and I am encountering the same issue. Did you happen to find a solution yet? Thanks!

@pchhapolika I am using a 4-GPU cluster and I am encountering the same issue. Did you happen to find a solution yet? Thanks!

1 Like

I think I know why, and I solved it.

I encountered the same problem when loading the recently popularized tiiuae/falcon-7b-instruct model. I get the same error if I load the model with

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, load_in_8bit=True, device_map='auto', torch_dtype=torch.bfloat16)

However, if I do not specify load_in_8bit=True the problem goes away and it works properly:

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, device_map='auto', torch_dtype=torch.bfloat16)

I think it is because the repository does not have a dedicated 8-bit model.

So, I revised your script as

from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer, AutoModelForCausalLM
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
from torch.utils.data import TensorDataset, DataLoader,Dataset
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b",device_map='auto') 

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=50,
    prompt_tuning_init_text="Answer the question as truthfully as possible",
    tokenizer_name_or_path="databricks/dolly-v2-3b"
)

model = get_peft_model(model, peft_config)

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GPTNeoXLayer", "GPTNeoXMLP"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

model(**tokenizer("Hello World", return_tensors="pt"))

and it works fine on my 4-GPU machine. The memory usage on each GPU is around 4 GB.
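
If you want to verify the per-GPU usage yourself, something like this should do (a small sketch; the numbers will vary by setup):

import torch

# Print the memory PyTorch has currently allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    print(f"cuda:{i}: {allocated:.2f} GiB allocated")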

2 Likes

Thank you so much @ehalit

The code works great. Sorry, maybe this is a stupid question, but how can I use the output of model(**tokenizer("Hello World", return_tensors="pt")) to get some text out of it?

Well, this line generates probability estimates (logits) for each token, which need to be decoded to produce text. So, for the purpose of demonstration,

print(tokenizer.batch_decode(model(**tokenizer("Hello World", return_tensors="pt")).logits.argmax(-1))[0])

should work. But this is greedy decoding: it predicts the most probable token at each step without any randomization. Ideally, if you want to generate text for inference (not training), you should use the model.generate() function, like so:

print(tokenizer.decode(model.generate(**tokenizer("Hello World", return_tensors="pt"))[0]))

@ehalit would you mind sharing your script for loading falcon-7b-instruct directly? I am still getting this same error when trying to load falcon-7b-instruct. Many thanks!

Sure thing!

from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory
from torch.cuda.amp import autocast
import torch

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True, device_map='auto', torch_dtype=torch.bfloat16)

max_memory = get_balanced_memory(
    model,
    max_memory=None,
    no_split_module_classes=["DecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"],
    dtype='float16',
    low_zero=False,
)

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["DecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"],
    dtype='float16'
)

model = dispatch_model(model, device_map=device_map)

generation_kwargs = {
    "min_length": -1,
    "top_k": 0,
    "top_p": 0.85,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "min_new_tokens": 10,
    "max_new_tokens": 50,
    "eos_token_id": tokenizer.eos_token_id,
}


with autocast():
    print(tokenizer.decode(model.generate(tokenizer.encode("Hello World!", return_tensors="pt").to("cuda:0"), **generation_kwargs)[0]))

@ehalit
Can you show how to set up multi-GPU support for the GPT-J model?
I have been facing the same problem lately and I am stuck on it. Your help would be a great lead for me.

Here is a snapshot of my code:

import torch
import random
import pandas as pd
import config as cfg
from datasets import Dataset
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_int8_training
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

def load_tokenize_and_split_datasets(tokenizer, max_length, train_ratio):
    df = pd.read_csv(cfg.DATASET_PATH)
    dataset = Dataset.from_pandas(df)
    tokenized_datasets = dataset.map(lambda examples: tokenizer(examples["text"],
                                                                padding="max_length",
                                                                truncation=True,
                                                                max_length=max_length),
                                     batched=True)

    num_samples = len(tokenized_datasets)
    num_train_samples = int(train_ratio * num_samples)
    # Randomly select samples for the training set
    train_indices = random.sample(range(num_samples), num_train_samples)
    train_dataset = tokenized_datasets.select(train_indices)
    # Select the remaining samples for the testing set
    eval_indices = list(set(range(num_samples)) - set(train_indices))
    eval_dataset = tokenized_datasets.select(eval_indices)
    return train_dataset, eval_dataset

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

def train():
    tokenizer = AutoTokenizer.from_pretrained(cfg.MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(cfg.MODEL_NAME,
                                                 torch_dtype=torch.float16)
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, config)
    print_trainable_parameters(model)

    training_args = TrainingArguments(
        output_dir=cfg.OUTPUT_DIR,
        learning_rate=cfg.LEARNING_RATE,
        num_train_epochs=cfg.NUM_TRAIN_EPOCHS,
        per_device_train_batch_size=cfg.PER_DEVICE_TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=cfg.PER_DEVICE_EVAL_BATCH_SIZE,
        optim=cfg.OPTIM,
        fp16=True,
        gradient_accumulation_steps=cfg.GRADIENT_ACCUMULATION_STEPS,
        warmup_ratio=cfg.WARMUP_RATIO,
        lr_scheduler_type=cfg.LR_SCHEDULER_TYPE,
        eval_steps=cfg.EVAL_STEPS,
        save_strategy=cfg.SAVE_STRATEGY,
        logging_dir=cfg.LOGGING_DIR,
        logging_steps=cfg.LOGGING_STEPS,
    )

    # Create data collator
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    train_datasets, eval_datasets = load_tokenize_and_split_datasets(tokenizer=tokenizer,
                                                                     max_length=cfg.MAX_LENGTH,
                                                                     train_ratio=cfg.TRAIN_RATIO)

    # Create Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_datasets,
        eval_dataset=eval_datasets,
        data_collator=data_collator,
    )
    # Start training
    trainer.train()
    # Save the trained model
    trainer.save_model(cfg.OUTPUT_DIR)

if __name__ == "__main__":
    train()

I got the same issue and solved it.

Actually, I only have two GPUs, so I created a device map manually and then dispatched the model with it. That works.
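
For reference, a rough sketch of what such a map can look like, assuming two GPUs and the dolly-v2-3b (GPTNeoX) layout printed earlier in this thread; the split point is illustrative:

from accelerate import dispatch_model

# Assign whole blocks to devices by hand; the keys follow the module names from print(model).
device_map = {
    "word_embeddings": 0,
    "prompt_encoder": 0,
    "base_model.gpt_neox.embed_in": 0,
    **{f"base_model.gpt_neox.layers.{i}": 0 for i in range(16)},
    **{f"base_model.gpt_neox.layers.{i}": 1 for i in range(16, 32)},
    "base_model.gpt_neox.final_layer_norm": 1,
    "base_model.embed_out": 1,
}

model = dispatch_model(model, device_map=device_map)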

1 Like

Thanks for your deep dive. I got it working after specifying the blocks that contain the residual connections as the "non-split" ones.
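
In practice that means passing the block classes to infer_auto_device_map, for example (a sketch; the class name depends on the architecture, e.g. GPTNeoXLayer for dolly-v2 or LlamaDecoderLayer for Llama models):

from accelerate import dispatch_model, infer_auto_device_map

# Keep each decoder block (which carries the residual connection) on a single GPU.
device_map = infer_auto_device_map(
    model,
    no_split_module_classes=["LlamaDecoderLayer"],  # adjust to your model's block class
    dtype="float16",
)
model = dispatch_model(model, device_map=device_map)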

1 Like

Could you please provide some hints? Thanks!

Same issue for me when I try to fine-tune Llama 2 on a g5.24xlarge instance using SFTTrainer, but the error did not come back after I restarted the notebook and reran the training code. I haven't figured out the reason yet. :joy:

@ehalit Can you do it for Llama 2 7B?

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)