RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Thanks! Your solution works.

Did you figure out a similar solution for Llama-2 7B?

I have already fine-tuned a Llama-2 model on a QA dataset. Below is the code snippet for model loading during fine-tuning, where I used `device_map="auto"`. I have 2 GPUs and I am utilizing both during the fine-tuning.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the base Llama-2 model for fine-tuning
base_model = "meta-llama/Llama-2-7b-chat-hf"

compute_dtype = getattr(torch, "float16")
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
llma2_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map="auto",
)
llma2_model.config.use_cache = False
llma2_model.config.pretraining_tp = 1

# Load the Llama-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```
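
The training loop itself isn't shown above; it was a LoRA-style fine-tune on top of the 4-bit model (that is where the `lora_A`/`lora_B` modules in the checkpoint come from). Below is only a rough sketch of such a setup with placeholder values; the LoRA settings, trainer arguments, and `train_dataset` are illustrative, not my exact configuration.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Wrap the quantized base model with LoRA adapters (placeholder hyperparameters)
llma2_model = prepare_model_for_kbit_training(llma2_model)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # matches the lora_A/lora_B layers seen in the checkpoint
    task_type="CAUSAL_LM",
)
llma2_model = get_peft_model(llma2_model, peft_config)

training_args = TrainingArguments(
    output_dir="/home/LLAMA2/checkpoints/checkP4",  # checkpoints are written here
    per_device_train_batch_size=4,
    save_steps=10_000,   # periodic checkpoints such as checkpoint-160000
    num_train_epochs=3,
    fp16=True,
)
trainer = Trainer(
    model=llma2_model,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical tokenized QA dataset (not shown here)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```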

I was saving model checkpoints during the process. Now I want to load a checkpoint and train it further. However, when I load my checkpoint the same way I loaded the Hugging Face Llama-2 7B model, it generates the following error once training starts.

ERROR:
`"name": "RuntimeError",`
`"message": "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)",`

Code used for loading the checkpoint:

base_model="/home/LLAMA2/checkpoints/checkP4/checkpoint-160000"

compute_dtype = getattr(torch, "float32") quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=False, ) llma2_model = AutoModelForCausalLM.from_pretrained( base_model, quantization_config=quant_config, device_map="auto", load_in_4bit=True, ) llma2_model.config.use_cache = False llma2_model.config.pretraining_tp = 1

# Load the Llama-2 tokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True, fast=True) tokenizer.pad_token = tokenizer.eos_token tokenizer.padding_side = "right"

I have tried moving the model to a single device using the code below.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llma2_model.to(device)
```

However, it raises an error because the model is quantized:

```
You shouldn't move a model when it is dispatched on multiple devices. Using device: cuda
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
```

Here is a sample of the model parameters and their locations; as you can see, they are spread across both GPU 0 and GPU 1.

```
model.layers.12.self_attn.k_proj.weight is on cuda:0
model.layers.12.self_attn.v_proj.lora_A.default.weight is on cuda:0
model.layers.12.self_attn.v_proj.lora_B.default.weight is on cuda:0
model.layers.12.self_attn.v_proj.base_layer.weight is on cuda:0
model.layers.12.self_attn.o_proj.weight is on cuda:0
model.layers.12.mlp.gate_proj.weight is on cuda:0
model.layers.12.mlp.up_proj.weight is on cuda:0
model.layers.12.mlp.down_proj.weight is on cuda:0
model.layers.12.input_layernorm.weight is on cuda:0
model.layers.12.post_attention_layernorm.weight is on cuda:0
model.layers.13.self_attn.q_proj.lora_A.default.weight is on cuda:1
model.layers.13.self_attn.q_proj.lora_B.default.weight is on cuda:1
model.layers.13.self_attn.q_proj.base_layer.weight is on cuda:1
model.layers.13.self_attn.k_proj.weight is on cuda:1
```
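
(For reference, a listing like this can be produced by iterating over the parameters; the `hf_device_map` attribute, which accelerate sets when `device_map` is used, shows the module-to-device assignment as well.)

```python
# Inspect where each parameter ended up after dispatch
for name, param in llma2_model.named_parameters():
    print(f"{name} is on {param.device}")

# The module-to-device map chosen by accelerate is also available directly
print(llma2_model.hf_device_map)
```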

Can someone provide a solution to this issue? All I want is to load my saved checkpoint and proceed with further fine-tuning, but I couldn't find a solution. I'd be really glad if someone can offer one.

Try to specify the CUDA device:

```python
device = "cuda:0"
```


The issue is that I use a quantized model, which doesn't allow me to call `.to(device)`.
Furthermore, what I am trying to do is further fine-tuning from an already saved checkpoint.
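
If I understand the error correctly, the device for a quantized model has to be chosen at load time rather than with `.to()`. A minimal sketch of what I could try, pinning the whole model to a single GPU via `device_map` instead of `"auto"` (untested, and it may not resolve the CPU-placed tensor by itself):

```python
# Sketch (untested): load the quantized checkpoint entirely on one GPU
# instead of letting accelerate split it across devices with device_map="auto".
llma2_model = AutoModelForCausalLM.from_pretrained(
    base_model,                        # checkpoint directory from above
    quantization_config=quant_config,  # same BitsAndBytesConfig as before
    device_map={"": 0},                # place every module on cuda:0
)
```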