Multi-GPU inference for Llama-3.2 Vision with QLoRA

Hello :slight_smile:

After fine-tuning meta-llama/Llama-3.2-11B-Vision-Instruct, I run into a weird error when running multi-GPU inference.

This is how I load the model:

import torch
from transformers import BitsAndBytesConfig, MllamaForConditionalGeneration
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_path_or_name,
    quantization_config=bnb_config,
    device_map='auto',   # shard across all visible GPUs
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(model, adapter_path)

The model is loaded across all available GPUs - in my case 4.
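
A quick way to confirm the sharding (hf_device_map is only set when Accelerate dispatches the model, e.g. with device_map='auto'):

# maps module names to the GPU index they were placed on
print(model.hf_device_map)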

Code for generation:

from tqdm import trange

for i in trange(0, len(dataset), batch_size, desc="Inference"):
    batch = dataset[i: i + batch_size]
    # turn the dict of lists into a list of per-sample dicts
    batch = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
    device_type = next(model.parameters()).device.type  # just 'cuda'

    with torch.no_grad():
        tokenized_batch = collator(batch)
        tokenized_batch = tokenized_batch.to(device_type)

        # text generation part
        raw_texts = generate(model, processor, tokenized_batch, terminators)

and I run into a device mismatch error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

For other models, Qwen for example, everything works perfectly! This happens only with MllamaForConditionalGeneration. I was able to work around it by removing device_map from from_pretrained, but that isn't a good option because then the model runs on only a single GPU.

BTW, I'm running the script with plain python inference.py, no accelerate launch or anything else.
Any ideas/suggestions?

Well, since Accelerate is called under the hood when device_map= is used, it’s probably a problem between Accelerate and bitsandbytes…

And Llama 3.x sometimes has bugs when it’s not bfloat16. Also, device_map="auto" doesn’t seem to get along well with bitsandbytes.

Well, in your case, it seems like you have enough VRAM even without device_map=, so I think that’s fine…

model = MllamaForConditionalGeneration.from_pretrained(
    model_path_or_name,
    torch_dtype=torch.bfloat16,  # added
    quantization_config=bnb_config,
    # device_map='auto',
    device_map='sequential',
    low_cpu_mem_usage=True,
)
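
If you still want it sharded across all four GPUs, another knob worth trying is an explicit max_memory budget per device, so Accelerate builds the map under tighter constraints. A sketch only: the GiB values are placeholders, adjust them to your cards:

max_memory = {i: "18GiB" for i in range(4)}  # placeholder budgets for 4 GPUs
model = MllamaForConditionalGeneration.from_pretrained(
    model_path_or_name,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map='auto',
    max_memory=max_memory,
    low_cpu_mem_usage=True,
)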

Hi @John6666, thanks for your input.

I don’t think the issue lies in loading the model itself, as I can clearly see the model is being properly sharded across devices when using device_map='auto'. I should also mention that everything works perfectly when running inference with just the base model — it gets correctly sharded and the batch is distributed across the GPUs. The problem only arises after adding the PEFT adapter.
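
A quick way to see where the adapter weights land after PeftModel.from_pretrained (a sketch; the "lora_" name filter just picks out the PEFT parameters):

# set of devices holding the LoRA weights after attaching the adapter
lora_devices = {p.device for name, p in model.named_parameters() if "lora_" in name}
print(lora_devices)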

I see, then it certainly doesn’t seem to be a problem with from_pretrained. It might be PEFT itself, the get_peft_model / PeftModel.from_pretrained step, that’s suspicious.

For now, I’ve found something that looks like a questionable workaround rather than a real solution…:sweat_smile:
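
One more idea that sometimes sidesteps adapter/device issues: merge the LoRA weights into an unquantized copy of the base model once, save it, and then load the merged checkpoint with your bnb_config and device_map='auto' exactly like before, with no PeftModel at inference time. A minimal sketch, assuming a standard LoRA adapter and enough CPU RAM for the bf16 weights (the output directory name is just an example):

import torch
from transformers import MllamaForConditionalGeneration
from peft import PeftModel

# load the base model unquantized on CPU, attach the adapter, fold it in
base = MllamaForConditionalGeneration.from_pretrained(
    model_path_or_name, torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, adapter_path)
merged = merged.merge_and_unload()  # bakes the LoRA deltas into the base weights
merged.save_pretrained("llama-3.2-11b-vision-merged")  # example output dir

# then at inference: from_pretrained("llama-3.2-11b-vision-merged",
#                                    quantization_config=bnb_config, device_map='auto')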
