BitsandBytes conflict with Accelerate

I’m running inference on a custom VLM-derived model. Inference works fine when using the weights in their bfloat16 precision. However, when I define a BitsAndBytes config, I receive an error that I suspect is due to a conflict between bitsandbytes and Accelerate, where both try to set the compute device, generating the following stack trace.

Traceback (most recent call last):
  File "/home/tyr/RobotAI/openvla/scripts/extern/verify_prismatic.py", line 147, in <module>
    verify_prismatic()
  File "/home/tyr/miniforge3/envs/openvla/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tyr/RobotAI/openvla/scripts/extern/verify_prismatic.py", line 97, in verify_prismatic
    vlm = AutoModelForVision2Seq.from_pretrained(
  File "/home/tyr/miniforge3/envs/openvla/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/tyr/miniforge3/envs/openvla/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3735, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/home/tyr/miniforge3/envs/openvla/lib/python3.10/site-packages/accelerate/big_modeling.py", line 499, in dispatch_model
    model.to(device)
  File "/home/tyr/miniforge3/envs/openvla/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2670, in to
    raise ValueError(
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

This is the code that generated the above stack trace:

    vlm = AutoModelForVision2Seq.from_pretrained(
        MODEL_PATH,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.float16,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

I’ve checked that the model is not being moved with a .to() call anywhere in my code. I’ve also tried adding device_map=None and setting torch_dtype="auto", but neither resolves the issue.

Has anyone encountered this error before or have some suggestions about what might be going wrong? Thanks!


Hmm… How about device_map="cuda" or device_map="sequential"?

I’ve tried setting device_map to "cuda" or "sequential", but no luck.
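
For reference, the variants I’ve tried look roughly like this (a sketch; MODEL_PATH and the unchanged arguments are the same as in the snippet above):

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

vlm = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,          # also tried torch_dtype="auto"
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="cuda",                  # also tried None and "sequential"
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)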

I checked the GitHub issue that you mentioned and applied the code changes suggested by its associated pull request (my accelerate package is 1.5.1). This moved me past the ValueError, but I then encountered the next error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

This error is generated when I try calling the generate() method of the AutoModelForVision2Seq model.

 gen_ids = vlm.generate(**inputs, do_sample=False, min_length=1, max_length=512)

The processor is used like so:

inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)

I’ve been toggling the loading parameters on and off (e.g. turning off flash-attn, turning off low_cpu_mem_usage), but the two-devices RuntimeError persists. Right now I still can’t identify which part of the weights, inputs, or operations is being run on the CPU.
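
One check I still want to run is dumping whatever ended up on the CPU, roughly like this (a sketch; vlm is the model loaded above):

# List parameters/buffers that ended up on the CPU, and print the device map
# that accelerate assigned (hf_device_map is only set when a device_map is used).
cpu_params = [name for name, p in vlm.named_parameters() if p.device.type == "cpu"]
cpu_buffers = [name for name, b in vlm.named_buffers() if b.device.type == "cpu"]
print("parameters on cpu:", cpu_params[:10])
print("buffers on cpu:", cpu_buffers[:10])
print("hf_device_map:", getattr(vlm, "hf_device_map", None))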


Hmm… How about this for debugging?

inputs = processor(prompt, image).to(device, dtype=torch.bfloat16)
print(inputs.device)
print(model.device)

There wasn’t an inputs.device attribute, but I did the following:

for k, v in inputs.items():
    print(f"{k}: {v.device}")
print(f"model device: {vlm.device}")

and the output is

input_ids: cuda:0
attention_mask: cuda:0
pixel_values: cuda:0
model device: cuda:0

Hmm… It worked.

from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from PIL import Image
import torch
from transformers.image_utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    #attn_implementation="flash_attention_2", # [Optional] Requires `flash_attn`
    torch_dtype=dtype,
    #low_cpu_mem_usage=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=dtype),
    trust_remote_code=True
).to(device)

image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"

inputs = processor(prompt, image).to(device, dtype=dtype)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
#  x = F.scaled_dot_product_attention(
#[ 4.62235402e-04  6.69859354e-03  4.95526172e-03 -6.52482310e-03
#  9.93723747e-03  1.28276732e-02  9.96078431e-01]

# accelerate                1.0.1
# bitsandbytes              0.45.1
# torch                     2.4.0+cu124
# transformers              4.49.0.dev0

Well, I tried version matching one package at a time, and it turns out the transformers version was the issue. The project pins transformers at 4.40.1, but upgrading to 4.49.0 resolved the errors.

The above GitHub issue didn’t seem to apply to this case, surprisingly enough. When I was version matching, I removed and reinstalled accelerate and so wiped out the edits I had made, but both stock versions of accelerate (1.5.1, which I was using, as well as 1.0.1) work just fine. The bitsandbytes version didn’t seem to matter either (0.45.5, which was mine, and 0.45.1, which is yours).
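
For anyone comparing environments, something like this prints the versions that are actually active in the interpreter (a minimal check, nothing project-specific):

# Print the versions importable in the current environment.
import accelerate, bitsandbytes, torch, transformers

for pkg in (transformers, accelerate, bitsandbytes, torch):
    print(f"{pkg.__name__:<15} {pkg.__version__}")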

I don’t know what changed between transformers 4.40.1 and 4.49.0, but it probably has something to do with how transformers orchestrates the accelerate and bitsandbytes packages.

Also, thank you very much for your time and help!! I really appreciate it!
