How can I set `max_memory` parameter while loading Quantized model with Model Pipeline class?

Thanks for your kind response! :man_bowing:
Unfortunately, downgrading transformers to 4.49.0 didn't work in my case 😭
It still fails with the same error as below:

File "~/anaconda3/envs/sample_env/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 464, in _save_to_state_dict
    for k, v in self.weight.quant_state.as_dict(packed=True).items():
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/anaconda3/envs/sample_env/lib/python3.11/site-packages/bitsandbytes/functional.py", line 810, in as_dict
    "nested_offset": self.offset.item(),
                     ^^^^^^^^^^^^^^^^^^
NotImplementedError: aten::_local_scalar_dense: attempted to run this operator with Meta tensors, but there was no abstract impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add an abstract impl. Please see the following doc for next steps: https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU/edit
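
For what it's worth, the traceback points at `_save_to_state_dict` hitting weights that are still on the meta device (i.e. never materialized with real storage). This is just a guess, but a quick way to check is to list any such parameters before anything builds a state dict. A minimal diagnostic sketch, assuming `model` is your already-loaded model:

```python
import torch

def find_meta_params(model: torch.nn.Module) -> list[str]:
    # Return names of parameters still on the meta device (no real storage),
    # which is what the NotImplementedError above complains about.
    return [name for p_name, p in model.named_parameters() if (name := p_name) and p.device.type == "meta"]

# Example usage (hypothetical): run before save_pretrained / state_dict calls.
# offending = find_meta_params(model)
# if offending:
#     print("Still on meta device:", offending)
```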

However, as you suggested, it works properly with unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit!
Since only the original checkpoint fails, the problem might be related to how its weights were saved rather than to my environment.
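
In case it helps anyone searching later, here is a minimal sketch of how I understand `max_memory` can be passed through the pipeline class: `model_kwargs` is forwarded to `from_pretrained`, which is where `max_memory` is consumed. The task name and the memory limits below are just example assumptions, not values from this thread:

```python
from transformers import pipeline

# Minimal sketch (example values): max_memory keys are GPU indices or "cpu",
# and the dict is forwarded to from_pretrained via model_kwargs.
pipe = pipeline(
    "image-text-to-text",  # assumed task for Qwen2.5-VL; adjust for your model
    model="unsloth/Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit",
    device_map="auto",
    model_kwargs={"max_memory": {0: "6GiB", "cpu": "16GiB"}},
)
```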

Thank you again for your kind reply, and I hope you have a great day! :man_bowing:


PS: In case anyone else runs into a similar problem, I'm sharing my environment below in the hope it helps.

  • Python 3.11.11
  • CUDA 12.1
  • torch==2.3.1+cu121
  • torchvision==0.18.1+cu121
  • accelerate==1.5.2
  • bitsandbytes==0.45.3
  • flash-attn==2.7.3
  • transformers==4.49.0