Loading quantized model on CPU only

ArcaneBlackwood · April 27, 2023, 8:53am

I'm currently trying to run BloomZ 7b1 on a server with ~31GB of available RAM. Without quantization, loading the model starts filling up swap, which is far from desirable. I tried enabling quantization with load_in_8bit:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

modelPath = "/mnt/backup1/BLOOM/"

device = torch.device("cpu")
tokenizer = AutoTokenizer.from_pretrained(modelPath)
model = AutoModelForCausalLM.from_pretrained(modelPath, device_map="auto", local_files_only=True, load_in_8bit=True).to(device)

prompt = 'Write code for finding the prime number in python ?'
input_ids = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_ids, max_length=100, do_sample=True, top_k=50, temperature=0.25)
response = tokenizer.decode(output[0])

print(response)

It then fails with an eventual crash:

Log

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Eithe
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA_SETUP: Loading binary /home/connor/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/connor/.local/lib/python3.10/site-packages/bitsandbytes/cextension.py:43: UserWarning: The installed version of bitsandbytes was compiled
  warn(
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/connor/Workspace/chatgpt/bloom/main.py:9 in <module>                                       │
│                                                                                                  │
│    6 print("Loaded Torch")                                                                       │
│    7 tokenizer = AutoTokenizer.from_pretrained("/mnt/backup1/BLOOM/")                            │
│    8 print("Loaded Tokenizer")                                                                   │
│ ❱  9 model = AutoModelForCausalLM.from_pretrained("/mnt/backup1/BLOOM", device_map={"lm_head"    │
│   10 print("Loaded Model")                                                                       │
│   11                                                                                             │
│   12 prompt = 'Write code for finding the prime number in python ?'                              │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:471 in │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   468 │   │   │   )                                                                              │
│   469 │   │   elif type(config) in cls._model_mapping.keys():                                    │
│   470 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
│ ❱ 471 │   │   │   return model_class.from_pretrained(                                            │
│   472 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
│   473 │   │   │   )                                                                              │
│   474 │   │   raise ValueError(                                                                  │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/transformers/modeling_utils.py:2583 in          │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   2580 │   │   │   keep_in_fp32_modules = []                                                     │
│   2581 │   │                                                                                     │
│   2582 │   │   if load_in_8bit:                                                                  │
│ ❱ 2583 │   │   │   from .utils.bitsandbytes import get_keys_to_not_convert, replace_8bit_linear  │
│   2584 │   │   │                                                                                 │
│   2585 │   │   │   load_in_8bit_skip_modules = quantization_config.llm_int8_skip_modules         │
│   2586 │   │   │   load_in_8bit_threshold = quantization_config.llm_int8_threshold               │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py:7 in         │
│ <module>                                                                                         │
│                                                                                                  │
│     4                                                                                            │
│     5                                                                                            │
│     6 if is_bitsandbytes_available():                                                            │
│ ❱   7 │   import bitsandbytes as bnb                                                             │
│     8 │   import torch                                                                           │
│     9 │   import torch.nn as nn                                                                  │
│    10                                                                                            │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/bitsandbytes/__init__.py:6 in <module>          │
│                                                                                                  │
│    3 # This source code is licensed under the MIT license found in the                           │
│    4 # LICENSE file in the root directory of this source tree.                                   │
│    5                                                                                             │
│ ❱  6 from .autograd._functions import (                                                          │
│    7 │   MatmulLtState,                                                                          │
│    8 │   bmm_cublas,                                                                             │
│    9 │   matmul,                                                                                 │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:170 in      │
│ <module>                                                                                         │
│                                                                                                  │
│   167                                                                                            │
│   168                                                                                            │
│   169 @dataclass                                                                                 │
│ ❱ 170 class MatmulLtState:                                                                       │
│   171 │   CB = None                                                                              │
│   172 │   CxB = None                                                                             │
│   173 │   SB = None                                                                              │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:189 in      │
│ MatmulLtState                                                                                    │
│                                                                                                  │
│   186 │   is_training = True                                                                     │
│   187 │   has_fp16_weights = True                                                                │
│   188 │   use_pool = False                                                                       │
│ ❱ 189 │   formatB = F.get_special_format_str()                                                   │
│   190 │                                                                                          │
│   191 │   def reset_grads(self):                                                                 │
│   192 │   │   self.CB = None                                                                     │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/bitsandbytes/functional.py:1684 in              │
│ get_special_format_str                                                                           │
│                                                                                                  │
│   1681                                                                                           │
│   1682                                                                                           │
│   1683 def get_special_format_str():                                                             │
│ ❱ 1684 │   major, minor = torch.cuda.get_device_capability()                                     │
│   1685 │   if major < 7:                                                                         │
│   1686 │   │   print(                                                                            │
│   1687 │   │   │   f"Device with CUDA capability of {major} not supported for 8-bit matmul. Dev  │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:381 in                   │
│ get_device_capability                                                                            │
│                                                                                                  │
│    378 │   Returns:                                                                              │
│    379 │   │   tuple(int, int): the major and minor cuda capability of the device                │
│    380 │   """                                                                                   │
│ ❱  381 │   prop = get_device_properties(device)                                                  │
│    382 │   return prop.major, prop.minor                                                         │
│    383                                                                                           │
│    384                                                                                           │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:395 in                   │
│ get_device_properties                                                                            │
│                                                                                                  │
│    392 │   Returns:                                                                              │
│    393 │   │   _CudaDeviceProperties: the properties of the device                               │
│    394 │   """                                                                                   │
│ ❱  395 │   _lazy_init()  # will define _get_device_properties                                    │
│    396 │   device = _get_device_index(device, optional=True)                                     │
│    397 │   if device < 0 or device >= device_count():                                            │
│    398 │   │   raise AssertionError("Invalid device id")                                         │
│                                                                                                  │
│ /home/connor/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:247 in _lazy_init        │
│                                                                                                  │
│    244 │   │   # are found or any other error occurs                                             │
│    245 │   │   if 'CUDA_MODULE_LOADING' not in os.environ:                                       │
│    246 │   │   │   os.environ['CUDA_MODULE_LOADING'] = 'LAZY'                                    │
│ ❱  247 │   │   torch._C._cuda_init()                                                             │
│    248 │   │   # Some of the queued calls may reentrantly call _lazy_init();                     │
│    249 │   │   # we need to just return without initializing in that case.                       │
│    250 │   │   # However, we must not let any *other* threads in!                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com

Looks like it's trying to access a CUDA API that just isn't there. I do not need this! I'm assuming there is SOME way to force the model to load and run entirely on the CPU, even if it is slow.

I want a script that forces the use of the CPU, loads BloomZ from a local repo folder, and quantizes the model to 8-bit while loading to prevent out-of-memory errors.

How do I do this?

chanansh · June 1, 2023, 7:19pm

I have a similar issue: AssertionError: Torch not compiled with CUDA enabled

gouse-73 · December 12, 2023, 9:08am

Did you guys find any solution for loading a model with quantization on CPU only?

vmss2009 · January 28, 2024, 2:39pm

Same here! Can anyone help with a workaround?

gugaio · January 28, 2024, 2:52pm

The flag load_in_8bit is used to enable 8-bit quantization with LLM.int8(). LLM.int8() is a lightweight wrapper around custom CUDA functions, so this quantization is only possible on a GPU.

You can find the required details on the official bitsandbytes GitHub page.

Requirements: Python >=3.8. Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0.
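
Since load_in_8bit needs CUDA, one possible CPU-only alternative is to skip bitsandbytes and load the weights in half precision instead. The following is only a rough sketch under stated assumptions, not a tested recipe for this exact setup: it assumes "7b1" means roughly 7.1 billion parameters, the function names weight_gib and load_bf16_on_cpu are made up for illustration, and low_cpu_mem_usage=True assumes the accelerate package is installed. The first function estimates the memory needed for the weights alone; the second loads a local checkpoint in bfloat16.

```python
# Bytes per parameter for a few common weight dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_gib(n_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def load_bf16_on_cpu(model_path: str):
    """Load a local checkpoint in bfloat16 on CPU, without bitsandbytes."""
    # Imported lazily so the size estimate above works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        torch_dtype=torch.bfloat16,  # half the size of float32 weights
        low_cpu_mem_usage=True,      # assumption: accelerate is installed
    )
    return tokenizer, model


if __name__ == "__main__":
    n_params = 7.1e9  # assumption: "7b1" means roughly 7.1 billion parameters
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_gib(n_params, dtype):.1f} GiB")
```

Under that parameter-count assumption, the float32 weights alone come to roughly 26 GiB, which would explain the swapping on a ~31GB machine, while bfloat16 needs about half of that. Activations and the generation cache take extra memory on top, and bfloat16 inference may be slow on CPUs without native support.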

nielsr · January 28, 2024, 9:00pm

If you want to run inference of quantized LLMs on CPU, it’s recommended to take a look at the llama.cpp project: GitHub - ggerganov/llama.cpp: LLM inference in C/C++. This one leverages a new format called GGUF.

There’s now also the MLX framework by Apple, which allows running these models on MacBooks: GitHub - ml-explore/mlx: MLX: An array framework for Apple silicon

What you could do is train a model using the Hugging Face tooling (PEFT, TRL, Transformers) and then export your model to the GGUF format: llama.cpp/convert-hf-to-gguf.py at master · ggerganov/llama.cpp · GitHub. You can then run your quantized model on CPU.
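
Once a GGUF file exists, running it from Python could look roughly like the sketch below. It assumes the separate llama-cpp-python bindings are installed (pip install llama-cpp-python) and that the model architecture is supported by the converter. The file name and the helper names llama_cpp_kwargs and generate are hypothetical, and the sampling values are simply copied from the original post.

```python
import os


def llama_cpp_kwargs(gguf_path: str, n_ctx: int = 2048) -> dict:
    """Constructor arguments for llama_cpp.Llama, using all CPU cores."""
    return {
        "model_path": gguf_path,
        "n_ctx": n_ctx,                    # context window in tokens
        "n_threads": os.cpu_count() or 1,  # CPU threads used for inference
    }


def generate(gguf_path: str, prompt: str, max_tokens: int = 100) -> str:
    """Run one completion on CPU and return the generated text."""
    # Lazy import: assumes the llama-cpp-python package is installed.
    from llama_cpp import Llama

    llm = Llama(**llama_cpp_kwargs(gguf_path))
    result = llm(prompt, max_tokens=max_tokens, temperature=0.25, top_k=50)
    return result["choices"][0]["text"]


if __name__ == "__main__":
    # Hypothetical file name; point this at your own converted model.
    print(generate("bloomz-7b1-q8_0.gguf", "Write code for finding prime numbers in Python."))
```

Here the quantization happens once, at conversion time, so nothing CUDA-specific is needed when the model is loaded.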

umakantk · February 3, 2025, 4:22am

I observed a similar issue and fixed it as below:

I used BitsAndBytesConfig:

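from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization; modules kept on the CPU stay in fp32.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)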

And then created the model object as:

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="cpu"
)

You can follow the instructions in the bitsandbytes installation guide to install bitsandbytes on Intel CPUs. Below are the commands:

git clone --depth 1 -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git && cd bitsandbytes/
pip install intel_extension_for_pytorch
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=cpu -S .
make
pip install -e .   # `-e` for "editable" install, when developing BNB (otherwise leave that out)

Hope it helps!
