Run Mistral model on CPU only

Hey there,
unfortunately I do not have a GPU, so I want to run the Mistral model on my CPU only. I have a RAG project and everything works fine except the following Mistral part:

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
import torch
import faiss   # used for indexing; install with: pip install faiss-cpu
from transformers import (RagRetriever,
                          RagSequenceForGeneration,
                          RagTokenizer)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
.....
print("Mistral Models")
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
tokenizer_2 = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

print("HUUUHUUUU")
# removed device_map="cuda:0" here so the model loads on CPU
model_2 = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                               trust_remote_code=False,
                                               revision="gptq-4bit-32g-actorder_True")
# Save the Mistral model and tokenizer
print("saving mistral now....")
model_2.save_pretrained("mistralModel")
tokenizer_2.save_pretrained("mistralTokenizer")

The error I get is:

.....
Mistral Models
HUUUHUUUU
CUDA extension not installed.
CUDA extension not installed.
Traceback (most recent call last):
  File ".....\test.py", line 263, in <module>
    model_2 = AutoModelForCausalLM.from_pretrained(model_name_or_path,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".....\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".....\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3928, in from_pretrained
    model = quantizer.post_init_model(model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".....Python311\Lib\site-packages\optimum\gptq\quantizer.py", line 587, in post_init_model
    raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
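
If I understand the last line of the error correctly, the Exllama backend can be turned off through the quantization config. Based on that hint I assume something like the sketch below is meant (I'm not sure GPTQConfig with disable_exllama is the right way to pass it, and the flag name may differ between transformers/optimum versions):

from transformers import AutoModelForCausalLM, GPTQConfig

# assumption based on the error message: disable_exllama=True should
# switch off the GPU-only Exllama kernels for this GPTQ checkpoint
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

model_2 = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=False,
    revision="gptq-4bit-32g-actorder_True",
    quantization_config=quantization_config,
)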

I have already tried many different things but could not fix this issue. I would be glad for any kind of help.
Thx Markus