Model shards checkpoint on a GeForce RTX 4070 Ti (12 GB)

Hi,

I am using Mistral 7B v0.1 with 4-bit quantization:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
device = "cuda:0"  # single RTX 4070 Ti

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map=device,
    quantization_config=quantization_config,
)

Once the model is in memory it uses about 8 GB of GPU memory. However, every time I use the model it goes through the "loading checkpoint shards" step, and for that step my desktop needs to be online, which I don't understand.
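For context, this is the offline-mode workaround I came across (assuming the `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` environment variables and the `local_files_only` argument behave as the docs suggest), but I would still like to understand why the shards need the network in the first place:

```python
import os

# Force the Hugging Face hub into offline mode so that from_pretrained()
# reads only from the local cache instead of contacting the network.
# These must be set before importing/using transformers.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# The per-call equivalent would be to pass local_files_only=True, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     local_files_only=True,
#     ...
# )
```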

Is there any reference I could consult to understand why this happens?

Cheers,

Aldertom