Hi
I am using Mistral 7B v0.1 with 4-bit quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map=device,  # device is defined earlier in my script
    quantization_config=quantization_config,
)
Once loaded, the model uses about 8 GB of GPU memory. However, every time I use the model it loads the checkpoint shards again, and for that my desktop needs to be online, which I don't understand. Is there any reference I could consult to understand why?
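For context, from the Transformers docs I would expect that once the shards are in the local cache, loading can be forced fully offline, roughly like this (this is my understanding, assuming the weights were fully downloaded on a previous run):

```python
import os

# Force the Hugging Face hub and Transformers into offline mode
# *before* importing transformers; with these set, from_pretrained
# should read only from the local cache and never hit the network.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Then the usual load, additionally passing local_files_only=True, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     local_files_only=True,
#     quantization_config=quantization_config,
# )
```

So I am surprised that, without these flags, being online seems to be required even though nothing should need re-downloading.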
Cheers,
Aldertom