Hi
I am using Mistral 7B v0.1 with 4-bit quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map=device,  # device is defined earlier in my script
    quantization_config=quantization_config,
)
Once loaded, the model uses about 8 GB of GPU memory. However, every time I use the model it loads the checkpoint shards again, and for that my desktop needs to be online, which I don't understand. Is there any reference I could consult to understand why?
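For context, from the Transformers docs I would expect that once the shards are in the local cache, loading can be forced fully offline, roughly like this (this is my understanding, assuming the weights were fully downloaded on a previous run):

```python
import os

# Force the Hugging Face hub and Transformers into offline mode
# *before* importing transformers; with these set, from_pretrained
# should read only from the local cache and never hit the network.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Then the usual load, additionally passing local_files_only=True, e.g.:
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     local_files_only=True,
#     quantization_config=quantization_config,
# )
```

So I am surprised that, without these flags, being online seems to be required even though nothing should need re-downloading.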
Cheers,
Aldertom