Hello everyone,
I’m trying to run inference with the GPTQ-quantized Yarn Mistral 7B model ("TheBloke/Yarn-Mistral-7B-128k-GPTQ") and to build a text-generation pipeline around it:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

MODEL_NAME = "TheBloke/Yarn-Mistral-7B-128k-GPTQ"
MAX_MEMORY = 14  # GiB cap for GPU 0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

# device_map="auto" places the (4-bit GPTQ, fp16-compute) weights on the GPU,
# so no explicit .to(device) call or torch_dtype is needed here.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    device_map="auto",
    max_memory={0: f"{MAX_MEMORY}GiB"},
).eval()

# Sampling parameters are generation kwargs, not model_kwargs (model_kwargs is
# only forwarded to from_pretrained), so they are passed to the pipeline
# directly; do_sample=True is required for temperature/top_p/top_k to take effect.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.15,
    top_p=0.95,
    top_k=40,
)
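
For reference, the generation call itself is nothing special; it looks roughly like this (the prompt text and max_new_tokens value are just placeholders):

prompt = "..."  # placeholder prompt
outputs = generator(prompt, max_new_tokens=256, return_full_text=False)
print(outputs[0]["generated_text"])
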
When I run inference, the model only uses about 4–5 GB of GPU memory, which I assume is due to the quantization. Is there a way to make it use the full available GPU memory during computation?
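
In case it helps, a quick way to check what is actually allocated from inside the script is torch.cuda's standard memory accounting (nothing model-specific, just read after a generation call):

gib = 1024 ** 3
print(f"allocated:      {torch.cuda.memory_allocated(0) / gib:.2f} GiB")
print(f"reserved:       {torch.cuda.memory_reserved(0) / gib:.2f} GiB")
print(f"peak allocated: {torch.cuda.max_memory_allocated(0) / gib:.2f} GiB")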