How can I manually make use of the GPU to run inference faster?

Hello everyone,
I’m trying to run inference with a quantized Mistral 7B model, “TheBloke/Yarn-Mistral-7B-128k-GPTQ”, by creating a pipeline:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

MODEL_NAME = "TheBloke/Yarn-Mistral-7B-128k-GPTQ"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MAX_MEMORY = 14  # GiB cap for GPU 0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    device_map="auto",
    max_memory={0: f"{MAX_MEMORY}GiB"}
).eval()

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    # sampling settings are passed directly so they reach generate();
    # model_kwargs is only forwarded to from_pretrained and would be ignored here
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.15,
    top_p=0.95,
    top_k=40
)
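
Inference is then just a call on the pipeline (the prompt and max_new_tokens value here are placeholders):

output = generator("Tell me about the Yarn-Mistral context window.", max_new_tokens=128)
print(output[0]["generated_text"])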

When I run inference, the model only uses about 4–5 GB of GPU memory, which I assume is due to the quantization. Is there a way to make it utilize the entire available GPU memory during computation?
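
For reference, this is roughly how I’m checking memory use (assuming a single GPU at index 0):

print(model.hf_device_map)  # where accelerate placed each module
print(torch.cuda.memory_allocated(0) / 1024**3, "GiB allocated")
print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB total")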


For example,

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    # device_map="auto",
    device_map=DEVICE,  # pin the whole model to one device instead of auto-dispatching
    max_memory={0: f"{MAX_MEMORY}GiB"}
).eval()

or

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    # device_map="auto",
    max_memory={0: f"{MAX_MEMORY}GiB"}
).eval().to(DEVICE)  # or move the loaded model onto the GPU explicitly

I’ve tried that, but max_memory only acts as an upper limit; it doesn’t affect how much memory is actually used during computation.


Maybe it’s better not to specify max_memory at all.

max_memory is handled by accelerate, so as long as you pass it, accelerate’s dispatch logic gets pulled in, which is not very effective if you only want to use a single GPU.
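
For example, a minimal sketch that drops the cap and pins the whole model to GPU 0 (the device index is illustrative):

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    device_map={"": 0}  # place every module on GPU 0, no offloading
).eval()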

Or you could use a device_map other than “auto”:

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    device_map="sequential",  # fill GPU 0 up to max_memory before spilling over
    max_memory={0: f"{MAX_MEMORY}GiB"}
).eval()
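
With “sequential”, accelerate fills device 0 up to the max_memory cap before placing anything elsewhere, so on a single-GPU machine the whole model stays on that card.

Note that max_memory is only a ceiling either way; how much memory generation actually uses depends on the workload, mainly batch size and sequence length. A rough sketch of filling more of the GPU by batching prompts (the prompts and batch_size are illustrative):

# causal-LM tokenizers often ship without a pad token; batching needs one
tokenizer.pad_token_id = tokenizer.eos_token_id

outputs = generator(
    ["prompt one", "prompt two", "prompt three", "prompt four"],
    batch_size=4,  # larger batches use more GPU memory and raise throughput
    max_new_tokens=256
)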