Hello everyone,
I’m trying to run inference with the GPTQ-quantized Yarn Mistral 7B model ("TheBloke/Yarn-Mistral-7B-128k-GPTQ") and to build a text-generation pipeline around it:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

MODEL_NAME = "TheBloke/Yarn-Mistral-7B-128k-GPTQ"
MAX_MEMORY = 14  # GiB cap for GPU 0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

# device_map="auto" places the (4-bit GPTQ, fp16-compute) weights on the GPU,
# so no explicit .to(device) call or torch_dtype is needed here.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=False,
    revision="main",
    device_map="auto",
    max_memory={0: f"{MAX_MEMORY}GiB"},
).eval()

# Sampling parameters are generation kwargs, not model_kwargs (model_kwargs is
# only forwarded to from_pretrained), so they are passed to the pipeline
# directly; do_sample=True is required for temperature/top_p/top_k to take effect.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.15,
    top_p=0.95,
    top_k=40,
)
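
For reference, the generation call itself is nothing special; it looks roughly like this (the prompt text and max_new_tokens value are just placeholders):

prompt = "..."  # placeholder prompt
outputs = generator(prompt, max_new_tokens=256, return_full_text=False)
print(outputs[0]["generated_text"])
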
When I run inference, the model only uses about 4–5 GB of GPU memory, which I assume is due to the quantization. Is there a way to make it use the full available GPU memory during computation?
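
In case it helps, a quick way to check what is actually allocated from inside the script is torch.cuda's standard memory accounting (nothing model-specific, just read after a generation call):

gib = 1024 ** 3
print(f"allocated:      {torch.cuda.memory_allocated(0) / gib:.2f} GiB")
print(f"reserved:       {torch.cuda.memory_reserved(0) / gib:.2f} GiB")
print(f"peak allocated: {torch.cuda.max_memory_allocated(0) / gib:.2f} GiB")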