Llama-3.2-11B inference took 6 minutes to answer

Dear all,
I am using the HF Spaces paid plan (ZeroGPU) with persistent storage.
I am running the standard code below to load "meta-llama/Llama-3.2-11B-Vision-Instruct" and its processor.
It took 6 minutes to get an answer to a very simple question, "Who is Donald Trump?", and I tried it many times – it always took too much time.
But the existing Spaces developed by the community that use "meta-llama/Llama-3.2-11B-Vision-Instruct" answer in 2-3 seconds. What is the problem, please?

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

import os
from huggingface_hub import HfApi

# Specify the new cache directory
cache_dir = "./hub"

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model with the custom cache directory
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    cache_dir=cache_dir  # Set custom cache directory
)

# Load the processor with the custom cache directory
processor = AutoProcessor.from_pretrained(
    model_id,
    cache_dir=cache_dir  # Set custom cache directory
)

question = "who is donald trump?"

# Process the text input using the processor
inputs = processor(text=question, return_tensors="pt").to(model.device)

# Generate the response from the model
output = model.generate(**inputs, max_new_tokens=350)  # Set max_new_tokens to limit response length

# Decode the model's output into human-readable text
decoded_output = processor.decode(output[0], skip_special_tokens=True)

# Remove the question part if it appears in the output by splitting the response
if question in decoded_output:
    response = decoded_output.split(question)[-1].strip()
else:
    response = decoded_output

# Print the clean response (answer only)
print(response)


The short answer is that you are not using the GPU in the ZeroGPU Space, since the GPU is only enabled the moment you explicitly enable it.
I can elaborate a bit more on this if you don't mind me taking a look at the entire source of the Space.

Thanks John.
How can I give you access?

I have just added a GPU to my Space; see the image below…
Will this enable the GPU on my Space?


@John6666
As stated in the previous image, and even after specifying that my hardware uses a GPU, I tested whether the GPU is used or not – the answer was
"GPU is not available. The model will use the CPU."

Any help please… I am really wasting my time… I cannot add new code to finish my task… I am just stuck on this GPU issue.

# Check if GPU is available and print message
if torch.cuda.is_available():
    print("GPU is available. The model will use the GPU.")
else:
    print("GPU is not available. The model will use the CPU.")

I think I may have found the cause: the T4 has 16 GB of VRAM, which is not enough for the 11B model, while a ZeroGPU Space can use 40 GB of VRAM for a moment, so the 11B model works there.
With 16 GB, a 4-bit quantized model should work, so maybe you could try that.
There is also a way to quantize a regular model at load time, roughly as sketched below.
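For illustration only, here is a minimal sketch of load-time 4-bit quantization with Transformers' BitsAndBytesConfig; the NF4 settings and float16 compute dtype are my assumptions for a T4, not something taken from your Space:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Quantize the weights to 4-bit NF4 while loading, so the 11B model
# fits into a T4's 16 GB of VRAM; compute stays in float16 because
# the T4 does not support bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

This needs the bitsandbytes package in requirements.txt, and the 4-bit weights trade a little output quality for the much smaller memory footprint.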

@John6666

No John – "GPU is not available. The model will use the CPU." is related to the UNAVAILABILITY of the CUDA library.
I could solve this problem on my local laptop by installing the needed PyTorch and CUDA libraries…

But for HF Spaces, these CUDA libraries should be installed by default on the VM/container… I don't have access to the VM/container to install such files…

Can you share this with other Hugging Face people who can help?

Regards,
Omran


Sorry, I forgot to explain the basic use of a ZeroGPU Space.
A ZeroGPU Space makes CUDA invisible except to functions decorated with @spaces.GPU (and code in the global scope).
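As a rough sketch of what that looks like (the function name and the Gradio wiring below are placeholders, not your actual app): load the model in the global scope, then put the GPU-bound work inside a function decorated with @spaces.GPU.

import spaces
import torch
import gradio as gr
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Global scope: the model is loaded once when the Space starts
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

@spaces.GPU  # a GPU is attached only while this function is running
def answer(question: str) -> str:
    inputs = processor(text=question, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=350)
    return processor.decode(output[0], skip_special_tokens=True)

gr.Interface(fn=answer, inputs="text", outputs="text").launch()

Outside a @spaces.GPU-decorated function, torch.cuda.is_available() typically reports False in a ZeroGPU Space, which would explain the message your check printed.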