Is it possible to run a model using gr.load() and then use it locally?

So, I really liked the model at models/Qwen/Qwen2.5-72B-Instruct. I liked it so much that I even bought Pro to get started with it, but the API inference keeps hitting the rate limit and giving me issues. Is there a way to just run it locally instead, especially through gr.load(), since that works perfectly for my use case?


Not every model can be used this way, but Qwen’s seems to have the Serverless Inference API enabled, so the following method can be used.
If the model page says Warm, assume it is usable.
Note that if you forget to pass a token, you can only make one request per hour.
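
For example, a minimal sketch of passing a token to gr.load() (the keyword is hf_token in Gradio 4.x and may be token in newer releases; HF_TOKEN is assumed to already be set in your environment):

import os
import gradio as gr

# Read the token from the environment; adjust the keyword name to your Gradio version.
demo = gr.load("models/Qwen/Qwen2.5-72B-Instruct", hf_token=os.environ["HF_TOKEN"])
demo.launch()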

Since you’re on the Pro subscription, you may also want to use HuggingChat, a Pro benefit. It has recently added a tools feature that makes it even more powerful.

Can you please tell me the code for that?
I was using this
gr.load("models/Qwen/Qwen2.5-72B-Instruct").launch()
to create a Space, so how do I use it locally? Currently I’m using the Inference API, but the speed is a bit slow, due to latency issues I guess.

So you want to use the GUI from Gradio. I’ll go look for some examples.

Edit:
It would be good to duplicate either of these. In any case, the token must be set as an environment variable. The way to set environment variables differs depending on the operating system.
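
A rough sketch of the per-OS difference (hf_xxx is a placeholder for your own token):

# Linux/macOS (current shell session only):
#   export HF_TOKEN=hf_xxx
# Windows (persists for new shells):
#   setx HF_TOKEN hf_xxx
# Or set it from Python before calling any Hugging Face library:
import os
os.environ["HF_TOKEN"] = "hf_xxx"  # placeholder token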

No, I don’t want the GUI, that’s the issue. I want a way to run the model for inference from my own code without the API, e.g. a for loop translating 100 sentences with Qwen, right from my code. Sorry! I didn’t explain my issue properly.

I see… that way. I think the free API can do about 300 requests per hour, but if you run it locally, it’s unlimited.
You would need a huge amount of VRAM instead, but if you have the VRAM, I can tell you how.
The amount of VRAM needed is at least a little over 20 GB with 4-bit quantization; if you don’t quantize, you’ll need about 80 GB.
How much VRAM do you have available?

I have access to an A6000 right now, so 48 gigs of VRAM and 40 gigs of GPU.


That much would be enough for a quantized version! The code below should work; Qwen 72B is not a gated model, so you don’t need a token, and the model only needs to be downloaded (automatically) the first time.

pip install -U transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Qwen/Qwen2.5-72B-Instruct"

# 4-bit NF4 quantization config (double quantization, bfloat16 compute dtype)
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)


# Download (first run only) and load the quantized model, then the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             device_map=device, quantization_config=nf4_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, strip the prompt tokens, and decode only the newly generated text
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

The automatic download goes to the HF cache folder, so if you want to put the model on a drive other than the system drive, set the cache-folder environment variable (e.g. HF_HOME) to that drive in advance. It is very difficult to move the files later.
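
A minimal sketch of redirecting the cache, assuming a hypothetical /mnt/models path; HF_HOME has to be set before the Hugging Face libraries are imported:

import os
os.environ["HF_HOME"] = "/mnt/models/hf_cache"  # hypothetical path on a bigger drive

# Only import transformers after HF_HOME is set so downloads land there
from transformers import AutoModelForCausalLM, AutoTokenizer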

OK, thanks a lot! Giving this a try.


If the above code works, then you can do most of the rest just by changing the code near the bottom, referring to the HF model page.

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
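
For instance, a rough sketch of the loop you described, translating a list of sentences with the model and tokenizer loaded above (the sentences and the translation instruction are placeholders):

sentences = ["First sentence to translate.", "Second sentence to translate."]  # placeholder data
translations = []

for sentence in sentences:
    messages = [
        {"role": "system", "content": "You are a translator. Translate the user's text into French."},  # placeholder instruction
        {"role": "user", "content": sentence},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=256)
    generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    translations.append(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

print(translations)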

Welp it failed on CPU

---------------------------------------------------------------------------
TorchRuntimeError                         Traceback (most recent call last)
Cell In[6], line 13
      7 model_name = "Qwen/Qwen2.5-72B-Instruct"
      9 nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
     10                                 bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)
---> 13 model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
     14                                              device_map=device, quantization_config=nf4_config)
     15 tokenizer = AutoTokenizer.from_pretrained(model_name)
     17 prompt = "Give me a short introduction to large language model."

File /usr/local/lib/python3.11/dist-packages/transformers/models/auto/auto_factory.py:564, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    562 elif type(config) in cls._model_mapping.keys():
    563     model_class = _get_model_class(config, cls._model_mapping)
--> 564     return model_class.from_pretrained(
    565         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    566     )
    567 raise ValueError(
    568     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    569     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    570 )

File /usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4014, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   4004     if dtype_orig is not None:
   4005         torch.set_default_dtype(dtype_orig)
   4007     (
   4008         model,
   4009         missing_keys,
   4010         unexpected_keys,
   4011         mismatched_keys,
   4012         offload_index,
   4013         error_msgs,
-> 4014     ) = cls._load_pretrained_model(
   4015         model,
   4016         state_dict,
   4017         loaded_state_dict_keys,  # XXX: rename?
   4018         resolved_archive_file,
   4019         pretrained_model_name_or_path,
   4020         ignore_mismatched_sizes=ignore_mismatched_sizes,
   4021         sharded_metadata=sharded_metadata,
   4022         _fast_init=_fast_init,
   4023         low_cpu_mem_usage=low_cpu_mem_usage,
   4024         device_map=device_map,
   4025         offload_folder=offload_folder,
   4026         offload_state_dict=offload_state_dict,
   4027         dtype=torch_dtype,
   4028         hf_quantizer=hf_quantizer,
   4029         keep_in_fp32_modules=keep_in_fp32_modules,
   4030         gguf_path=gguf_path,
   4031     )
   4033 # make sure token embedding weights are still tied if needed
   4034 model.tie_weights()

When using the GPU, bitsandbytes is raising an AttributeError, since apparently it wasn’t compiled with CUDA.

I think it’s torch. It’s the biggest obstacle in setting up a local generative-AI environment. Well, it’s not that it’s difficult, just troublesome…
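
A quick sketch to check whether the installed torch actually sees the GPU:

import torch

print(torch.__version__)          # a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())  # False means bitsandbytes cannot use CUDA either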

I think I need to use this article, as I am using a remote Linux-based machine.

I’ve never done this, so I’m a bit reluctant, although I know it’s a one-time procedure to avoid future discomfort.

If you can find a CUDA-bundled version (about 4 GB), you may be able to skip a lot of steps. The following page talks about Windows, but there are of course Linux versions as well.
Also, if possible, it is easier to manage versions if you install it in a virtual environment.
Anyway, you can’t avoid this one; it’s like half of the main body of an AI setup.
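
As a rough sketch for a remote Linux machine (cu121 is an assumption; pick the CUDA build that matches your driver from the PyTorch install page):

python -m venv qwen-env                                                # keep the install in a virtual environment
source qwen-env/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121   # CUDA-bundled PyTorch build
pip install -U transformers accelerate bitsandbytes
python -c "import torch; print(torch.cuda.is_available())"             # should print True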