So, I really liked the model in models/Qwen/Qwen2.5-72B-Instruct. I liked it so much that I even bought Pro to get started with it, but the API inference keeps hitting the rate limit and giving me issues. Is there a way to just run it locally instead, especially through gr.load(), since that works perfectly for my use case?
This isn't possible for every model, but Qwen's seems to have the Serverless Inference API enabled, so the following method can be used.
If it says Warm, assume it is usable.
Note that if you forget to pass a token, you can only make one request per hour.
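For example, a minimal sketch of that (assuming your token is stored in an HF_TOKEN environment variable; depending on your Gradio version the keyword may differ, but hf_token is the usual one):
import os
import gradio as gr

# Minimal sketch: serve the model's Serverless Inference API through Gradio.
# Assumes the HF_TOKEN environment variable holds your Hugging Face token.
demo = gr.load(
    "models/Qwen/Qwen2.5-72B-Instruct",
    hf_token=os.environ.get("HF_TOKEN"),  # without a token the hourly rate limit is very low
)
demo.launch()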
If you're on the Pro subscription, you may also want to use HuggingChat, a Pro benefit. It has recently added a tools feature that makes it even more powerful.
Can you please tell me the code for the same?
Was using this
gr.load("models/Qwen/Qwen2.5-72B-Instruct").launch()
to create a Space, so how do I use it locally? Currently I'm using the Inference API, but the speed's a bit slow, due to latency issues I guess.
You want to use the GUI from Gradio. I'll go look for some examples.
Edit:
It would be good to duplicate either of these. In any case, the token must be set as an environment variable. The way to set environment variables differs depending on the operating system; see the sketch below.
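For reference, a portable sketch of setting it from Python itself; HF_TOKEN is the variable name most Hugging Face libraries read, and the shell commands in the comments are the usual per-OS alternatives:
import os

# Set the token before any Hugging Face calls are made.
# Linux/macOS shell alternative:  export HF_TOKEN=hf_xxx
# Windows (cmd) alternative:      setx HF_TOKEN hf_xxx
os.environ["HF_TOKEN"] = "hf_xxx"  # placeholder; use your own token and never commit it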
No, I don't want the GUI, that's the issue. I want a way to run inference from my own code without the API, for example a for loop that translates 100 sentences with Qwen, right from my code. Sorry, I didn't explain my issue properly.
I see…that way. I think the free API can do about 300 requests per hour, but if you run it locally, it’s unlimited.
You would need a huge amount of VRAM instead, but if you have the VRAM, I can tell you how.
The amount of VRAM needed is at least a little over 20 GB with 4-bit quantization. If you don't quantize, you'll need about 80 GB.
How much VRAM do you have available?
I have access to an A6000 right now, so 48 gigs of VRAM and 40 gigs of GPU.
That much would be enough for a quantized version! The code below should work; Qwen 72B is not a gated model, so you don't need a token, and the weights only need to be downloaded (automatically) the first time.
pip install -U transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Qwen/Qwen2.5-72B-Instruct"

# 4-bit NF4 quantization to shrink the 72B weights enough for a single GPU
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The first run downloads the weights automatically to the HF cache folder
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=device,
    quantization_config=nf4_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

# Build the chat-formatted prompt and tokenize it
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output before decoding
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
The automatic download goes to the HF cache folder, so if you want to put the model on a drive other than the system drive, you should set the cache-folder environment variable (HF_HOME) to that drive in advance. It is very difficult to move it later.
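A minimal sketch of that, with a hypothetical mount point; set it before importing transformers (or in your shell profile) so the weight shards are downloaded to the right place:
import os

# HF_HOME moves the whole Hugging Face cache; adjust the path to your own drive.
os.environ["HF_HOME"] = "/mnt/bigdrive/hf_cache"  # hypothetical path

from transformers import AutoModelForCausalLM  # import only after the variable is set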
Ok Thanks a lot! Giving this a try
If the above code works, then you can do most of the rest by just changing the code around the bottom, referring to the HF page.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
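For example, the 100-sentence translation loop you mentioned could look roughly like this, reusing the model and tokenizer loaded above (a sketch only; the system prompt, target language, and sentence list are placeholders):
def translate(sentence: str) -> str:
    # Wrap one sentence in a chat prompt and generate a translation.
    messages = [
        {"role": "system", "content": "You are a translator. Translate the user's sentence into English."},
        {"role": "user", "content": sentence},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=256)
    generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

sentences = ["First sentence.", "Second sentence."]  # replace with your 100 sentences
translations = [translate(s) for s in sentences]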
Welp it failed on CPU
---------------------------------------------------------------------------
TorchRuntimeError Traceback (most recent call last)
Cell In[6], line 13
7 model_name = "Qwen/Qwen2.5-72B-Instruct"
9 nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
10 bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)
---> 13 model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
14 device_map=device, quantization_config=nf4_config)
15 tokenizer = AutoTokenizer.from_pretrained(model_name)
17 prompt = "Give me a short introduction to large language model."
File /usr/local/lib/python3.11/dist-packages/transformers/models/auto/auto_factory.py:564, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
562 elif type(config) in cls._model_mapping.keys():
563 model_class = _get_model_class(config, cls._model_mapping)
--> 564 return model_class.from_pretrained(
565 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
566 )
567 raise ValueError(
568 f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
569 f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
570 )
File /usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py:4014, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
4004 if dtype_orig is not None:
4005 torch.set_default_dtype(dtype_orig)
4007 (
4008 model,
4009 missing_keys,
4010 unexpected_keys,
4011 mismatched_keys,
4012 offload_index,
4013 error_msgs,
-> 4014 ) = cls._load_pretrained_model(
4015 model,
4016 state_dict,
4017 loaded_state_dict_keys, # XXX: rename?
4018 resolved_archive_file,
4019 pretrained_model_name_or_path,
4020 ignore_mismatched_sizes=ignore_mismatched_sizes,
4021 sharded_metadata=sharded_metadata,
4022 _fast_init=_fast_init,
4023 low_cpu_mem_usage=low_cpu_mem_usage,
4024 device_map=device_map,
4025 offload_folder=offload_folder,
4026 offload_state_dict=offload_state_dict,
4027 dtype=torch_dtype,
4028 hf_quantizer=hf_quantizer,
4029 keep_in_fp32_modules=keep_in_fp32_modules,
4030 gguf_path=gguf_path,
4031 )
4033 # make sure token embedding weights are still tied if needed
4034 model.tie_weights()
On using the GPU, bitsandbytes raises an AttributeError, since apparently it's not compiled with CUDA.
I think it's the torch install. That's the biggest obstacle in setting up a local generative AI environment. Well, it's not that it's difficult, just troublesome…
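A quick way to check whether the installed torch build can actually see CUDA (run it in the same environment):
import torch

print(torch.__version__)          # a "+cpu" suffix here means a CPU-only build
print(torch.cuda.is_available())  # should be True on the A6000 machine
print(torch.version.cuda)         # CUDA version torch was built against; None for CPU-only builds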
I think I need to use this article, as I am using a remote Linux-based machine.
Never done this, so I'm a bit reluctant, although I know it's a one-time procedure to avoid future discomfort.
If you can find a CUDA-bundled version (about 4 GB), you may be able to skip a lot of steps. The following page talks about Windows, but there are of course Linux versions as well.
Also, if possible, it would be easier to manage versions if you put it in a virtual environment.
Anyway, you can’t avoid this one, it’s like half of the main body of AI.
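For example, on a Linux machine the whole thing can be roughly this (a sketch only; the cu121 wheel index is one common choice, so match it to your driver's CUDA version):
python -m venv qwen-env
source qwen-env/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -U transformers accelerate bitsandbytes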
Thanks for the information about running a model. Your information was very helpful for me.
You like the model and I liked it very much too. Thanks for sharing.
It sounds like you're experiencing rate limit issues with the API. To run the model locally with Gradio, you can try loading the model using gr.load() in your script.