How to run Trainer + vLLM on Quantized LLMs?

Hi everyone, I am a beginner with Mini-R1, trying to run the Countdown task with GRPOTrainer and vLLM, but it always fails whenever I apply quantization.

The code works well for:

accelerate + DeepSpeed + Qwen2.5 + vLLM
accelerate + PEFT + Qwen2.5 + vLLM
accelerate + PEFT + 4-bit quantization + Qwen2.5

When I use “accelerate + PEFT + 4-bit quantization + Qwen2.5 + vLLM”, I always get this error:
[rank0]: File "/opt/Miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1008, in weight_loader
[rank0]:     assert param_data.shape == loaded_weight.shape
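
For context, here is a rough sketch of the failing combination (assumed setup: the model id, LoRA settings, dataset, and reward are illustrative placeholders, not my exact script):

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# 4-bit base model via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",  # illustrative model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Toy stand-ins for the Countdown dataset and reward function
dataset = Dataset.from_dict({"prompt": ["Use 3, 5, 7 to make 12."]})
def reward_fn(completions, **kwargs):
    return [0.0 for _ in completions]  # placeholder reward

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="out", use_vllm=True),  # vLLM generation enabled
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32),
)
trainer.train()  # fails in vLLM's weight_loader as shown above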

Can anyone help me with this, please?

Is there any tutorial for making SFTTrainer work with vLLM on quantized LLMs, please?

1 Like

It seems it isn’t supported for now…

And here are some related GRPO + vLLM issues:

2 Likes

Any tutorial on how to modify the configuration file would also help, please.

I saw that the official vLLM website says they support quantized LLMs with PEFT; however, I wasn’t able to find any tutorial on how to modify the existing Trainer. :sweat_smile:
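
For reference, this is roughly what the vLLM docs describe on the inference side (a minimal sketch, assuming a bitsandbytes-quantized base; the model id and adapter path are illustrative, and it does not cover the Trainer integration):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Quantized base model with a LoRA adapter applied at inference time
llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",  # pre-quantized checkpoint (illustrative)
    quantization="bitsandbytes",
    enable_lora=True,
)
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=16),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora"),  # hypothetical adapter path
)
print(outputs[0].outputs[0].text)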

Even if not with GRPOTrainer, any tutorial on how to make SFTTrainer work would also help, please.

2 Likes

I can’t find a tutorial or reference either… :thinking:
Is it possible that the only way to train is with Transformers or other libraries…?
If you’re looking for speed, you could check out Unsloth’s training tools.

Unfortunately, Unsloth does not seem to offer DDP support in its free version. :melting_face:

1 Like

Since I have stumbled upon this post twice now, I wanted to share another source I found. From the vLLM documentation, it seems this should be possible now.

However, I am also still figuring this out, so I would love to see other reference implementations :slight_smile:
From what I gather, you can either 1) use a HF model name for an existing model that is already quantized (e.g. model_id = "unsloth/tinyllama-bnb-4bit"), or 2) update the vLLM package and pass the quantization method as a parameter, though I think you will have to modify your TRL package a bit to support this argument.

import torch
from vllm import LLM  # missing import in the original snippet

model_id = "huggyllama/llama-7b"

# Option 2: let vLLM quantize the full-precision checkpoint in-flight with bitsandbytes
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
)
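
For completeness, option 1 would look something like this (a sketch, assuming a recent vLLM; "unsloth/tinyllama-bnb-4bit" is the pre-quantized checkpoint mentioned above):

from vllm import LLM

# Option 1: point vLLM at a checkpoint that is already quantized with bitsandbytes
llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",
    quantization="bitsandbytes",
)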
1 Like