Hi everyone, I'm a beginner with Mini-R1, trying to run the Countdown task with GRPOTrainer and vLLM, but it always fails whenever I apply quantization.
The code works well for:
accelerate + DeepSpeed + Qwen2.5 + vLLM
accelerate + PEFT + Qwen2.5 + vLLM
accelerate + PEFT + 4-bit quantization + Qwen2.5
But with “accelerate + PEFT + 4-bit quantization + Qwen2.5 + vLLM”, I always get this error:
```
[rank0]: File "/opt/Miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1008, in weight_loader
[rank0]:     assert param_data.shape == loaded_weight.shape
```
Can anyone help me with this, please?
Is there any tutorial on making SFTTrainer work with vLLM on quantized LLMs, please?
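For reference, here is a rough, minimal sketch of the combination that fails; the model name, dataset, reward function, and hyperparameters below are placeholders rather than my exact script:
```
# Rough sketch of the failing combination: 4-bit bitsandbytes + PEFT + GRPOTrainer + vLLM.
# Model name, dataset, reward function, and hyperparameters are placeholders.
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import GRPOConfig, GRPOTrainer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

def dummy_reward(completions, **kwargs):
    # Placeholder reward; the real Countdown reward checks the answer format and the equation result.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_dict(
    {"prompt": ["Using the numbers [2, 3, 7], create an equation that equals 23."] * 4}
)

training_args = GRPOConfig(
    output_dir="qwen-grpo-countdown",
    per_device_train_batch_size=4,
    num_generations=4,
    max_completion_length=256,
    use_vllm=True,  # dropping this flag (or the quantization_config) makes training run
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=dummy_reward,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()  # fails inside vLLM's weight_loader with the shape assertion above
```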
It seems this isn’t supported for now…
GitHub issue (opened 20 Mar 2025 · ✨ enhancement · 🏋 GRPO):
### Feature request
I saw another issue #3054 mentioning how the GRPO Config does not support all vLLM parameters. The specific issue I am having is that when I want to use a quantized model like `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`, I am forced to edit the code in the GRPO Trainer where we load the model with vLLM to this:
```
self.llm = LLM(
model=model.name_or_path,
device=vllm_device,
gpu_memory_utilization=self.args.vllm_gpu_memory_utilization,
dtype=self.args.vllm_dtype,
# Automatic Prefix Caching caches the KV cache of existing queries, so that a new query can
# directly reuse the KV cache if it shares the same prefix with one of the existing queries.
# This is particularly useful here because we generate completions from the same prompts.
enable_prefix_caching=self.args.vllm_enable_prefix_caching,
max_model_len=self.args.vllm_max_model_len,
quantization="bitsandbytes",
load_format="bitsandbytes",
)
```
per this [unsloth issue](https://github.com/unslothai/unsloth/issues/960#issuecomment-2395887281); otherwise I get `KeyError: 'layers.0.mlp.down_proj.weight'`. I saw [#2728 comment](https://github.com/huggingface/trl/pull/2728#issuecomment-2635166424), so maybe something similar could be done here, where we pass some quantization config to vLLM.
### Motivation
I want to be able to use quantized models with vLLM and GRPO without modifying the trainer myself, but rather by passing in some configs.
### Your contribution
Add the arguments `vllm_quantization` and `vllm_load_format`, similar to [this pr](https://github.com/huggingface/trl/pull/2728/files). I saw some discussion in the same PR suggesting `vllm_init_kwargs`, e.g. [this comment](https://github.com/huggingface/trl/pull/2728#issuecomment-2629328181), but I do not see it in the merged PR. That approach seems fine as well.
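As a sanity check outside the trainer, the same two settings from the edited trainer code above can be passed to a standalone vLLM engine, to confirm the bnb-4bit checkpoint loads at all; a sketch, assuming a vLLM build with bitsandbytes support:
```
# Standalone vLLM load of a bnb-4bit checkpoint, using the same quantization/load_format
# settings that the edited GRPO Trainer passes to LLM(...).
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```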
And related GRPO + vLLM issues:
GitHub issue (opened 21 Feb 2025 · ✨ enhancement · 🏋 GRPO):
https://github.com/huggingface/trl/blob/e5ae703d352b29537159180087ef8bd4b41bf625/trl/trainer/grpo_trainer.py#L439-L461
In the current GRPO implementation, VLLM can only run on a single GPU, which becomes a performance bottleneck. For example, in an 8-GPU setup, the remaining 7 GPUs have to wait for 1 GPU to complete inference, and it also can't accommodate larger models.
How can we enable VLLM to run on multiple GPUs? The only concern is that we need to figure out a way to update the parameters across multiple GPUs each time the model is reloaded:
https://github.com/huggingface/trl/blob/e5ae703d352b29537159180087ef8bd4b41bf625/trl/trainer/grpo_trainer.py#L624-L653
GitHub issue (opened 30 Jan 2025 · ✨ enhancement · 🏋 GRPO):
### Feature request
It seems that the vLLM device can only be set in [GRPOConfig.vllm_device](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_config.py#L130), which is a string corresponding to a CUDA device identifier. I think this implies that the vLLM device can only use a single GPU, which can be a bottleneck for RL. It is also possible to use a subset by setting the CUDA_VISIBLE_DEVICES environment variable, but this might break TRL. Is there a more convenient way to specify multiple GPUs in a single node for training (or any hacks that would work now)? Furthermore, there might need to be more detailed configurations for multi-node vLLM/GRPO training runs.
### Motivation
Enhance training efficiency for RL with >single GPU sampling.
### Your contribution
N/A
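For what it's worth, standalone vLLM can already shard a model across several GPUs via tensor parallelism; the open question in these issues is how the trainer would push updated policy weights to every vLLM worker. A minimal sketch of plain multi-GPU inference, outside of TRL:
```
# Plain multi-GPU vLLM inference via tensor parallelism (not wired into GRPOTrainer,
# which would still need to sync updated policy weights to every vLLM worker).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,  # shard the model across 2 GPUs
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(
    ["Using the numbers [2, 3, 7], create an equation that equals 23."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```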
Any tutorial on how to modify the configuration would also help, please.
I saw on the vLLM official website that they support quantized LLMs with PEFT, but I wasn’t able to find any tutorial on modifying the existing Trainer.
Even without GRPOTrainer, any tutorial on making SFTTrainer work would also help, please.
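To be concrete, the baseline I would like to extend is plain QLoRA SFT without vLLM, roughly like the sketch below (model name, dataset, and hyperparameters are placeholders):
```
# QLoRA SFT baseline with no vLLM involved; placeholders throughout.
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

train_dataset = Dataset.from_dict(
    {"text": ["### Question: Using [2, 3, 7], reach 23.\n### Answer: 2 + 3 * 7 = 23"]}
)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="qwen-sft-4bit", max_seq_length=512, per_device_train_batch_size=1),
    train_dataset=train_dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```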
I can’t find a tutorial or reference either…
Is it possible that the only way to train is with Transformers or other libraries…?
If you’re looking for speed, you could check out unsloth’s training tools.
Unfortunately, unsloth’s free version does not seem to support DDP.