How to run Trainer + vLLM on Quantized LLMs?

Hi everyone, I am a beginner with Mini-R1, trying to run the Countdown task with GRPOTrainer and vLLM, but it always fails whenever I apply quantization.

The code works well for:

accelerate + DeepSpeed + Qwen2.5 + vLLM
accelerate + PEFT + Qwen2.5 + vLLM
accelerate + PEFT + 4-bit quantization + Qwen2.5

When I use “accelerate + PEFT + 4-bit quantization + Qwen2.5 + vLLM”, I always get this error:
[rank0]: File "/opt/Miniconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1008, in weight_loader
[rank0]:     assert param_data.shape == loaded_weight.shape
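
For context, here is a rough sketch of the failing combination (assumed setup: the model id, LoRA settings, dataset, and reward are illustrative placeholders, not my exact script):

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# 4-bit base model via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",  # illustrative model id
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Toy stand-ins for the Countdown dataset and reward function
dataset = Dataset.from_dict({"prompt": ["Use 3, 5, 7 to make 12."]})
def reward_fn(completions, **kwargs):
    return [0.0 for _ in completions]  # placeholder reward

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="out", use_vllm=True),  # vLLM generation enabled
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32),
)
trainer.train()  # fails in vLLM's weight_loader as shown above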

Can anyone help me with this, please?

Is there any tutorial for making SFTTrainer work with vLLM on quantized LLMs, please?

1 Like

It seems it isn’t supported for now…

And here are some related GRPO + vLLM issues:

2 Likes

Any tutorial on how to modify the configuration file would also help, please.

I saw that the official vLLM website says they support quantized LLMs with PEFT; however, I wasn’t able to find any tutorial on how to modify the existing Trainer. :sweat_smile:
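
For reference, this is roughly what the vLLM docs describe on the inference side (a minimal sketch, assuming a bitsandbytes-quantized base; the model id and adapter path are illustrative, and it does not cover the Trainer integration):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Quantized base model with a LoRA adapter applied at inference time
llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",  # pre-quantized checkpoint (illustrative)
    quantization="bitsandbytes",
    enable_lora=True,
)
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=16),
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora"),  # hypothetical adapter path
)
print(outputs[0].outputs[0].text)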

Even if not with GRPOTrainer, any tutorial on how to make SFTTrainer work would also help, please.

2 Likes

I can’t find a tutorial or reference either… :thinking:
Is it possible that the only way to train is with Transformers or other libraries…?
If you’re looking for speed, you could check out Unsloth’s training tools.

Unfortunately, Unsloth does not seem to offer DDP support in its free version. :melting_face:

1 Like

Since I have stumbled upon this post twice now, I wanted to share another source I found. From the vLLM documentation, it seems this should be possible now.

However, I am also still figuring this out, so I would love to see other reference implementations :slight_smile:
From what I gather, you can either 1) use a HF model name for an existing model that is already quantized (e.g. model_id = "unsloth/tinyllama-bnb-4bit"), or 2) update the vLLM package and pass the quantization method as a parameter, though I think you will have to modify your TRL package a bit to support this argument.

import torch
from vllm import LLM  # missing import in the original snippet

model_id = "huggyllama/llama-7b"

# Option 2: let vLLM quantize the full-precision checkpoint in-flight with bitsandbytes
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
)
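
For completeness, option 1 would look something like this (a sketch, assuming a recent vLLM; "unsloth/tinyllama-bnb-4bit" is the pre-quantized checkpoint mentioned above):

from vllm import LLM

# Option 1: point vLLM at a checkpoint that is already quantized with bitsandbytes
llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",
    quantization="bitsandbytes",
)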
1 Like