This seems to be the newer Transformers' intended behavior rather than a bug. For smaller models, I think the simplest approach is de-quantization → fine-tuning → quantization (if needed).
You're running into a deliberate new restriction in recent transformers versions, not a bug in your code.
At a high level:
- In transformers>=4.38 (including 4.41.0), the Trainer refuses to train any model that is marked as quantized (is_quantized=True) unless it's wrapped as a PEFT/adapter model. (GitHub)
- This check fires before looking at which parameters have requires_grad=True, so even "only my tiny projector is trainable, the backbone is frozen" is blocked.
- The reason is conceptual: quantized weights are treated as read-only, inference-only parameters; official guidance is "don't train them directly; only train extra modules on top (LoRA/adapters/head), or train in full precision and quantize afterward." (Hugging Face)
So the behavior you had in 4.31.0 was "allowed by omission"; the new versions added an explicit guard.
Below:
- What exactly changed and why you get this error now.
- How to train your custom linear projector without LoRA in a supported way.
- How to keep a quantized backbone and still train the projector (manual loop).
- Unsafe / hacky workarounds (if you absolutely need old behavior).
- Pointers to good references.
1. Cause: new Trainer guard for quantized models
1.1 What the error really means
The error text comes from Trainer.__init__ in transformers:
"You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details" (GitHub)
Internally, recent Trainer code does roughly this:(GitHub)
# simplified sketch of what happens inside Trainer.__init__
def _is_peft_model(model):
    # True if model is a PEFT PeftModel
    ...

_is_quantized_and_base_model = getattr(model, "is_quantized", False) and not getattr(
    model, "_hf_peft_config_loaded", False
)

# model already loaded here
if _is_quantized_and_base_model and not _is_peft_model(model):
    raise ValueError(
        "You cannot perform fine-tuning on purely quantized models. "
        "Please attach trainable adapters on top of the quantized model..."
    )
Key points:
- model.is_quantized == True is set when you load with a quantization config (e.g. bitsandbytes 4-bit/8-bit, AWQ, etc.). (Hugging Face)
- _hf_peft_config_loaded is set when the model has PEFT adapters attached (e.g. LoRA or other PEFT methods). (GitHub)
- _is_peft_model(model) checks whether this is actually a PeftModel.
If the model is quantized and not a PEFT model, Trainer raises the ValueError before it cares which parameters are frozen.
That's why "frozen quantized backbone + tiny trainable projector" still hits the error.
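If you want to see the guard's inputs directly, you can inspect these flags yourself. A minimal check, assuming a bitsandbytes 4-bit load; the model name is a placeholder:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

m = AutoModelForCausalLM.from_pretrained(
    "your-llm-here",  # placeholder
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
# the two attributes the Trainer guard looks at
print(getattr(m, "is_quantized", False))            # True  -> loaded with a quantization config
print(getattr(m, "_hf_peft_config_loaded", False))  # False -> no PEFT adapters attached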
1.2 Why this didn't happen in 4.31.0
- In 4.31.0, this quantization guard either didn't exist or was much weaker.
- People could do exactly what you did: load a model with load_in_4bit / quantization_config, freeze everything except a small head or projector, and call Trainer. It "worked", but it was never a formally supported path.
- Starting around late 2023, the HF team explicitly documented that pure 4-bit/8-bit training is not supported; only PEFT adapters on top of quantized weights are supported:
  - "It is not possible to perform pure 4bit training on these models… you can train these models by leveraging parameter efficient fine tuning methods (PEFT) and train adapters on top of them." (Hugging Face)
  - "After a model is quantized it isn't typically further trained… But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top." (Hugging Face)
So:
- Old version: no guard → your setup ran.
- New version: guard added → the same code now raises ValueError.
Your code is not "wrong"; the library policy changed.
2. Supported solution 1: train projector with an unquantized backbone, then quantize
This is the simplest way to get full gradient training of your tiny projector while staying fully supported and still freezing the LLM.
Idea:
- During training
  - Load the LLM without quantization (fp16/bf16).
  - Freeze all LLM parameters.
  - Attach your projector (linear layer).
  - Use Trainer normally → only projector parameters have gradients.
- After training
  - Quantize the trained model for inference (backbone + projector) using bitsandbytes / another quantizer.
This completely avoids the "purely quantized model" check, because the model isn't quantized while training.
2.1 Sketch code (no LoRA, only your projector)
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_name = "your-llm-here"

# 1. load in bf16/fp16 (not quantized)
backbone = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
    device_map="auto",
)

# 2. freeze backbone
for p in backbone.parameters():
    p.requires_grad = False

# 3. add your projector (full precision; its params have requires_grad=True by default)
hidden_size = backbone.config.hidden_size
proj_dim = 256  # example
backbone.projector = nn.Linear(hidden_size, proj_dim).to(backbone.device)

# 4. define a wrapper that uses the projector in forward
class BackboneWithProjector(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask=None, labels=None):
        # example: use the last hidden state of the last token as features
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[-1]                    # [batch, seq, hidden]
        features = hidden[:, -1, :]                           # [batch, hidden]
        logits = self.model.projector(features.float())       # [batch, proj_dim]
        # Trainer expects the model to return a loss when labels are provided;
        # swap in your task-specific loss here
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}

model = BackboneWithProjector(backbone)

# 5. training as usual with Trainer
training_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # match the backbone dtype (use fp16=True if you loaded in float16)
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your dataset
    eval_dataset=eval_dataset,
)
trainer.train()
Important points:
- The LLM weights are frozen, so VRAM cost is mostly forward activations, not optimizer states for billions of params.
- Only the projector has gradients; this is exactly "full fine-tuning of your tiny module".
- No quantization is involved during training → no is_quantized flag → no error.
After training finishes, you can quantize the backbone for inference following the standard quantization docs.(Hugging Face)
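For example, one possible post-training flow, continuing the sketch above (the paths are placeholders; reloading will warn about the extra projector.* keys, which you re-attach manually):

# save the trained pieces: the (still full-precision) backbone and the projector weights
backbone.save_pretrained("./out/backbone")
torch.save(backbone.projector.state_dict(), "./out/projector.pt")

# reload the backbone quantized for inference and re-attach the projector
from transformers import BitsAndBytesConfig

quantized = AutoModelForCausalLM.from_pretrained(
    "./out/backbone",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
# cast the projector to the compute dtype so it matches the bf16 hidden states at inference
quantized.projector = nn.Linear(hidden_size, proj_dim).to(quantized.device, dtype=torch.bfloat16)
quantized.projector.load_state_dict(torch.load("./out/projector.pt"))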
This is the cleanest and most future-proof solution if you can afford to train with an unquantized backbone (bf16/fp16).
3. Supported solution 2: keep backbone quantized, but train projector via a manual loop
If you really want the memory savings of a quantized backbone during training and still do not want LoRA/adapters, the supported direction is:
- Treat the quantized LLM as a frozen feature extractor.
- Run your own PyTorch training loop that only optimizes your projector.
- Do not use Trainer, because Trainer is where the "purely quantized model" check lives. (Hugging Face Forums)
Conceptually, this is in the same spirit as QLoRA, a frozen quantized base with small full-precision trainable weights, except that you write the loop yourself and your extra weights are your projector instead of LoRA matrices. Since the projector sits after the backbone, gradients never need to flow into the quantized weights at all. (Hugging Face)
3.1 Sketch code (quantized backbone + projector, no Trainer)
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "your-llm-here"

# 1. load quantized backbone
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
backbone = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

# 2. freeze backbone
for p in backbone.parameters():
    p.requires_grad = False
device = next(backbone.parameters()).device

# 3. projector in full precision
hidden_size = backbone.config.hidden_size
proj_dim = 256  # example
projector = nn.Linear(hidden_size, proj_dim).to(device)

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # or any loss you need
num_epochs = 3                   # example

backbone.eval()  # backbone as feature extractor

for epoch in range(num_epochs):
    for batch in train_dataloader:  # your DataLoader
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # no gradients through the quantized backbone
        with torch.no_grad():
            outputs = backbone(
                input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True,
            )
        hidden = outputs.hidden_states[-1]      # [B, T, H]
        features = hidden[:, -1, :]             # [B, H] (example pooling)

        logits = projector(features.float())    # [B, proj_dim]; cast to the projector dtype
        loss = loss_fn(logits, labels)

        optimizer.zero_grad()
        loss.backward()                         # gradients only in the projector
        optimizer.step()
Again, this gives you:
- Full gradient updates on all parameters of the projector.
- Quantized backbone used in inference mode only; no gradients through its weights.
- No Trainer, and thus no ValueError.
Caveats:
- This is not the officially "blessed" training pattern for quantized models, because HF documentation strongly encourages PEFT adapters for fine-tuning quantized LLMs. (Hugging Face)
- You must handle logging, evaluation, checkpointing, and distributed training yourself.
But if the constraint is "no LoRA, but I'm OK writing my own loop", this fits that requirement.
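For example, checkpointing in the manual loop reduces to saving your projector and optimizer state yourself (a sketch; the path and naming are arbitrary, meant to sit at the end of each epoch in the loop above):

import os

os.makedirs("./ckpt", exist_ok=True)
torch.save(
    {
        "epoch": epoch,
        "projector": projector.state_dict(),
        "optimizer": optimizer.state_dict(),
    },
    f"./ckpt/projector_epoch{epoch}.pt",
)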
4. PEFT / LoRA context (why the error mentions adapters at all)
Even though you don't want LoRA, it's useful to understand why the library keeps talking about "adapters".
HF's recommended pattern for quantized training is: (Hugging Face)
- Quantize the large base model (e.g. 4-bit or 8-bit with bitsandbytes).
- Keep the quantized base frozen.
- Add small PEFT adapters (LoRA or other methods) on top.
- Train only those adapters (millions of parameters instead of billions).
- Use Trainer or SFTTrainer on the resulting PeftModel.
This pattern:
- Respects that quantized weights are difficult to update stably.
- Keeps memory low but training stable (adapters are full precision).
- Is exactly what QLoRA does: "backprop through a frozen, 4-bit quantized pretrained language model into LoRA adapters." (Hugging Face)
The Trainer guard you hit is simply enforcing: if is_quantized=True and this is not a PEFT model, then stop.
So:
- If you ever relax your "no LoRA" requirement, the most supported way to get "tiny trainable module + quantized backbone + Trainer" is to wrap your projector (or similar) as an adapter with PEFT, so that Trainer sees a PeftModel (see the sketch below). (Hugging Face)
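Below is a minimal sketch of the standard version of that pattern: plain LoRA adapters via PEFT on a quantized base, not your custom projector. The target modules and hyperparameters are assumptions for a typical decoder-only LLM:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "your-llm-here",  # placeholder
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard pre-processing for k-bit training

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: Llama-style attention projections
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_cfg)
# peft_model is a PeftModel, so the Trainer guard no longer fires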
5. Hacky / unsafe workarounds (not recommended, but common)
People sometimes bypass the guard when they're sure they're only training new full-precision modules and not touching quantized base weights. Typical hacks seen in various issues and projects (fossies.org):
- Manually flip flags on the model before passing it to Trainer:
  model.is_quantized = False
  or
  setattr(model, "_hf_peft_config_loaded", True)
  This tricks _is_quantized_and_base_model into being false.
- Wrap the quantized model inside your own nn.Module shell that does not expose is_quantized at all, and pass that wrapper to Trainer (see the sketch after this list).
- Pin Transformers to ≤ 4.31.0 or another pre-guard version so the error never fires. (CSDN)
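A rough sketch of the wrapper-shell variant, under the same "only the projector ever gets gradients" assumption (still an unsupported workaround, not an officially blessed pattern):

import torch.nn as nn

class QuantizedShell(nn.Module):
    """Hides the quantized backbone behind a plain nn.Module, so the Trainer
    guard never sees an is_quantized attribute on the top-level model."""

    def __init__(self, quantized_backbone, projector):
        super().__init__()
        self.backbone = quantized_backbone   # frozen, quantized
        self.projector = projector           # full precision, trainable

    def forward(self, input_ids, attention_mask=None, labels=None):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        features = out.hidden_states[-1][:, -1, :]   # example pooling
        logits = self.projector(features.float())    # cast to the projector dtype
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}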
These approaches:
- Can work in practice.
- Are brittle: future releases may add different checks or rely more on is_quantized.
- Put you outside what HF considers supported behavior (especially if any gradients ever touch quantized weights).
If you use them, make absolutely sure:
- Only your projector's parameters have requires_grad=True.
- You have sanity checks (loss curves, validation metrics) to make sure training behaves sensibly.
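A quick sanity check for the first point (the name filter "projector" is an assumption about how your module is named):

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
assert all("projector" in name for name in trainable), "non-projector parameters are trainable!"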
6. Short reading list (good references with context)
These links reinforce the explanations above and are good to skim:
6.1 Exact error + Trainer behavior
- Transformers GitHub issue – "Issue with Fine-tuning LLM for Classification"
  Shows the exact same error when trying to fine-tune a quantized Mistral model with Trainer; includes the code path in trainer.py that raises it. (GitHub)
- HF forum – "Resolving 'Cannot Perform Fine-Tuning on Purely Quantized Models'"
  Reproduces the error with Falcon, and the answer says clearly: you must attach adapters (PEFT/LoRA) or avoid fine-tuning a purely quantized model. (Hugging Face Forums)
- Stack Overflow – "ValueError: You cannot perform fine-tuning on purely quantized models"
  Q&A that explains the guard and suggests PEFT or custom loops as alternatives. (Stack Overflow)
6.2 Official docs on quantization + training
- HF blog – 4-bit transformers with bitsandbytes and QLoRA
  Explicitly: "It is not possible to perform pure 4bit training on these models… you can train these models by leveraging parameter efficient fine tuning methods (PEFT)." (Hugging Face)
- PEFT quantization guide
  Explains that quantized models are not usually further trained; PEFT lets you train extra parameters on top of a frozen quantized base. (Hugging Face)
- Accelerate quantization guide – fine-tune a quantized model
  Repeats the same point for 4-bit and 8-bit: pure training is not supported; use PEFT adapters. (Hugging Face)
6.3 Broader context on QLoRA / PEFT
- Overview of quantization schemes in Transformers
  Summarizes bitsandbytes, GPTQ, AWQ, etc., and notes the limitations of training on quantized weights. (Hugging Face)
- QLoRA / 4-bit fine-tuning tutorials and repos (e.g., the official QLoRA repo)
  Show the canonical "frozen 4-bit base + LoRA adapters" training loop. (Hugging Face)
Final recap
- The cause of your error is the new Trainer guard that forbids training any quantized base model unless it is wrapped as a PEFT model. It doesn't care that only your projector has gradients.
- To get full gradient training of your custom projector without LoRA:
  - Easiest: train with an unquantized (bf16/fp16) backbone, frozen, using Trainer; then quantize for inference.
  - If you must keep the backbone quantized during training: write a manual PyTorch training loop that uses the quantized LLM as a frozen feature extractor and only updates your projector.
- LoRA/PEFT is the officially endorsed way to fine-tune on quantized models; your non-LoRA setup is slightly off the main road, so you either stay unquantized while training or bypass Trainer for the projector-only optimization.