Fine-tuning a custom module without LoRA

I have a custom module (a linear layer as a projector).

In an older version of transformers (4.31.0), I could fine-tune this module and freeze the backbone LLM.

However, after updating transformers to a newer version (4.41.0), I am no longer allowed to do that, and the program raises this error:

ValueError: You cannot perform fine-tuning on purely quantized models.

I do not need to use LoRA to train such a tiny module. How can I do full fine-tuning on my tiny module?

Thanks


This seems to be the newer Transformers behavior. For smaller models, I think the simplest approach is de-quantization → fine-tuning → quantization (if needed).


You’re running into a deliberate new restriction in recent transformers versions, not a bug in your code.

At a high level:

  • In transformers>=4.38 (including 4.41.0), the Trainer refuses to train any model that is marked as quantized (is_quantized=True) unless it’s wrapped as a PEFT/adapter model.(GitHub)
  • This check fires before looking at which parameters have requires_grad=True, so even “only my tiny projector is trainable, the backbone is frozen” is blocked.
  • The reason is conceptual: quantized weights are treated as read-only, inference-only parameters; official guidance is “don’t train them directly, only train extra modules on top (LoRA/adapters/head), or train in full precision and quantize afterward.”(Hugging Face)

So the behavior you had in 4.31.0 was “allowed by omission”; the new versions added an explicit guard.

Below:

  1. What exactly changed and why you get this error now.
  2. How to train your custom linear projector without LoRA in a supported way.
  3. How to keep a quantized backbone and still train the projector (manual loop).
  4. Unsafe / hacky workarounds (if you absolutely need old behavior).
  5. Pointers to good references.

1. Cause: new Trainer guard for quantized models

1.1 What the error really means

The error text comes from Trainer.__init__ in transformers:

“You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details”(GitHub)

Internally, recent Trainer code does roughly this:(GitHub)

# simplified sketch of what happens inside Trainer.__init__
def _is_peft_model(model):
    # True if model is a PEFT PeftModel
    ...

_is_quantized_and_base_model = getattr(model, "is_quantized", False) and not getattr(
    model, "_hf_peft_config_loaded", False
)

# model already loaded here
if _is_quantized_and_base_model and not _is_peft_model(model):
    raise ValueError(
        "You cannot perform fine-tuning on purely quantized models. "
        "Please attach trainable adapters on top of the quantized model..."
    )

Key points:

  • model.is_quantized == True is set when you load with a quantization config (e.g. bitsandbytes 4-bit/8-bit, AWQ, etc.).(Hugging Face)
  • _hf_peft_config_loaded is set when the model has PEFT adapters attached (e.g. LoRA, other PEFT methods).(GitHub)
  • _is_peft_model(model) checks if this is actually a PeftModel.

If the model is quantized and not a PEFT model, Trainer raises the ValueError before it cares which parameters are frozen.

That’s why “frozen quantized backbone + tiny trainable projector” still hits the error.
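If you want to confirm which branch of that guard your own setup hits, you can print the exact attributes it inspects. A small diagnostic sketch; model here stands for whatever model object you would pass to Trainer:

# attributes Trainer's guard reads (same names as in the sketch above)
print("is_quantized:", getattr(model, "is_quantized", False))
print("_hf_peft_config_loaded:", getattr(model, "_hf_peft_config_loaded", False))
# if the first is True, the second is False, and the model is not a PeftModel,
# Trainer.__init__ raises the ValueError before ever looking at requires_grad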

1.2 Why this didn’t happen in 4.31.0

  • In 4.31.0, this quantization guard either didn’t exist or was much weaker.

  • People could do exactly what you did: load a model with load_in_4bit / quantization_config, freeze everything except a small head or projector, and call Trainer. It “worked”, but it was never a formally supported path.

  • Starting around late 2023, the HF team explicitly documented that pure 4-bit/8-bit training is not supported, only PEFT adapters on top of quantized weights are supported:

    • “It is not possible to perform pure 4bit training on these models… you can train these models by leveraging parameter efficient fine tuning methods (PEFT) and train adapters on top of them.”(Hugging Face)
    • “After a model is quantized it isn’t typically further trained… But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top.”(Hugging Face)

So:

  • Old version: no guard → your setup ran.
  • New version: guard added → same code now raises ValueError.

Your code is not “wrong”; the library policy changed.


2. Supported solution 1: train projector with an unquantized backbone, then quantize

This is the simplest way to get full gradient training of your tiny projector while staying fully supported and still freezing the LLM.

Idea:

  1. During training

    • Load the LLM without quantization (fp16/bf16).
    • Freeze all LLM parameters.
    • Attach your projector (linear layer).
    • Use Trainer normally – only projector parameters have gradients.
  2. After training

    • Quantize the trained model for inference (backbone + projector) using bitsandbytes / another quantizer.

This completely avoids the “purely quantized model” check, because the model isn’t quantized while training.

2.1 Sketch code (no LoRA, only your projector)

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer

model_name = "your-llm-here"

# 1. load in bf16/fp16 (not quantized)
backbone = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # or torch.float16 / torch.float32
    device_map="auto",
)

# 2. freeze backbone
for p in backbone.parameters():
    p.requires_grad = False

# 3. add your projector
hidden_size = backbone.config.hidden_size
proj_dim = 256  # example

# keep the projector in fp32 for stable optimization, but put it on the backbone's device
backbone.projector = nn.Linear(hidden_size, proj_dim).to(backbone.device)
# by default, the new projector params have requires_grad=True

# 4. define a wrapper that uses projector in forward
class BackboneWithProjector(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask=None, labels=None):
        # example: take the last token's final hidden state as features
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[-1]          # [batch, seq, hidden]
        features = hidden[:, -1, :].float()         # [batch, hidden], cast to the fp32 projector
        logits = self.model.projector(features)     # [batch, proj_dim]

        # Trainer expects a "loss" entry in the output dict (or a compute_loss override),
        # so plug in your task-specific loss here; cross-entropy is just an example
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}

# device_map="auto" has already placed the backbone, so no extra .to("cuda") here
model = BackboneWithProjector(backbone)

# 5. training as usual with Trainer
training_args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # match the bfloat16 backbone; use fp16=True if you loaded in float16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # your dataset
    eval_dataset=eval_dataset,
)

trainer.train()

Important points:

  • The LLM weights are frozen, so VRAM cost is mostly forward activations, not optimizer states for billions of params.
  • Only the projector has gradients; this is exactly “full fine-tuning of your tiny module” (see the sanity check below).
  • No quantization is involved during training → no is_quantized flag → no error.
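A quick sanity check for the second point, run on the wrapped model from the sketch above before calling trainer.train():

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # should list only model.projector.weight / model.projector.bias
print("trainable params:", sum(p.numel() for p in model.parameters() if p.requires_grad))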

After training finishes, you can quantize the backbone for inference following the standard quantization docs.(Hugging Face)
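One way to do that last step, as a sketch: save the trained projector, reload the base weights with on-the-fly bitsandbytes quantization, and reattach the projector. File name and quantization settings are placeholders, and it assumes the backbone fits on one device:

from transformers import BitsAndBytesConfig

# 1. keep the trained projector (it stays in higher precision)
torch.save(backbone.projector.state_dict(), "projector.pt")

# 2. reload the frozen backbone quantized for inference
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized_backbone = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

# 3. reattach the projector and reuse the same wrapper for inference
projector = nn.Linear(hidden_size, proj_dim).to(quantized_backbone.device)
projector.load_state_dict(torch.load("projector.pt", map_location="cpu"))
quantized_backbone.projector = projector
inference_model = BackboneWithProjector(quantized_backbone)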

This is the cleanest and most future-proof solution if you can afford to train with an unquantized backbone (bf16/fp16).


3. Supported solution 2: keep backbone quantized, but train projector via a manual loop

If you really want the memory savings of a quantized backbone during training and still do not want LoRA/adapters, the supported direction is:

  • Treat the quantized LLM as a frozen feature extractor.
  • Run your own PyTorch training loop that only optimizes your projector.
  • Do not use Trainer, because Trainer is where the “purely quantized model” check lives.(Hugging Face Forums)

Conceptually, this is similar to how QLoRA works: gradients go through a frozen quantized base into small extra weights, except you’re writing the loop yourself and your extra weights are your projector instead of LoRA matrices.(Hugging Face)

3.1 Sketch code (quantized backbone + projector, no Trainer)

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "your-llm-here"

# 1. load quantized backbone
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
backbone = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

# 2. freeze backbone
for p in backbone.parameters():
    p.requires_grad = False

device = next(backbone.parameters()).device

# 3. projector in full precision
hidden_size = backbone.config.hidden_size
proj_dim = 256  # example
projector = nn.Linear(hidden_size, proj_dim).to(device)

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # or any loss you need

backbone.eval()  # backbone as feature extractor

for epoch in range(num_epochs):              # num_epochs / train_dataloader: your own setup
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        with torch.no_grad():
            outputs = backbone(
                input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True,
            )
            hidden = outputs.hidden_states[-1]       # [B, T, H]
            features = hidden[:, -1, :]              # [B, H] (example pooling)

        logits = projector(features.float())        # [B, proj_dim]; cast bf16 features to the fp32 projector
        loss = loss_fn(logits, labels)

        optimizer.zero_grad()
        loss.backward()         # gradients only in projector
        optimizer.step()

Again, this gives you:

  • Full gradient updates on all parameters of the projector.
  • Quantized backbone used in inference mode only; no gradients through its weights.
  • No Trainer and thus no ValueError.

Caveats:

  • This is not the officially “blessed” training pattern for quantized models, because HF documentation strongly encourages PEFT adapters for fine-tuning quantized LLMs.(Hugging Face)
  • You must handle logging, evaluation, checkpointing, and distributed training yourself (a minimal checkpointing sketch follows below).

But if the constraint is “no LoRA, but I’m OK writing my own loop”, this fits that requirement.
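Since checkpointing is on you in this setup, saving only the projector's weights is usually enough, because the quantized backbone is never modified. A small sketch using the objects defined above; the file name is a placeholder:

# save only the trained projector; the quantized backbone is just reloaded the same way next time
torch.save(projector.state_dict(), "projector.pt")

# later, for resuming or inference:
projector.load_state_dict(torch.load("projector.pt", map_location=device))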


4. PEFT / LoRA context (why the error mentions adapters at all)

Even though you don’t want LoRA, it’s useful to understand why the library keeps talking about “adapters”.

HF’s recommended pattern for quantized training is:(Hugging Face)

  1. Quantize the large base model (e.g. 4-bit or 8-bit with bitsandbytes).
  2. Keep the quantized base frozen.
  3. Add small PEFT adapters (LoRA or other methods) on top.
  4. Train only those adapters (millions of parameters instead of billions).
  5. Use Trainer or SFTTrainer on the resulting PeftModel.

This pattern:

  • Respects that quantized weights are difficult to update stably.
  • Keeps memory low but training stable (adapters are full precision).
  • Is exactly what QLoRA does: “backprop through a frozen, 4-bit quantized pretrained language model into LoRA adapters.”(Hugging Face)

The Trainer guard you hit is simply enforcing: if is_quantized=True and this is not a PEFT model, then stop.

So:

  • If you ever relax your “no LoRA” requirement, the most supported way to get “tiny trainable module + quantized backbone + Trainer” is to wrap your projector (or similar) as an adapter with PEFT, so that Trainer sees a PeftModel.(Hugging Face)
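For completeness, that supported path would look roughly like this, using the quantized backbone from section 3.1. A sketch only, in case you ever relax the “no LoRA” constraint; r, lora_alpha, and target_modules are placeholders that depend on your architecture:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = prepare_model_for_kbit_training(backbone)  # standard prep step for a k-bit quantized base

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # module names depend on the model
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()
# Trainer now sees a PeftModel, so the "purely quantized models" guard no longer fires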

5. Hacky / unsafe workarounds (not recommended, but common)

People sometimes bypass the guard when they’re sure they’re only training new full-precision modules and not touching quantized base weights. Typical hacks (seen in various issues and projects): (fossies.org)

  1. Manually flip flags on the model before passing it to Trainer:

    • model.is_quantized = False
    • or setattr(model, "_hf_peft_config_loaded", True)

    This tricks _is_quantized_and_base_model into being false.

  2. Wrap the quantized model inside your own nn.Module shell that does not expose is_quantized at all, and pass that wrapper to Trainer.

  3. Pin Transformers to ≤4.31.0 or another pre-guard version so the error never fires.(CSDN)

These approaches:

  • Can work in practice.
  • Are brittle: future releases may add different checks or rely more on is_quantized.
  • Put you outside what HF considers supported behavior (especially if any gradients ever touch quantized weights).

If you use them, make absolutely sure:

  • Only your projector’s parameters have requires_grad=True.
  • You have sanity checks (loss curves, validation metrics) to make sure training behaves sensibly.

6. Short reading list (good references with context)

These links reinforce the explanations above and are good to skim:

6.1 Exact error + Trainer behavior

  • Transformers GitHub issue – “Issue with Fine-tuning LLM for Classification”
    Shows the exact same error when trying to fine-tune a quantized Mistral model with Trainer; includes the code path in trainer.py that raises it.(GitHub)

  • HF forum – “Resolving ‘Cannot Perform Fine-Tuning on Purely Quantized Models’”
    Reproduces the error with Falcon, and the answer says clearly: you must attach adapters (PEFT/LoRA) or avoid fine-tuning a purely quantized model.(Hugging Face Forums)

  • StackOverflow – “ValueError: You cannot perform fine-tuning on purely quantized models”
    Q&A that explains the guard and suggests PEFT or custom loops as alternatives.(Stack Overflow)

6.2 Official docs on quantization + training

  • HF blog – 4-bit transformers with bitsandbytes and QLoRA
    Explicitly: “It is not possible to perform pure 4bit training on these models… you can train these models by leveraging parameter efficient fine tuning methods (PEFT).”(Hugging Face)

  • PEFT quantization guide
    Explains that quantized models are not usually further trained; PEFT lets you train extra parameters on top of a frozen quantized base.(Hugging Face)

  • Accelerate quantization guide – fine-tune a quantized model
    Repeats the same point for 4-bit and 8-bit: pure training is not supported; use PEFT adapters.(Hugging Face)

6.3 Broader context on QLoRA / PEFT

  • Overview of quantization schemes in Transformers
    Summarizes bitsandbytes, GPTQ, AWQ, etc., and notes limitations of training on quantized weights.(Hugging Face)

  • QLoRA / 4-bit fine-tuning tutorials and repos (e.g., official QLoRA repo)
    Show the canonical “frozen 4-bit base + LoRA adapters” training loop.(Hugging Face)


Final recap

  • The cause of your error is the new Trainer guard that forbids training any quantized base model unless it is wrapped as a PEFT model. It doesn’t care that only your projector has gradients.

  • To get full gradient training of your custom projector without LoRA:

    • Easiest: train with an unquantized (bf16/fp16) backbone, frozen, using Trainer; then quantize for inference.
    • If you must keep the backbone quantized during training: write a manual PyTorch training loop that uses the quantized LLM as a frozen feature extractor and only updates your projector.
  • LoRA/PEFT is the officially endorsed way to fine-tune on quantized models; your non-LoRA setup is slightly off the main road, so you either stay unquantized while training or bypass Trainer for the projector-only optimization.