How to use text only model -> [mistralai/Ministral-3-3B-Instruct-2512]

I need to use this model text-only, to fine-tune it for a domain-specific text generation task. Can anyone help me? I don't want the vision encoder.

1 Like

Unless you want to completely remove the vision encoder from the model, it’s not that difficult.


You can fine-tune mistralai/Ministral-3-3B-Instruct-2512 for text-only generation without using the vision encoder at all. You do it by (1) never passing image inputs and (2) freezing vision-side parameters so nothing in the vision path trains.

The important background is: this checkpoint is multimodal by design. It is a ~3.4B language model + 0.4B vision encoder. That is in the official model card. (Hugging Face)


What “I don’t want the vision encoder” can mean

Meaning A (recommended): “I will never use images”

Do this:

  • Provide text-only prompts.
  • Do text-only loss during fine-tuning.
  • Freeze vision weights so they are inert.

This gives you a normal text generator behavior while staying on the official checkpoint. (Hugging Face)

Meaning B (harder): “I want the vision weights removed to save memory”

That is not an official distribution format for the original repo. It typically means:

  • Converting the checkpoint to a different architecture layout.
  • Editing config and weights.
  • Accepting that conversion can introduce differences.

There are community conversions that claim “vision encoder removed” (example: a “TextOnly” Llama-format conversion). Treat those as third-party artifacts and validate carefully. (Hugging Face)


The minimum working software stack (this matters a lot)

Ministral 3 support relies on newer Transformers and Mistral’s tokenizer library (mistral-common). The official HF model card explicitly tells you to install Transformers from main for FP8 and to install mistral-common >= 1.8.6 for correct tokenization. (Hugging Face)

If you use a stable older Transformers build, you will hit import and model-type errors. This is a very common failure mode. (Stack Overflow)

Recommended: train from BF16 weights

Use the BF16 checkpoint for fine-tuning. It is the same model family but avoids FP8 complexity. The official BF16 model card describes BF16 VRAM expectations and still includes the vision encoder as a component. (Hugging Face)


Step 1. Install (text-only fine-tuning friendly)

Use one of these patterns (pick one, do not mix randomly):

Option 1: Transformers v5 RC (often simplest)

Some Ministral family cards recommend the first v5 RC or main for Transformers and mistral-common >= 1.8.6. (Hugging Face)

Option 2: Transformers from main (needed for FP8 workflows)

The Instruct-2512 card specifically mentions installing Transformers from main for FP8 support and using mistral-common >= 1.8.6. (Hugging Face)

Practical note: if your environment cannot import Mistral3ForConditionalGeneration / MistralCommonBackend, you are almost always on the wrong Transformers build. (Stack Overflow)
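A quick way to check your environment before debugging anything else (stdlib only; this sketch just probes whether the packages and the Ministral-3 class are importable):

```python
import importlib.util

def installed(module_name: str) -> bool:
    """Return True if a module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Ministral-3 support needs both of these importable:
for mod in ("transformers", "mistral_common"):
    status = "OK" if installed(mod) else "MISSING -> pip install " + mod
    print(f"{mod}: {status}")

# If transformers is present but the Ministral3 class is not,
# you are almost certainly on a pre-v5 build:
if installed("transformers"):
    import transformers
    print("Mistral3 class available:",
          hasattr(transformers, "Mistral3ForConditionalGeneration"))
```

If the last line prints `False`, upgrade per the model card before touching anything else.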


Step 2. Text-only inference (no images, no vision encoder usage)

Transformers’ own Ministral3 docs show usage with Mistral3ForConditionalGeneration and MistralCommonBackend. (Hugging Face)
You just remove the image inputs and keep the chat template.

# deps (conceptually):
# - transformers v5 RC/main
# - mistral-common >= 1.8.6
# - torch, accelerate

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Text-only chat. No images. No pixel_values.
messages = [
    {"role": "system", "content": "You write domain-specific text in the required style."},
    {"role": "user", "content": "Write a domain-style explanation of <TOPIC> with 3 bullet takeaways."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)

print(tokenizer.decode(out[0], skip_special_tokens=True))

Why mistral-common matters

Mistral models are trained with Mistral’s tokenization rules. There have been real-world mismatches between mistral_common and the generic tokenizers backend that can change token IDs for edge cases (escaped strings etc.). That is why the model card tells you to install mistral-common and why this mismatch was filed as a Transformers bug. (Hugging Face)


Step 3. Make fine-tuning “language-only” in practice

Goal

  • Update only language behavior for your domain generation.
  • Keep vision components frozen and unused.

Two controls you should use

  1. Input discipline: never put images in your training examples.
  2. Parameter discipline: freeze vision parameters.

There is a subtle pitfall: in multimodal models, vision modules can also contain layer names like q_proj/k_proj/v_proj. So “LoRA target_modules” alone does not guarantee “LM-only LoRA.” A recent HF forum thread calls this out explicitly. (Hugging Face Forums)
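To see the pitfall concretely, here is a small stand-alone sketch (plain Python; the module paths are hypothetical but mimic a Mistral3-style layout) showing how leaf-name targeting like q_proj matches vision modules too:

```python
def lora_attachment_audit(module_names, target_leaves):
    """Return (all matches, vision-side matches) for leaf-name LoRA targeting."""
    hits = [n for n in module_names if n.rsplit(".", 1)[-1] in target_leaves]
    vision_hits = [n for n in hits if "vision" in n.lower()]
    return hits, vision_hits

# Hypothetical module paths resembling a Mistral3-style VLM:
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.v_proj",
    "model.vision_tower.transformer.layers.0.attention.q_proj",
]
hits, vision_hits = lora_attachment_audit(names, {"q_proj", "v_proj"})
print(vision_hits)  # the vision-tower q_proj would get an adapter too
```

With a real model, pass `[n for n, _ in model.named_modules()]` and your LoRA `target_modules` list to see exactly where adapters would attach.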

Freeze vision parameters (simple, robust)

def freeze_vision(model):
    for name, p in model.named_parameters():
        n = name.lower()
        if "vision" in n or "image" in n or "pixel" in n:
            p.requires_grad = False

freeze_vision(model)

This is crude but effective. After this, even if something in the vision path exists, it will not train.
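To confirm the freeze actually took, you can audit model.named_parameters(). The helper below only assumes the (name, param) interface, so it is sketched here with a stand-in parameter class; with a real model, pass model.named_parameters() directly:

```python
def trainable_summary(named_params):
    """Count trainable vs. frozen parameter elements from (name, param) pairs."""
    trainable = frozen = 0
    for _, p in named_params:
        if p.requires_grad:
            trainable += p.numel()
        else:
            frozen += p.numel()
    return trainable, frozen

# Stand-in parameter for demonstration (a real torch Parameter exposes
# the same .numel() / .requires_grad interface):
class FakeParam:
    def __init__(self, numel, requires_grad=True):
        self._n, self.requires_grad = numel, requires_grad
    def numel(self):
        return self._n

params = [
    ("language_model.layers.0.q_proj.weight", FakeParam(100)),
    ("vision_tower.layers.0.q_proj.weight", FakeParam(50, requires_grad=False)),
]
print(trainable_summary(params))  # (100, 50)
```

After `freeze_vision(model)`, the frozen count should cover all vision-side parameters.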


Step 4. Supervised fine-tuning (SFT) for domain text generation

For your use case (“domain-specific text generation”), the standard starting point is SFT: prompt → ideal completion pairs.

TRL’s SFTTrainer is the common “works-first” route. The official TRL docs show the basic pattern and explain that it can work with chat templates. (Hugging Face)

Dataset format you want

Store each row as either:

  • {"prompt": "...", "completion": "..."} (single turn), or
  • {"messages": [...]} (multi-turn chat)

If you want consistent style, put your style guide in the system message across the dataset.
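A minimal sketch of producing train.jsonl in the single-turn format (stdlib only; the rows are placeholders for your domain data):

```python
import json

rows = [
    {"prompt": "Summarize ticket #123 in the house style.",
     "completion": "Summary: ..."},
    {"prompt": "Draft a release note for feature X.",
     "completion": "Release note: ..."},
]

# One JSON object per line -- the format load_dataset("json", ...) expects.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

The recipe below then reads this file via `load_dataset("json", data_files={"train": "train.jsonl"})`.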

Minimal SFT + LoRA recipe (text-only)

# deps:
# pip install trl peft datasets accelerate
# plus transformers v5 RC/main + mistral-common

import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"
tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Freeze vision
def freeze_vision(model):
    for name, p in model.named_parameters():
        n = name.lower()
        if "vision" in n or "image" in n or "pixel" in n:
            p.requires_grad = False

freeze_vision(model)

# LoRA config (common target modules)
# TRL guidance explains typical LoRA params and target_modules choices.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

SYSTEM = "You are a domain-specific generator. Follow the domain style guide."

def formatting_func(example):
    # example has: prompt, completion
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return tokenizer.apply_chat_template(msgs, tokenize=False)

ds = load_dataset("json", data_files={"train": "train.jsonl"})["train"]

cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_seq_length=2048,
    packing=True,
    logging_steps=10,
    save_steps=200,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    peft_config=peft_config,
    args=cfg,
    formatting_func=formatting_func,
)

trainer.train()
trainer.save_model()

Why this structure

  • TRL’s SFTTrainer is built for this workflow. (Hugging Face)
  • PEFT LoRA reduces trainable params drastically. (Hugging Face)
  • Freezing vision avoids accidentally tuning vision blocks that share module names. (Hugging Face Forums)
  • Using the correct chat template avoids silent quality loss from format mismatch. (Mistral AI)

Step 5. If your GPU is small: QLoRA (4-bit) instead of BF16 LoRA

QLoRA means:

  • Load base model in 4-bit.
  • Train only LoRA adapters.

Hugging Face’s QLoRA overview is in the bitsandbytes 4-bit blog. (Hugging Face)
If you do 4-bit training, HF recommends NF4 for training 4-bit base models. (Hugging Face)
PEFT also documents the “quantize then train adapters” concept. (Hugging Face)

Minimal change (conceptually):

from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",   # NF4 is recommended for training 4-bit base models
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb,
)

Pitfalls that commonly waste time

Pitfall 1: Wrong Transformers version

Symptom: missing Mistral3ForConditionalGeneration or MistralCommonBackend.
Fix: follow the model card’s guidance to use v5 RC/main and install mistral-common >= 1.8.6. (Hugging Face)

Pitfall 2: Tokenization mismatch

Symptom: model “sort of works” but has weird failures on logs, escaped strings, or format-heavy inputs.
Fix: use the Mistral tokenizer backend (mistral-common) as recommended, and treat edge cases seriously. (GitHub)

Pitfall 3: “LM-only LoRA” accidentally hits vision

Symptom: you think you tuned only LM layers, but adapters attach to vision blocks too.
Fix: freeze vision, and do not rely only on target_modules. (Hugging Face Forums)

Pitfall 4: Serving differences (vLLM)

If you later serve with vLLM, do not assume you can pass a chat template per request for Mistral tokenizers. vLLM explicitly errors or warns about chat_template for Mistral tokenizers. (vLLM)
Practical fix: render the prompt text yourself (apply chat template client-side) before sending to vLLM.
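A sketch of that client-side pattern. The endpoint and payload fields follow the OpenAI-compatible /v1/completions convention that vLLM serves; the helper below just assembles the request body, and the chat-template rendering itself is left to your tokenizer:

```python
def completion_payload(rendered_prompt: str, model: str,
                       max_tokens: int = 256, temperature: float = 0.2) -> dict:
    """Build a /v1/completions request body from an already-templated prompt."""
    return {
        "model": model,
        "prompt": rendered_prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

# Render the chat template on the client, e.g. with the HF tokenizer:
#   text = tokenizer.apply_chat_template(messages, tokenize=False,
#                                        add_generation_prompt=True)
# then POST the payload to http://<vllm-host>:8000/v1/completions.
payload = completion_payload("<rendered prompt text>", "ministral-3-14b")
print(sorted(payload))
```

This way the server never needs a per-request chat_template, which is exactly what the Mistral tokenizer path in vLLM refuses.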


If you truly want “no vision encoder” at all

Your choices are:

  1. Stay official (recommended): ignore vision inputs, freeze vision weights. (Hugging Face)
  2. Use a third-party “TextOnly” conversion: higher compatibility, but it is not the original architecture and can differ. Example conversions exist. (Hugging Face)
  3. Pick a pure text-only base model (different model family). This is often simpler if you never need images.

Quick checklist for your case

  • Use mistralai/Ministral-3-3B-Instruct-2512-BF16 for training. (Hugging Face)
  • Install Transformers v5 RC/main and mistral-common >= 1.8.6. (Hugging Face)
  • Always format prompts with Mistral chat templates. (Mistral AI)
  • Freeze vision weights. Do text-only examples. (Hugging Face Forums)
  • Fine-tune with TRL SFTTrainer + PEFT LoRA, or QLoRA if VRAM is tight. (Hugging Face)

Summary bullets

  • The checkpoint includes a vision encoder. You cannot “turn it off” at the file level without conversion. (Hugging Face)
  • You can use it as text-only by never passing images and freezing vision parameters. (Hugging Face Forums)
  • Use Transformers v5 RC/main plus mistral-common >= 1.8.6 to avoid common setup and tokenization pitfalls. (Hugging Face)
  • Fine-tune with TRL SFTTrainer + PEFT LoRA, or QLoRA NF4 if VRAM is limited. (Hugging Face)

Thanks for the information.

1 Like

But when I load this text-only model, it shows an lm_head missing error:

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend, GenerationConfig

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Download complete.
Fetching 2 files: 100% 2/2 [00:00<00:00, 133.00it/s]
Loading weights: 100% 458/458 [00:21<00:00, 71.59it/s, Materializing param=model.vision_tower.transformer.layers.23.ffn_norm.weight]

Mistral3ForConditionalGeneration LOAD REPORT from: mistralai/Ministral-3-3B-Instruct-2512-BF16
Key            | Status  | 
---------------+---------+-
lm_head.weight | MISSING | 

Notes:
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
1 Like

Seems okay to ignore that error message?


Yes. Your lm_head.weight “MISSING” is usually expected for Ministral-3 3B. It is not a real load failure in most setups. It happens because the 3B variant uses tied embeddings, meaning the output head shares weights with the input token embedding, and checkpoints often do not store a separate lm_head.weight. (Hugging Face)

Below is what is going on, how to verify it, and how to run and fine-tune text-only without using vision.


1) First, this is not actually a “text-only model”

mistralai/Ministral-3-3B-Instruct-2512* is a vision-language model (VLM): a ~3.4B language model plus a ~0.4B vision encoder. (Hugging Face)

Key point:

  • Text-only usage is supported. You simply do not pass images. vLLM’s official recipe even has a “Text only request” section showing that the same served model can be used with only text messages. (vLLM)
  • But loading Mistral3ForConditionalGeneration will still load the vision tower weights unless you use a text-only-extracted checkpoint.

So you have two “text-only” meanings:

  1. Text-only inputs (keep vision weights loaded but unused). This is the official path.
  2. Text-only weights (do not load vision tower at all). This requires a text-only checkpoint or you extracting weights yourself.

2) Why lm_head.weight is “missing” for the 3B model

Ministral-3 3B is special:

  • It uses tied embeddings (“share the embedding and output layers”). (vLLM)
  • Model docs/notes explicitly say 3B has tied embeddings and “no output layer” to reduce weights. (Hugging Face)
  • The BF16 config for the model also indicates embedding tying in the text config (tie_word_embeddings: true). (Hugging Face)

So a checkpoint can legitimately omit lm_head.weight. This is the same pattern you see in other HF models with tied heads (classic example: T5). (GitHub)
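Conceptually, a tied head computes logits against the transpose of the input embedding matrix, so there is nothing extra to store. A tiny numeric illustration (plain Python, made-up values):

```python
# Embedding matrix E: vocab_size=3, hidden_size=2 (made-up values).
E = [[0.1, 0.2],
     [0.3, -0.1],
     [0.0, 0.5]]

h = [0.4, 0.6]  # final hidden state for one position

# Tied head: logits[i] = dot(h, E[i]) -- E doubles as the output projection,
# so no separate lm_head.weight exists in the checkpoint.
logits = [sum(h_j * e_j for h_j, e_j in zip(h, row)) for row in E]
print(logits)
```

When the loader later calls tie_weights(), it points the output head at this same matrix, which is why the "MISSING" entry is harmless here.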

Why the loader prints a scary message anyway

Some loading paths print “MISSING” before weight tying is applied, or they warn even when it is safe. This confusion is a known theme across models and tools when tied parameters are involved. (GitHub)


3) What you should do right now (verify + make it safe)

A. Treat it as a warning unless generation is broken

If generation quality looks normal, it is probably fine.

B. Verify that output weights are actually tied

Run a pointer check. If tied, both tensors share storage.

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tok = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()

# Harmless even if already tied
model.tie_weights()

inp_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight

print("tied?", inp_emb.data_ptr() == out_emb.data_ptr())
print("shapes:", inp_emb.shape, out_emb.shape)

Expected result for the 3B variant: tied? True.

C. If it is NOT tied, force it

This should not usually be needed, but it is the simplest fix when it happens:

model.config.tie_word_embeddings = True
model.tie_weights()

D. If you use device_map="auto" with CPU offload

Tied weights must live on compatible devices. If you later see device mismatch errors, load fully on one device (or CPU), tie, then move.


4) Text-only inference with the official VLM (no images)

You can do text-only generation by giving only text messages.

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"
tok = MistralCommonBackend.from_pretrained(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()
model.tie_weights()

messages = [
    {"role": "system", "content": "You are a domain assistant. Answer using the company style guide."},
    {"role": "user", "content": "Write a short incident report for a database failover."},
]

inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)

# MistralCommonBackend may return a dict-like payload
if isinstance(inputs, dict):
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    out = model.generate(**inputs, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))
else:
    inputs = inputs.to(model.device)
    out = model.generate(inputs, max_new_tokens=200)
    print(tok.decode(out[0], skip_special_tokens=True))

This uses no vision inputs, so the vision tower is unused at runtime, even though it is loaded. Text-only usage is a documented/expected workflow. (vLLM)


5) Fine-tuning for domain-specific text generation (without training vision)

Reality check: “I don’t want the vision encoder”

You likely mean one (or both):

  1. Do not train vision parameters. Easy.
  2. Do not load vision parameters. Harder unless you use a text-only checkpoint.

Option A (most common): keep the official checkpoint, freeze vision, train text only

This works well if you do LoRA/QLoRA SFT.

Freeze vision modules (typical names):

# After loading model...
for name, p in model.named_parameters():
    if name.startswith("vision_tower"):
        p.requires_grad = False

# Some VLMs also have a projector/adapter; freeze if present
if hasattr(model, "multi_modal_projector"):
    for p in model.multi_modal_projector.parameters():
        p.requires_grad = False

Then apply LoRA only to text transformer modules (common Mistral-style targets):

  • q_proj, k_proj, v_proj, o_proj
  • gate_proj, up_proj, down_proj

Pitfall: because this model is exposed as a conditional-generation VLM class in some stacks, some trainers that assume AutoModelForCausalLM can break (people hit this in evaluation harnesses). (GitHub)

Option B (cleanest): use a text-only extracted checkpoint

If your hard requirement is “no vision weights in memory at all”, use a text-only extracted model.

One example: Aratako/Ministral-3-3B-Instruct-2512-TextOnly explicitly says it is the text-only component and can be loaded via AutoModelForCausalLM. (Hugging Face)

This makes fine-tuning straightforward because you are back in standard CausalLM tooling.

Tradeoffs:

  • It is third-party packaging. You must validate outputs and licensing assumptions yourself (it claims Apache-2.0 and points back to the original). (Hugging Face)
  • But it solves the “don’t load vision encoder” requirement completely.

6) Common pitfalls and “similar issues online” you should recognize

Pitfall 1: Tied embeddings confuse loaders, savers, and sharding tools

  • “Missing lm_head.weight” warnings have shown up for years in tied-head models (T5 is the canonical case). (GitHub)
  • Some recipes recommend explicitly tying weights after load when resuming or manipulating checkpoints. (Hugging Face Forums)

Pitfall 2: Some distributed wrappers crash when a tied key is absent

Accelerate FSDP2 has had a real KeyError: 'lm_head.weight' class of failure when tied weights are involved. (GitHub)
If you see this during fine-tuning, workarounds include:

  • avoid FSDP2 for now (use DDP/FSDP1/single-GPU),
  • ensure tying is done consistently before wrapping,
  • or use a model variant without tied embeddings (8B/14B). vLLM notes 8B/14B have separate embedding/output layers. (vLLM)

Pitfall 3: Version skew for new model families

Ministral-3 / Mistral3 landed recently in Transformers docs. (Hugging Face)
If you see weird behavior, check you are on a compatible Transformers build. There have also been v5-era Mistral3 issues around embedding tying logic. (GitHub)


7) High-quality docs, guides, issues to keep open while implementing

  • Official model card (architecture, variants, intended use): (Hugging Face)
  • Transformers “Ministral3” documentation (classes, usage examples): (Hugging Face)
  • vLLM recipe for Ministral-3 (notes tied embeddings, shows text-only request): (vLLM)
  • Mistral/Unsloth notes calling out “tied embeddings, no output layer” for 3B: (Hugging Face)
  • HF forum thread on missing lm_head.weight during resume (how people reason about it): (Hugging Face Forums)
  • Transformers issue explaining missing lm_head in tied-head models (T5 example): (GitHub)
  • Accelerate issue where tied key absence causes crash in FSDP2: (GitHub)
  • Text-only extracted checkpoint example: (Hugging Face)

“Your case” recommendation

If your priority is domain text generation fine-tuning and you want minimum risk:

  1. Use BF16 weights (good for training). The main instruct model is often distributed in FP8 for inference. (Hugging Face)

  2. Decide your hard constraint:

    • If you can tolerate vision weights being loaded but unused: keep the official model, freeze vision, LoRA the text backbone.
    • If you truly need “no vision encoder at all”: start from a text-only extracted checkpoint and fine-tune as AutoModelForCausalLM. (Hugging Face)
  3. Don’t panic about lm_head.weight missing on 3B. It matches “tied embeddings, no output layer”. Verify tying with the pointer test and call model.tie_weights().


Summary

  • lm_head.weight “MISSING” is expected for Ministral-3 3B because it uses tied embeddings and may not store a separate output head. (Hugging Face)
  • For safety: run the “tied?” pointer check and call model.tie_weights() after loading.
  • Text-only inference is supported by sending only text. (vLLM)
  • To not load vision at all, use a text-only extracted checkpoint (or extract yourself). (Hugging Face)

Thanks a lot.

1 Like

For inference with 3B or 14B, the model is generating gibberish. Is there anything missing in my code? I'm adding the code for reference. Please have a look.

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Instruct-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id, trust_remote_code=True,
                                                 cache_dir="/content/huggingface_cache")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    # device_map="auto",
    device_map="sequential",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    cache_dir="/content/huggingface_cache",
    low_cpu_mem_usage=True,
    force_download=True,
).eval()

model.config.tie_word_embeddings = True
model.tie_weights()

inp_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight
print("tied?", inp_emb.data_ptr() == out_emb.data_ptr())
print("shapes:", inp_emb.shape, out_emb.shape)

from transformers import TextStreamer, GenerationConfig

messages = [
    {"role": "system", "content": "You are a AI domain assistant."},
    {"role": "user", "content": "Create a podcast on elon musk. it must be a monologue and 20000 characters."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs_gen = {k: v.to(model.device) for k, v in inputs.items()}

streamer = TextStreamer(tokenizer, skip_special_tokens=True)

gen_config = GenerationConfig(
    max_new_tokens=20000,
    do_sample=True,
    temperature=0.15,
    top_p=0.9,
)

with torch.no_grad():
    out = model.generate(**inputs_gen, generation_config=gen_config, streamer=streamer)

print(tokenizer.decode(out[0], skip_special_tokens=True))

ـYou are a AI domain assistant.Create a podcast on elon musk. it must be a monologue and 20000 characters. Timeline hyperbol杄/>avanjeèČ /> mac mac mac orderly flank flank flank flank flank reput Sent Sent flank flank popular popular popular popularfĂ€h macroscopic epidemic epidemic.end矩 competition competition ĐžŃĐżĐŸĐ»ŃŒĐ·Ńƒ uĆŒy Course-INFinf minor epidemic àŠ« anisotropy repertoire repertoire TBいい TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB

---------------------------------------------------------------------------

KeyboardInterrupt                         Traceback (most recent call last)

/tmp/ipython-input-707871964.py in <cell line: 0>()
     24 
     25 with torch.no_grad():
---> 26     out = model.generate(**inputs_gen, generation_config=gen_config, streamer=streamer)
     27 
     28 print(tokenizer.decode(out[0], skip_special_tokens=True))



21 frames


/usr/local/lib/python3.12/dist-packages/transformers/models/ministral3/modeling_ministral3.py in forward(self, hidden_states)
    208         hidden_states = hidden_states.to(torch.float32)
    209         variance = hidden_states.pow(2).mean(-1, keepdim=True)
--> 210         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
    211         return self.weight * hidden_states.to(input_dtype)
    212 


KeyboardInterrupt:  

The above setup was on an A100. I tried the same models with Ollama locally and the Ministral models generated just fine.

How can we fix it?

1 Like

Could it be because you’re using the fp8 version of the model weights?

I’ve confirmed that mistralai/Ministral-3-3B-Instruct-2512-BF16 runs fine in a T4 GPU environment. I couldn’t test the 8B or 14B versions due to insufficient VRAM.



Your code is missing two critical correctness rules for Ministral-3, and one practical rule for long generations. Fix those and the “gibberish” problem usually disappears.

What is actually happening

1) You are forcing tied embeddings on 14B. That is wrong.

  • 3B has tied embeddings (input embedding and output head share weights).
  • 8B and 14B do not. They have separate embedding and output layers. (vLLM)

So this part in your 14B code is a bug:

model.config.tie_word_embeddings = True
model.tie_weights()

On 14B it can corrupt the output distribution and produce nonsense tokens.

2) You are using the FP8 checkpoint on an A100.

mistralai/Ministral-3-14B-Instruct-2512 is explicitly FP8. (Hugging Face)
There is a separate BF16 checkpoint: mistralai/Ministral-3-14B-Instruct-2512-BF16. (Hugging Face)

On A100, BF16 is the safest path. The FP8 checkpoint can be “converted to BF16” in Transformers, but that conversion is a separate codepath and can be fragile across versions and configs. The HF discussion explicitly mentions this BF16 conversion path. (Hugging Face)

3) You are asking for max_new_tokens=20000 with streaming.

Even if everything is correct, that will be:

  • very slow (especially if device_map="sequential" causes CPU/GPU shuffling)
  • memory heavy due to KV cache growth
  • likely to be interrupted (your KeyboardInterrupt is consistent with “it’s just taking forever”, not necessarily a crash)

Also, for chat-style generation, it is recommended to use chat templates properly and include a generation prompt when applicable. HF chat template docs recommend add_generation_prompt=True. (Hugging Face)

Fix in 30 seconds (14B on A100)

  1. Switch to the BF16 checkpoint:
    mistralai/Ministral-3-14B-Instruct-2512-BF16 (Hugging Face)

  2. Remove the tie-weights lines entirely (14B must be untied). (vLLM)

  3. Use device_map="auto" (not "sequential") for sane placement and speed.

  4. Generate in chunks. Do not do 20000 tokens in one generate() call.


Clean “known-good” inference code for 14B BF16 on A100 (text-only)

This is the simplest baseline to verify correctness. It uses deterministic decoding first (no sampling) so you can tell if the model is loaded correctly.

# deps:
#   pip install -U "transformers==5.0.0rc0" "mistral-common>=1.8.6" accelerate torch
# refs:
# - 14B BF16 model card: https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512-BF16
# - 14B FP8 model card (shows FP8): https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512
# - 3B tied vs 14B not tied: https://docs.vllm.ai/projects/recipes/en/latest/Mistral/Ministral-3-Instruct.html
# - chat templates guidance: https://huggingface.co/docs/transformers/en/chat_templating

import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Instruct-2512-BF16"

tok = MistralCommonBackend.from_pretrained(model_id)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()

# IMPORTANT: do NOT tie weights for 14B (only 3B is tied)
# model.config.tie_word_embeddings = True  # WRONG for 14B
# model.tie_weights()                     # WRONG for 14B

messages = [
    {"role": "system", "content": "You are an AI domain assistant."},
    {"role": "user", "content": "Write a short incident report for a database failover. Use 5 bullet points."},
]

inputs = tok.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=220,
        do_sample=False,
    )

# Decode only the newly generated tokens (avoid re-printing the prompt)
prompt_len = inputs["input_ids"].shape[1]
new_tokens = out[0, prompt_len:]
print(tok.decode(new_tokens, skip_special_tokens=True))

If this baseline is coherent, your “gibberish” was caused by your previous loading choices, not the model.


If you insist on using the FP8 checkpoint on A100

You can, but treat it as “advanced mode”.

  • The model card says it is FP8. (Hugging Face)
  • The repo discussion mentions Transformers can convert the checkpoint to BF16 and references FineGrainedFP8Config. (Hugging Face)

If you do not need FP8 specifically, do not do this. Use BF16.


Fixing your 3B path (if 3B also looks bad)

For 3B:

  • tied embeddings are expected. (vLLM)
  • prefer BF16 checkpoint on A100: ...-3B-Instruct-2512-BF16 (Hugging Face)
  • use the same “decode only new tokens” pattern

Common “looks like gibberish but isn’t” cases:

  • you decode the entire sequence and see the prompt plus completion
  • you stream tokens and then decode again, so it looks duplicated and messy

How to generate “20000 characters” without going off the rails

Do not do max_new_tokens=20000 in one call. Instead:

  1. Generate 512–2048 tokens per step.
  2. Append a short trailing window (last 256–512 tokens) back into the next prompt.
  3. Stop when character count is reached.

This avoids huge KV cache growth and avoids the “it runs forever so I interrupt it” pattern.
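The loop above can be sketched independently of the model. generate_step below is a stand-in for one bounded model.generate() call that returns the newly generated text; with a real model you would re-tokenize the prompt plus the trailing window each step:

```python
def generate_long(generate_step, prompt: str, target_chars: int,
                  tail_chars: int = 1500) -> str:
    """Accumulate text by repeated bounded generation calls.

    generate_step(context) -> str of newly generated text for one chunk.
    """
    produced = ""
    while len(produced) < target_chars:
        # Short trailing window keeps the context (and KV cache) bounded.
        context = prompt + produced[-tail_chars:]
        chunk = generate_step(context)
        if not chunk:  # model stopped early; avoid an infinite loop
            break
        produced += chunk
    return produced[:target_chars]

# Demo with a fake step that "generates" 100 characters per call:
text = generate_long(lambda ctx: "x" * 100, "PROMPT: ", target_chars=250)
print(len(text))  # 250
```

With a real model, `generate_step` would run `model.generate(..., max_new_tokens=512)` and decode only the new tokens, exactly as in the baseline above.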


Why Ollama “works fine” while Transformers gave junk

Ollama typically uses:

  • a converted runtime format and a fixed known-good prompt template
  • a stable quantized inference path

Your Transformers run combined:

  • a FP8 checkpoint being coerced to BF16
  • forced weight tying on a model that should not be tied (14B)
  • very long generation with a slow device map

Those are exactly the kinds of differences that produce “nonsense tokens” in one stack and “fine” output in another.


Minimal checklist

  • Use BF16 checkpoints for A100 when available (repos with the -BF16 suffix). (Hugging Face)
  • For 14B: do not tie embeddings. (vLLM)
  • Use device_map="auto" not "sequential".
  • Use add_generation_prompt=True with chat templates. (Hugging Face)
  • Decode only newly generated tokens.
  • For huge outputs, generate in chunks.
1 Like

Thanks for the info — that solved my issue! Really appreciate the quick help.

Next, I’ll try fine-tuning the 14B model on some domain-specific data using the template shared earlier and see how it performs.

1 Like
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id, cache_dir="/content/huggingface_cache")

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir="/content/huggingface_cache",
)



def freeze_non_lm(model):
    for name, param in model.named_parameters():
        # Only train language model layers
        if not name.startswith("model.language_model") and not name.startswith("lm_head"):
            param.requires_grad = False

freeze_non_lm(model)

# Verify
for name, param in model.named_parameters():
    print(name, param.requires_grad)


# tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA config (common target modules)
# TRL guidance explains typical LoRA params and target_modules choices. :contentReference[oaicite:13]{index=13}

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)



def create_text(example):
    msgs = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    # return {"text": tokenizer.apply_chat_template(
    #     msgs,
    #     tokenize=True,
    #     continue_final_message=True,
    # )}
    # Just create the chat string without tokenizing
    text = tokenizer.apply_chat_template(msgs, tokenize=True, continue_final_message=True)
    return {"text": text}

ds = ds.map(create_text)
ds

from transformers import DataCollatorForSeq2Seq

# Training arguments
cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_length=2048,
    packing=False,
    logging_steps=10,
    save_steps=200,
    report_to="none",
    # push_to_hub=True,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    peft_config=peft_config,
    args=cfg,
    formatting_func=create_text,
    data_collator=data_collator
    
)

trainer.train()
# trainer.save_model()  
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipython-input-2362894520.py in <cell line: 0>()
     23 )
     24 
---> 25 trainer = SFTTrainer(
     26     model=model,
     27     train_dataset=ds,

7 frames
/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py in add_eos(example, eos_token)
    972 
    973                     def add_eos(example, eos_token):
--> 974                         if "text" in example and not example["text"].endswith(eos_token):  # language modeling case
    975                             example["text"] = example["text"] + eos_token
    976                         elif "completion" in example and not example["completion"].endswith(eos_token):

AttributeError: 'dict' object has no attribute 'endswith'
Adding EOS to train dataset:   0%
 0/230 [00:00<?, ? examples/s]

I encountered an error related to EOS. Could you help me understand how to fix it? Also, could you check whether my training script is correct?

Also, if I set packing=True, it shows this error:


ValueError                                Traceback (most recent call last)

/tmp/ipython-input-2632563400.py in <cell line: 0>()
     23 )
     24 
---> 25 trainer = SFTTrainer(
     26     model=model,
     27     train_dataset=ds,


/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
    675         # Catch some wrong configurations related to VLMs
    676         if self._is_vlm and args.packing:
--> 677             raise ValueError(
    678                 "Packing is not supported for vision-language models. Please set `packing=False` in the SFTConfig."
    679             )


ValueError: Packing is not supported for vision-language models. Please set `packing=False` in the SFTConfig. 
1 Like

maybe:


Why you get the EOS error (endswith on a dict)

trl.SFTTrainer has a dataset-prep step that (for non-conversational datasets) appends an EOS token by doing:

  • example["text"].endswith(eos_token) (language-modeling “text” datasets), or
  • example["completion"].endswith(eos_token) (prompt/completion “string” datasets)

If example["text"] is not a Python string, this crashes with:

AttributeError: 'dict' object has no attribute 'endswith'

That’s what’s happening in your script for two reasons:

  1. Your create_text() returns tokenized output (a dict-like object), not a string
    apply_chat_template(..., tokenize=True, ...) returns tokenized output, whereas tokenize=False returns a string. (Hugging Face)

  2. You’re applying formatting twice
    You do ds = ds.map(create_text) (adds a "text" column), and then you also pass formatting_func=create_text to SFTTrainer. In TRL, formatting_func must return a string, and TRL wraps it as {"text": formatting_func(example)}. (Hugging Face)
    Since your create_text() itself returns {"text": ...}, you end up with nested dicts, and then the EOS appender hits .endswith() on a dict. (GitHub)
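A stripped-down reproduction of the crash, with TRL's wrapping and EOS step reduced to a few lines (no model or tokenizer needed; the string literal is a placeholder, not real template output):

```python
def create_text(example):
    # the user's function: already returns a dict, not a string
    return {"text": "<s>[INST] hi [/INST] hello"}

# TRL wraps formatting_func output as {"text": formatting_func(example)},
# so when formatting_func itself returns a dict, the "text" field nests:
row = {"text": create_text({"user": "hi"})}

# TRL's EOS-append step then does row["text"].endswith(eos), which crashes:
try:
    row["text"].endswith("</s>")
except AttributeError as e:
    print(e)  # 'dict' object has no attribute 'endswith'
```

Returning a plain string from the formatting function (as in the options below) makes `row["text"]` a string again, and `.endswith()` works.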


Fixing the EOS error (pick one clean approach)

Option A (minimal change): keep formatting_func, but return a string and don’t pre-map

  • Remove ds = ds.map(create_text)
  • Make formatting_func return a string (use tokenize=False)
  • Don’t return {"text": ...} from the formatting function
def formatting_func(example):
    msgs = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    # IMPORTANT: tokenize=False -> returns a STRING
    return tokenizer.apply_chat_template(msgs, tokenize=False)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,          # important for your packing issue; see below
    peft_config=peft_config,
    args=cfg,
    formatting_func=formatting_func,
)

Why this works: formatting_func now returns a plain string, so TRL wraps it as {"text": <string>} exactly once, and the EOS-appending step calls .endswith() on a string, as it expects.


Option B: precompute a "text" column (string) and don’t pass formatting_func

def create_text(example):
    msgs = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

ds = ds.map(create_text)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,
    peft_config=peft_config,
    args=cfg,
    # no formatting_func
)

Option C (often best for chat SFT): use a conversational "messages" column

This avoids the “append EOS to raw text” path and lets TRL treat the dataset as conversational.

def to_messages(example):
    return {
        "messages": [
            {"role": "system", "content": example["system"]},
            {"role": "user", "content": example["user"]},
            {"role": "assistant", "content": example["assistant"]},
        ]
    }

ds = ds.map(to_messages)

cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_length=2048,
    packing=False,              # or True after fixing VLM detection (next section)
    assistant_only_loss=True,   # optional, but common for chat SFT
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,
    peft_config=peft_config,
    args=cfg,
)

TRL explicitly documents “standard” (text) vs “conversational” (messages) dataset formats, and the trainer signature shows formatting_func returns a str. (Hugging Face)


Why packing=True raises “not supported for vision-language models”

Your checkpoint is a vision-language model (VLM) (it includes vision capability even if you only use text). (Hugging Face)

In TRL, if you don’t pass processing_class=..., it calls AutoProcessor.from_pretrained(...). If that returns a ProcessorMixin, TRL sets _is_vlm = True, and then it hard-errors on packing=True:

“Packing is not supported for vision-language models.” (GitHub)

How to enable packing for text-only training

If your dataset truly has no images and you want packing, you can usually force TRL into text-only mode by passing the tokenizer as processing_class:

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,   # <- makes TRL treat it as text (PreTrainedTokenizerBase)
    peft_config=peft_config,
    args=SFTConfig(packing=True, ...),
)

This works because TRL decides “VLM vs LLM” primarily from the type of processing_class (ProcessorMixin => VLM; TokenizerBase => not VLM). (GitHub)


Script audit: other issues / improvements

1) Your comment says “create string without tokenizing” but you used tokenize=True

That returns tokenized outputs (dict-like), not a string. Use tokenize=False for strings. (Hugging Face)

2) Don’t use continue_final_message=True for standard supervised chat examples

continue_final_message=True explicitly formats the last message as open-ended without EOS tokens (meant for prefilling a partial assistant response). For typical SFT rows where the assistant answer is complete, it’s better left off. (Hugging Face)

3) You imported the wrong collator and didn’t import the one you use

You imported DataCollatorForSeq2Seq but instantiate DataCollatorForLanguageModeling. In most TRL SFT cases you can simply omit data_collator and let TRL choose (it picks an LM collator vs VLM collator automatically). (Hugging Face)

4) Manual freezing is usually unnecessary when using PEFT via peft_config=...

SFTTrainer will wrap your model with get_peft_model() when peft_config is provided. (GitHub)
If you do want “only LoRA trains”, verify trainable params after trainer init (PEFT provides helpers), but avoid relying on name-prefix heuristics unless you’re sure about module names.
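As a quick sanity check, here is a generic helper (plain PyTorch, not a PEFT API) that reports how many parameters will actually train; run it on the model after SFTTrainer has wrapped it:

```python
import torch.nn as nn

def trainable_report(model: nn.Module):
    """Return (trainable, total) parameter counts for any nn.Module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy demonstration: freeze the weight of a Linear, keep its bias trainable.
layer = nn.Linear(4, 2)          # 8 weight params + 2 bias params = 10 total
layer.weight.requires_grad_(False)
print(trainable_report(layer))   # (2, 10): only the bias still trains
```

PEFT-wrapped models also expose `print_trainable_parameters()`, which reports the same information directly.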

5) Prefer processing_class=tokenizer for this family right now

There are known cases where AutoProcessor / AutoTokenizer paths behave differently from MistralCommonBackend for Ministral tokenization; people have reported cleaner behavior using MistralCommonBackend directly. (GitHub)


Similar/related issues and good references (with context)

  • TRL SFTTrainer docs (dataset formats, packing, assistant-only loss, signature of formatting_func) (Hugging Face)
    Use this to align your dataset shape (“text” vs “messages” vs “prompt/completion”) and understand when packing is appropriate.

  • TRL implementation showing why your two errors occur (VLM detection, packing restriction, and the .endswith(eos) step) (GitHub)
    This is the exact logic behind both stack traces.

  • Transformers tokenizer apply_chat_template contract (string vs tokenized output; continue_final_message meaning) (Hugging Face)
    This explains why tokenize=True gave you a dict-like object.

  • Transformers issue involving Ministral tokenization/processor differences (AutoProcessor/AutoTokenizer vs MistralCommonBackend) (GitHub)
    Relevant because TRL defaults to AutoProcessor unless you override processing_class.

  • TRL issue about VLM handling (context: VLMs and trainer expectations differ from pure CausalLM workflows) (GitHub)

  • Model background: Mistral 3 family adds vision (Hugging Face)
    Explains why tooling often classifies these checkpoints as VLMs even when you feed text only.

If you apply Option A/B/C and also pass processing_class=tokenizer, you should eliminate both the EOS crash and the packing error (assuming your dataset has no image columns).