I need to use this model text-only, for fine-tuning a domain-specific text generation task. Can anyone help me? I don't want the vision encoder.
Unless you want to completely remove the vision encoder from the model, it's not that difficult.
You can fine-tune mistralai/Ministral-3-3B-Instruct-2512 for text-only generation without using the vision encoder at all. You do it by (1) never passing image inputs and (2) freezing vision-side parameters so nothing in the vision path trains.
The important background is: this checkpoint is multimodal by design. It is a ~3.4B language model + 0.4B vision encoder. That is in the official model card. (Hugging Face)
What "I don't want the vision encoder" can mean
Meaning A (recommended): "I will never use images"
Do this:
- Provide text-only prompts.
- Compute the fine-tuning loss on text only.
- Freeze vision weights so they are inert.
This gives you normal text-generator behavior while staying on the official checkpoint. (Hugging Face)
Meaning B (harder): "I want the vision weights removed to save memory"
That is not an official distribution format for the original repo. It typically means:
- Converting the checkpoint to a different architecture layout.
- Editing config and weights.
- Accepting that conversion can introduce differences.
There are community conversions that claim "vision encoder removed" (example: a "TextOnly" Llama-format conversion). Treat those as third-party artifacts and validate carefully. (Hugging Face)
The minimum working software stack (this matters a lot)
Ministral 3 support relies on newer Transformers and Mistral's tokenizer library (mistral-common). The official HF model card explicitly tells you to install Transformers from main for FP8 and to install mistral-common >= 1.8.6 for correct tokenization. (Hugging Face)
If you use a stable older Transformers build, you will hit import and model-type errors. This is a very common failure mode. (Stack Overflow)
Recommended: train from BF16 weights
Use the BF16 checkpoint for fine-tuning. It is the same model family but avoids FP8 complexity. The official BF16 model card describes BF16 VRAM expectations and still includes the vision encoder as a component. (Hugging Face)
Step 1. Install (text-only fine-tuning friendly)
Use one of these patterns (pick one, do not mix randomly):
Option 1: Transformers v5 RC (often simplest)
Some Ministral family cards recommend the first v5 RC or main for Transformers and mistral-common >= 1.8.6. (Hugging Face)
Option 2: Transformers from main (needed for FP8 workflows)
The Instruct-2512 card specifically mentions installing Transformers from main for FP8 support and using mistral-common >= 1.8.6. (Hugging Face)
Practical note: if your environment cannot import Mistral3ForConditionalGeneration / MistralCommonBackend, you are almost always on the wrong Transformers build. (Stack Overflow)
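Before importing model classes, it can help to check the environment programmatically. Below is a minimal stdlib-only sketch; the helper name `check_ministral3_stack` is our own invention, and it only checks that the two key packages are importable (it does not verify the exact versions, which you should still confirm against the model card):

```python
import importlib.util

def check_ministral3_stack():
    """Return a list of human-readable problems with the current environment.

    Only checks that the two key packages exist; confirming you are on a
    v5 RC/main Transformers build and mistral-common >= 1.8.6 is still on you.
    """
    problems = []
    if importlib.util.find_spec("transformers") is None:
        problems.append("transformers is not installed (need v5 RC or main)")
    if importlib.util.find_spec("mistral_common") is None:
        problems.append("mistral-common is not installed (need >= 1.8.6)")
    return problems

# An empty list means both packages are importable.
print(check_ministral3_stack())
```

Run this first in a fresh Colab/venv; it turns the "wrong Transformers build" failure mode into an explicit message instead of a confusing import error later.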
Step 2. Text-only inference (no images, no vision encoder usage)
Transformers' own Ministral3 docs show usage with Mistral3ForConditionalGeneration and MistralCommonBackend. (Hugging Face)
You just remove the image inputs and keep the chat template.
# deps (conceptually):
# - transformers v5 RC/main
# - mistral-common >= 1.8.6
# - torch, accelerate
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Text-only chat. No images. No pixel_values.
messages = [
    {"role": "system", "content": "You write domain-specific text in the required style."},
    {"role": "user", "content": "Write a domain-style explanation of <TOPIC> with 3 bullet takeaways."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)

print(tokenizer.decode(out[0], skip_special_tokens=True))
Why mistral-common matters
Mistral models are trained with Mistral's tokenization rules. There have been real-world mismatches between mistral_common and the generic tokenizers backend that can change token IDs for edge cases (escaped strings etc.). That is why the model card tells you to install mistral-common and why this mismatch was filed as a Transformers bug. (Hugging Face)
Step 3. Make fine-tuning "language-only" in practice
Goal
- Update only language behavior for your domain generation.
- Keep vision components frozen and unused.
Two controls you should use
- Input discipline: never put images in your training examples.
- Parameter discipline: freeze vision parameters.
There is a subtle pitfall: in multimodal models, vision modules can also contain layer names like q_proj/k_proj/v_proj. So "LoRA target_modules" alone does not guarantee "LM-only LoRA." A recent HF forum thread calls this out explicitly. (Hugging Face Forums)
Freeze vision parameters (simple, robust)
def freeze_vision(model):
    for name, p in model.named_parameters():
        n = name.lower()
        if "vision" in n or "image" in n or "pixel" in n:
            p.requires_grad = False

freeze_vision(model)
This is crude but effective. After this, even if something in the vision path exists, it will not train.
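You can dry-run that name filter before touching a real model. The parameter names below are hypothetical, modeled on what Ministral-3-style checkpoints typically print; the real list comes from model.named_parameters():

```python
def is_vision_param(name: str) -> bool:
    # Same substring test as freeze_vision() above
    n = name.lower()
    return "vision" in n or "image" in n or "pixel" in n

# Hypothetical names modeled on typical VLM checkpoints
param_names = [
    "model.language_model.layers.0.self_attn.q_proj.weight",
    "model.vision_tower.transformer.layers.0.attention.q_proj.weight",
    "model.multi_modal_projector.linear_1.weight",
    "lm_head.weight",
]
frozen = [n for n in param_names if is_vision_param(n)]
print(frozen)
```

Note that a projector named multi_modal_projector would not match this filter; if your checkpoint has one, freeze it explicitly as well (a later reply in this thread does exactly that).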
Step 4. Supervised fine-tuning (SFT) for domain text generation
For your use case ("domain-specific text generation"), the standard starting point is SFT: prompt + ideal completion pairs.
TRL's SFTTrainer is the common "works-first" route. The official TRL docs show the basic pattern and explain that it can work with chat templates. (Hugging Face)
Dataset format you want
Store each row as either:
{"prompt": "...", "completion": "..."}(single turn), or{"messages": [...]}(multi-turn chat)
If you want consistent style, put your style guide in the system message across the dataset.
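As a concrete illustration, here is how the single-turn format round-trips through JSONL (one JSON object per line, which the datasets "json" loader accepts). The example rows are invented placeholders:

```python
import json

# Invented placeholder rows in the {"prompt", "completion"} format
rows = [
    {"prompt": "Summarize the Q3 incident log.",
     "completion": "Q3 had two Sev-2 incidents, both resolved within SLA."},
    {"prompt": "Draft a maintenance notice for the billing API.",
     "completion": "The billing API will be unavailable Saturday 02:00-04:00 UTC."},
]

def to_jsonl(rows):
    # One JSON object per line; write this out as train.jsonl
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)

def from_jsonl(text):
    # Parse JSONL back into row dicts, skipping blank lines
    return [json.loads(line) for line in text.splitlines() if line.strip()]

print(to_jsonl(rows))
```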
Minimal SFT + LoRA recipe (text-only)
# deps:
# pip install trl peft datasets accelerate
# plus transformers v5 RC/main + mistral-common
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Freeze vision
def freeze_vision(model):
    for name, p in model.named_parameters():
        n = name.lower()
        if "vision" in n or "image" in n or "pixel" in n:
            p.requires_grad = False

freeze_vision(model)

# LoRA config (common target modules)
# TRL guidance explains typical LoRA params and target_modules choices.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

SYSTEM = "You are a domain-specific generator. Follow the domain style guide."

def formatting_func(example):
    # example has: prompt, completion
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return tokenizer.apply_chat_template(msgs, tokenize=False)

ds = load_dataset("json", data_files={"train": "train.jsonl"})["train"]

cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_length=2048,  # named max_seq_length in older TRL releases
    packing=True,
    logging_steps=10,
    save_steps=200,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # older TRL releases used tokenizer=
    train_dataset=ds,
    peft_config=peft_config,
    args=cfg,
    formatting_func=formatting_func,
)
trainer.train()
trainer.save_model()
Why this structure
- TRL's SFTTrainer is built for this workflow. (Hugging Face)
- PEFT LoRA reduces trainable params drastically. (Hugging Face)
- Freezing vision avoids accidentally tuning vision blocks that share module names. (Hugging Face Forums)
- Using the correct chat template avoids silent quality loss from format mismatch. (Mistral AI)
Step 5. If your GPU is small: QLoRA (4-bit) instead of BF16 LoRA
QLoRA means:
- Load base model in 4-bit.
- Train only LoRA adapters.
Hugging Face's QLoRA overview is in the bitsandbytes 4-bit blog. (Hugging Face)
If you do 4-bit training, HF recommends NF4 for training 4-bit base models. (Hugging Face)
PEFT also documents the âquantize then train adaptersâ concept. (Hugging Face)
Minimal change (conceptually):
from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # recommended for training 4-bit base models
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb,
)
Pitfalls that commonly waste time
Pitfall 1: Wrong Transformers version
Symptom: missing Mistral3ForConditionalGeneration or MistralCommonBackend.
Fix: follow the model card's guidance to use v5 RC/main and install mistral-common >= 1.8.6. (Hugging Face)
Pitfall 2: Tokenization mismatch
Symptom: model "sort of works" but has weird failures on logs, escaped strings, or format-heavy inputs.
Fix: use the Mistral tokenizer backend (mistral-common) as recommended, and treat edge cases seriously. (GitHub)
Pitfall 3: "LM-only LoRA" accidentally hits vision
Symptom: you think you tuned only LM layers, but adapters attach to vision blocks too.
Fix: freeze vision, and do not rely only on target_modules. (Hugging Face Forums)
Pitfall 4: Serving differences (vLLM)
If you later serve with vLLM, do not assume you can pass a chat template per request for Mistral tokenizers. vLLM explicitly errors or warns about chat_template for Mistral tokenizers. (vLLM)
Practical fix: render the prompt text yourself (apply chat template client-side) before sending to vLLM.
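For illustration only, client-side rendering can look roughly like the sketch below. It uses a simplified [INST]-style format and is NOT the real Ministral-3 template; in practice you would call tokenizer.apply_chat_template(messages, tokenize=False) with mistral-common and send the resulting string to vLLM:

```python
def render_prompt(messages):
    """Toy [INST]-style renderer, for illustration.

    Real code should render with the model's own chat template via
    mistral-common instead of this approximation.
    """
    system = ""
    parts = []
    for m in messages:
        if m["role"] == "system":
            system = m["content"]  # fold the system prompt into the next user turn
        elif m["role"] == "user":
            body = (system + "\n\n" + m["content"]) if system else m["content"]
            parts.append(f"[INST] {body} [/INST]")
            system = ""
        elif m["role"] == "assistant":
            parts.append(m["content"])
    return "".join(parts)

prompt = render_prompt([
    {"role": "system", "content": "You are a domain assistant."},
    {"role": "user", "content": "Summarize yesterday's deploy."},
])
print(prompt)
```

The point is structural: whoever renders the template (client or server), exactly one party must do it, or you get doubled or missing control tokens.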
If you truly want "no vision encoder" at all
Your choices are:
- Stay official (recommended): ignore vision inputs, freeze vision weights. (Hugging Face)
- Use a third-party "TextOnly" conversion: higher compatibility, but it is not the original architecture and can differ. Example conversions exist. (Hugging Face)
- Pick a pure text-only base model (different model family). This is often simpler if you never need images.
Quick checklist for your case
- Use mistralai/Ministral-3-3B-Instruct-2512-BF16 for training. (Hugging Face)
- Install Transformers v5 RC/main and mistral-common >= 1.8.6. (Hugging Face)
- Always format prompts with Mistral chat templates. (Mistral AI)
- Freeze vision weights. Use text-only examples. (Hugging Face Forums)
- Fine-tune with TRL SFTTrainer + PEFT LoRA, or QLoRA if VRAM is tight. (Hugging Face)
Summary bullets
- The checkpoint includes a vision encoder. You cannot "turn it off" at the file level without conversion. (Hugging Face)
- You can use it as text-only by never passing images and freezing vision parameters. (Hugging Face Forums)
- Use Transformers v5 RC/main plus mistral-common >= 1.8.6 to avoid common setup and tokenization pitfalls. (Hugging Face)
- Fine-tune with TRL SFTTrainer + PEFT LoRA, or QLoRA NF4 if VRAM is limited. (Hugging Face)
Thanks for the information.
But when I load this text-only model, it shows an lm_head missing error.
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend, GenerationConfig

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Download complete:
0.00/0.00 [00:00<?, ?B/s]
Fetching 2 files: 100%
2/2 [00:00<00:00, 133.00it/s]
Loading weights: 100%
458/458 [00:21<00:00, 71.59it/s, Materializing param=model.vision_tower.transformer.layers.23.ffn_norm.weight]
Mistral3ForConditionalGeneration LOAD REPORT from: mistralai/Ministral-3-3B-Instruct-2512-BF16
Key | Status |
---------------+---------+-
lm_head.weight | MISSING |
Notes:
- MISSING :those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
Seems okay to ignore that error message?
Yes. Your lm_head.weight "MISSING" is usually expected for Ministral-3 3B. It is not a real load failure in most setups. It happens because the 3B variant uses tied embeddings, meaning the output head shares weights with the input token embedding, and checkpoints often do not store a separate lm_head.weight. (Hugging Face)
Below is what is going on, how to verify it, and how to run and fine-tune text-only without using vision.
1) First, this is not actually a "text-only model"
mistralai/Ministral-3-3B-Instruct-2512* is a vision-language model (VLM): a ~3.4B language model plus a ~0.4B vision encoder. (Hugging Face)
Key point:
- Text-only usage is supported. You simply do not pass images. vLLM's official recipe even has a "Text only request" section showing that the same served model can be used with only text messages. (vLLM)
- But loading Mistral3ForConditionalGeneration will still load the vision tower weights unless you use a text-only-extracted checkpoint.
So you have two "text-only" meanings:
- Text-only inputs (keep vision weights loaded but unused). This is the official path.
- Text-only weights (do not load vision tower at all). This requires a text-only checkpoint or you extracting weights yourself.
2) Why lm_head.weight is "missing" for the 3B model
Ministral-3 3B is special:
- It uses tied embeddings ("share the embedding and output layers"). (vLLM)
- Model docs/notes explicitly say 3B has tied embeddings and "no output layer" to reduce weights. (Hugging Face)
- The BF16 config for the model also indicates embedding tying in the text config (tie_word_embeddings: true). (Hugging Face)
So a checkpoint can legitimately omit lm_head.weight. This is the same pattern you see in other HF models with tied heads (classic example: T5). (GitHub)
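The tying mechanism is easy to see in a toy module. This sketch (our own TinyTiedLM, not a real model class) shows why a checkpoint with tied weights needs no separate lm_head tensor:

```python
import torch
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Toy model: the output head shares its weight tensor with the embedding."""
    def __init__(self, vocab_size=10, dim=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # the tie: one storage, two names

m = TinyTiedLM()
# Same pointer check the rest of this answer recommends for the 3B model
tied = m.embed.weight.data_ptr() == m.lm_head.weight.data_ptr()
print("tied?", tied)
# Only one tensor needs saving; "lm_head.weight" can legitimately be
# omitted from the checkpoint and recreated by tying after load.
```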
Why the loader prints a scary message anyway
Some loading paths print "MISSING" before weight tying is applied, or they warn even when it is safe. This confusion is a known theme across models and tools when tied parameters are involved. (GitHub)
3) What you should do right now (verify + make it safe)
A. Treat it as a warning unless generation is broken
If generation quality looks normal, it is probably fine.
B. Verify that output weights are actually tied
Run a pointer check. If tied, both tensors share storage.
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"

tok = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()
# Harmless even if already tied
model.tie_weights()
inp_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight
print("tied?", inp_emb.data_ptr() == out_emb.data_ptr())
print("shapes:", inp_emb.shape, out_emb.shape)
Expected result for the 3B variant: tied? True.
C. If it is NOT tied, force it
This should not usually be needed, but it is the simplest fix when it happens:
model.config.tie_word_embeddings = True
model.tie_weights()
D. If you use device_map="auto" with CPU offload
Tied weights must live on compatible devices. If you later see device mismatch errors, load fully on one device (or CPU), tie, then move.
4) Text-only inference with the official VLM (no images)
You can do text-only generation by giving only text messages.
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend
model_id = "mistralai/Ministral-3-3B-Instruct-2512-BF16"
tok = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
torch_dtype=torch.bfloat16,
).eval()
model.tie_weights()
messages = [
{"role": "system", "content": "You are a domain assistant. Answer using the company style guide."},
{"role": "user", "content": "Write a short incident report for a database failover."},
]
inputs = tok.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt",
)
# MistralCommonBackend may return a dict-like payload
if isinstance(inputs, dict):
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
else:
inputs = inputs.to(model.device)
out = model.generate(inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
This uses no vision inputs, so the vision tower is unused at runtime, even though it is loaded. Text-only usage is a documented/expected workflow. (vLLM)
5) Fine-tuning for domain-specific text generation (without training vision)
Reality check: "I don't want the vision encoder"
You likely mean one (or both):
- Do not train vision parameters. Easy.
- Do not load vision parameters. Harder unless you use a text-only checkpoint.
Option A (most common): keep the official checkpoint, freeze vision, train text only
This works well if you do LoRA/QLoRA SFT.
Freeze vision modules (typical names):
# After loading model...
for name, p in model.named_parameters():
    # Match anywhere in the name: checkpoints often prefix with "model.",
    # e.g. "model.vision_tower...", which startswith("vision_tower") would miss
    if "vision_tower" in name:
        p.requires_grad = False

# Some VLMs also have a projector/adapter; freeze if present
if hasattr(model, "multi_modal_projector"):
    for p in model.multi_modal_projector.parameters():
        p.requires_grad = False
Then apply LoRA only to text transformer modules (common Mistral-style targets): q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
Pitfall: because this model is exposed as a conditional-generation VLM class in some stacks, some trainers that assume AutoModelForCausalLM can break (people hit this in evaluation harnesses). (GitHub)
Option B (cleanest): use a text-only extracted checkpoint
If your hard requirement is "no vision weights in memory at all", use a text-only extracted model.
One example: Aratako/Ministral-3-3B-Instruct-2512-TextOnly explicitly says it is the text-only component and can be loaded via AutoModelForCausalLM. (Hugging Face)
This makes fine-tuning straightforward because you are back in standard CausalLM tooling.
Tradeoffs:
- It is third-party packaging. You must validate outputs and licensing assumptions yourself (it claims Apache-2.0 and points back to the original). (Hugging Face)
- But it solves the "don't load vision encoder" requirement completely.
6) Common pitfalls and "similar issues online" you should recognize
Pitfall 1: Tied embeddings confuse loaders, savers, and sharding tools
- "Missing lm_head.weight" warnings have shown up for years in tied-head models (T5 is the canonical case). (GitHub)
- Some recipes recommend explicitly tying weights after load when resuming or manipulating checkpoints. (Hugging Face Forums)
Pitfall 2: Some distributed wrappers crash when a tied key is absent
Accelerate FSDP2 has had a real KeyError: 'lm_head.weight' class of failure when tied weights are involved. (GitHub)
If you see this during fine-tuning, workarounds include:
- avoid FSDP2 for now (use DDP/FSDP1/single-GPU),
- ensure tying is done consistently before wrapping,
- or use a model variant without tied embeddings (8B/14B). vLLM notes 8B/14B have separate embedding/output layers. (vLLM)
Pitfall 3: Version skew for new model families
Ministral-3 / Mistral3 landed recently in Transformers docs. (Hugging Face)
If you see weird behavior, check you are on a compatible Transformers build. There have also been v5-era Mistral3 issues around embedding tying logic. (GitHub)
7) High-quality docs, guides, issues to keep open while implementing
- Official model card (architecture, variants, intended use): (Hugging Face)
- Transformers "Ministral3" documentation (classes, usage examples): (Hugging Face)
- vLLM recipe for Ministral-3 (notes tied embeddings, shows text-only request): (vLLM)
- Mistral/Unsloth notes calling out "tied embeddings, no output layer" for 3B: (Hugging Face)
- HF forum thread on missing lm_head.weight during resume (how people reason about it): (Hugging Face Forums)
- Transformers issue explaining missing lm_head in tied-head models (T5 example): (GitHub)
- Accelerate issue where tied key absence causes crash in FSDP2: (GitHub)
- Text-only extracted checkpoint example: (Hugging Face)
"Your case" recommendation
If your priority is domain text generation fine-tuning and you want minimum risk:
- Use BF16 weights (good for training). The main instruct model is often distributed in FP8 for inference. (Hugging Face)
- Decide your hard constraint:
  - If you can tolerate vision weights being loaded but unused: keep the official model, freeze vision, LoRA the text backbone.
  - If you truly need "no vision encoder at all": start from a text-only extracted checkpoint and fine-tune as AutoModelForCausalLM. (Hugging Face)
- Don't panic about lm_head.weight missing on 3B. It matches "tied embeddings, no output layer". Verify tying with the pointer test and call model.tie_weights().
Summary
- lm_head.weight "MISSING" is expected for Ministral-3 3B because it uses tied embeddings and may not store a separate output head. (Hugging Face)
- For safety: run the "tied?" pointer check and call model.tie_weights() after loading.
- Text-only inference is supported by sending only text. (vLLM)
- To not load vision at all, use a text-only extracted checkpoint (or extract yourself). (Hugging Face)
Thanks a lot.
For inference with 3B or 14B, the model is generating gibberish. Is there anything missing in my code? I'm adding the code for reference. Please have a look.
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Instruct-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id, trust_remote_code=True,
                                                 cache_dir="/content/huggingface_cache")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    # device_map="auto",
    device_map="sequential",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    cache_dir="/content/huggingface_cache",
    low_cpu_mem_usage=True,
    force_download=True,
).eval()

model.config.tie_word_embeddings = True
model.tie_weights()

inp_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings().weight
print("tied?", inp_emb.data_ptr() == out_emb.data_ptr())
print("shapes:", inp_emb.shape, out_emb.shape)

from transformers import TextStreamer, GenerationConfig

messages = [
    {"role": "system", "content": "You are a AI domain assistant."},
    {"role": "user", "content": "Create a podcast on elon musk. it must be a monologue and 20000 characters."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)
inputs_gen = {k: v.to(model.device) for k, v in inputs.items()}

streamer = TextStreamer(tokenizer, skip_special_tokens=True)
gen_config = GenerationConfig(
    max_new_tokens=20000,
    do_sample=True,
    temperature=0.15,
    top_p=0.9,
)

with torch.no_grad():
    out = model.generate(**inputs_gen, generation_config=gen_config, streamer=streamer)

print(tokenizer.decode(out[0], skip_special_tokens=True))
ÙYou are a AI domain assistant.Create a podcast on elon musk. it must be a monologue and 20000 characters. Timeline hyperbolæ„/>avanjeèČ /> mac mac mac orderly flank flank flank flank flank reput Sent Sent flank flank popular popular popular popularfĂ€h macroscopic epidemic epidemic.end矩 competition competition ĐžŃĐżĐŸĐ»ŃĐ·Ń uĆŒy Course-INFinf minor epidemic àŠ« anisotropy repertoire repertoire TBăă TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB TB
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
/tmp/ipython-input-707871964.py in <cell line: 0>()
24
25 with torch.no_grad():
---> 26 out = model.generate(**inputs_gen, generation_config=gen_config, streamer=streamer)
27
28 print(tokenizer.decode(out[0], skip_special_tokens=True))
21 frames
/usr/local/lib/python3.12/dist-packages/transformers/models/ministral3/modeling_ministral3.py in forward(self, hidden_states)
208 hidden_states = hidden_states.to(torch.float32)
209 variance = hidden_states.pow(2).mean(-1, keepdim=True)
--> 210 hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
211 return self.weight * hidden_states.to(input_dtype)
212
KeyboardInterrupt:
The above setup was on an A100.
And I tried the same with Ollama locally and the Ministral models were generating just fine.
How can we fix it?
Could it be because you're using the fp8 version of the model weights?
I've confirmed that mistralai/Ministral-3-3B-Instruct-2512-BF16 runs fine on a T4 GPU environment. I couldn't test the 8B or 14B versions due to insufficient VRAM…
Your code is missing two critical correctness rules for Ministral-3, and one practical rule for long generations. Fix those and the "gibberish" problem usually disappears.
What is actually happening
1) You are forcing tied embeddings on 14B. That is wrong.
- 3B has tied embeddings (input embedding and output head share weights).
- 8B and 14B do not. They have separate embedding and output layers. (vLLM)
So this part in your 14B code is a bug:
model.config.tie_word_embeddings = True
model.tie_weights()
On 14B it can corrupt the output distribution and produce nonsense tokens.
2) You are using the FP8 checkpoint on an A100.
mistralai/Ministral-3-14B-Instruct-2512 is explicitly FP8. (Hugging Face)
There is a separate BF16 checkpoint: mistralai/Ministral-3-14B-Instruct-2512-BF16. (Hugging Face)
On A100, BF16 is the safest path. The FP8 checkpoint can be "converted to BF16" in Transformers, but that conversion is a separate codepath and can be fragile across versions and configs. The HF discussion explicitly mentions this BF16 conversion path. (Hugging Face)
3) You are asking for max_new_tokens=20000 with streaming.
Even if everything is correct, that will be:
- very slow (especially if device_map="sequential" causes CPU/GPU shuffling)
- memory heavy due to KV cache growth
- likely to be interrupted (your KeyboardInterrupt is consistent with "it's just taking forever", not necessarily a crash)
Also, for chat-style generation, it is recommended to use chat templates properly and include a generation prompt when applicable. HF chat template docs recommend add_generation_prompt=True. (Hugging Face)
Fix in 30 seconds (14B on A100)
- Switch to the BF16 checkpoint: mistralai/Ministral-3-14B-Instruct-2512-BF16. (Hugging Face)
- Remove the tie-weights lines entirely (14B must be untied). (vLLM)
- Use device_map="auto" (not "sequential") for sane placement and speed.
- Generate in chunks. Do not do 20000 tokens in one generate() call.
Clean "known-good" inference code for 14B BF16 on A100 (text-only)
This is the simplest baseline to verify correctness. It uses deterministic decoding first (no sampling) so you can tell if the model is loaded correctly.
# deps:
# pip install -U "transformers==5.0.0rc0" "mistral-common>=1.8.6" accelerate torch
# refs:
# - 14B BF16 model card: https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512-BF16
# - 14B FP8 model card (shows FP8): https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512
# - 3B tied vs 14B not tied: https://docs.vllm.ai/projects/recipes/en/latest/Mistral/Ministral-3-Instruct.html
# - chat templates guidance: https://huggingface.co/docs/transformers/en/chat_templating
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Instruct-2512-BF16"

tok = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()

# IMPORTANT: do NOT tie weights for 14B (only 3B is tied)
# model.config.tie_word_embeddings = True  # WRONG for 14B
# model.tie_weights()                      # WRONG for 14B

messages = [
    {"role": "system", "content": "You are an AI domain assistant."},
    {"role": "user", "content": "Write a short incident report for a database failover. Use 5 bullet points."},
]

inputs = tok.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=220,
        do_sample=False,
    )

# Decode only the newly generated tokens (avoid re-printing the prompt)
prompt_len = inputs["input_ids"].shape[1]
new_tokens = out[0, prompt_len:]
print(tok.decode(new_tokens, skip_special_tokens=True))
If this baseline is coherent, your "gibberish" was caused by your previous loading choices, not the model.
If you insist on using the FP8 checkpoint on A100
You can, but treat it as "advanced mode".
- The model card says it is FP8. (Hugging Face)
- The repo discussion mentions Transformers can convert the checkpoint to BF16 and references FineGrainedFP8Config. (Hugging Face)
If you do not need FP8 specifically, do not do this. Use BF16.
Fixing your 3B path (if 3B also looks bad)
For 3B:
- tied embeddings are expected. (vLLM)
- prefer the BF16 checkpoint on A100: ...-3B-Instruct-2512-BF16 (Hugging Face)
- use the same "decode only new tokens" pattern

Common "looks like gibberish but isn't" cases:
- you decode the entire sequence and see the prompt plus completion
- you stream tokens and then decode again, so it looks duplicated and messy
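The first case is easy to reproduce with plain Python lists standing in for token-id tensors; slicing off the prompt length is all the "decode only new tokens" pattern does:

```python
# Plain-list stand-ins for token-id tensors (values are arbitrary)
prompt_ids = [1, 5, 9, 2]               # what you fed into generate()
full_output = [1, 5, 9, 2, 7, 8, 3]     # generate() returns prompt + completion
new_ids = full_output[len(prompt_ids):] # keep only the completion
print(new_ids)
```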
How to generate "20000 characters" without going off the rails
Do not do max_new_tokens=20000 in one call. Instead:
- Generate 512â2048 tokens per step.
- Append a short trailing window (last 256â512 tokens) back into the next prompt.
- Stop when character count is reached.
This avoids huge KV cache growth and avoids the "it runs forever so I interrupt it" pattern.
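The chunked loop can be sketched as below. generate_step() is a stub standing in for a real model.generate() call, and the loop counts characters for simplicity; a real version would work in tokens:

```python
def generate_step(prompt: str, max_new_chars: int) -> str:
    """Stub: stands in for one bounded model.generate() call."""
    return ("lorem ipsum " * 200)[:max_new_chars]

def generate_long(prompt: str, target_chars: int,
                  chunk_chars: int = 2000, tail_chars: int = 500) -> str:
    """Generate target_chars of text in bounded chunks, feeding only a
    trailing window of prior output back in so the context stays small."""
    out = ""
    while len(out) < target_chars:
        # Carry the prompt plus only the last tail_chars of output forward
        context = prompt + "\n" + out[-tail_chars:]
        out += generate_step(context, chunk_chars)
    return out[:target_chars]

text = generate_long("Podcast monologue about <TOPIC>:", 5000)
```

Swap generate_step() for a real generate()-plus-decode call and the same control flow applies; tune chunk and tail sizes to your VRAM.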
Why Ollama "works fine" while Transformers gave junk
Ollama typically uses:
- a converted runtime format and a fixed known-good prompt template
- a stable quantized inference path
Your Transformers run combined:
- an FP8 checkpoint being coerced to BF16
- forced weight tying on a model that should not be tied (14B)
- very long generation with a slow device map
Those are exactly the kinds of differences that produce "nonsense tokens" in one stack and "fine" output in another.
Minimal checklist
- Use BF16 checkpoints for A100 when available (…-BF16). (Hugging Face)
- For 14B: do not tie embeddings. (vLLM)
- Use device_map="auto", not "sequential".
- Use add_generation_prompt=True with chat templates. (Hugging Face)
- Decode only newly generated tokens.
- For huge outputs, generate in chunks.
Thanks for the info, that solved my issue! Really appreciate the quick help.
Next, I'll try fine-tuning the 14B model on some domain-specific data using the template shared earlier and see how it performs.
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend
model_id = "mistralai/Ministral-3-14B-Instruct-2512-BF16"
tokenizer = MistralCommonBackend.from_pretrained(model_id, cache_dir="/content/huggingface_cache")
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir="/content/huggingface_cache",
)

def freeze_non_lm(model):
    for name, param in model.named_parameters():
        # Only train language model layers
        if not name.startswith("model.language_model") and not name.startswith("lm_head"):
            param.requires_grad = False

freeze_non_lm(model)

# Verify
for name, param in model.named_parameters():
    print(name, param.requires_grad)
# tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# LoRA config (common target modules)
# TRL guidance explains typical LoRA params and target_modules choices.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
def create_text(example):
    msgs = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    # return {"text": tokenizer.apply_chat_template(
    #     msgs,
    #     tokenize=True,
    #     continue_final_message=True,
    # )}
    # Just create the chat string without tokenizing
    text = tokenizer.apply_chat_template(msgs, tokenize=True, continue_final_message=True)
    return {"text": text}
ds = ds.map(create_text)
ds
from transformers import DataCollatorForSeq2Seq
# Training arguments
cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_length=2048,
    packing=False,
    logging_steps=10,
    save_steps=200,
    report_to="none",
    # push_to_hub=True,
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    peft_config=peft_config,
    args=cfg,
    formatting_func=create_text,
    data_collator=data_collator
)
trainer.train()
# trainer.save_model()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipython-input-2362894520.py in <cell line: 0>()
23 )
24
---> 25 trainer = SFTTrainer(
26 model=model,
27 train_dataset=ds,
7 frames
/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py in add_eos(example, eos_token)
972
973 def add_eos(example, eos_token):
--> 974 if "text" in example and not example["text"].endswith(eos_token): # language modeling case
975 example["text"] = example["text"] + eos_token
976 elif "completion" in example and not example["completion"].endswith(eos_token):
AttributeError: 'dict' object has no attribute 'endswith'
Adding EOS to train dataset: 0%
0/230 [00:00<?, ? examples/s]
I encountered an error related to eos. Could you help me understand how to fix it? Also, could you check if my training script is correct?
And if we set packing=True, it shows this error:
ValueError Traceback (most recent call last)
/tmp/ipython-input-2632563400.py in <cell line: 0>()
23 )
24
---> 25 trainer = SFTTrainer(
26 model=model,
27 train_dataset=ds,
/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
675 # Catch some wrong configurations related to VLMs
676 if self._is_vlm and args.packing:
--> 677 raise ValueError(
678 "Packing is not supported for vision-language models. Please set `packing=False` in the SFTConfig."
679 )
ValueError: Packing is not supported for vision-language models. Please set `packing=False` in the SFTConfig.
Maybe this helps:
Why you get the EOS error (endswith on a dict)
trl.SFTTrainer has a dataset-prep step that (for non-conversational datasets) appends an EOS token by doing:
- example["text"].endswith(eos_token) (language-modeling "text" datasets), or
- example["completion"].endswith(eos_token) (prompt/completion "string" datasets)
If example["text"] is not a Python string, this crashes with:
AttributeError: 'dict' object has no attribute 'endswith'
That's what's happening in your script for two reasons:
1) Your create_text() returns tokenized output (a dict-like object), not a string.
   apply_chat_template(..., tokenize=True, ...) returns tokenized output, whereas tokenize=False returns a string. (Hugging Face)
2) You're applying formatting twice.
   You do ds = ds.map(create_text) (which adds a "text" column), and then you also pass formatting_func=create_text to SFTTrainer. In TRL, formatting_func must return a string, and TRL wraps it as {"text": formatting_func(example)}. (Hugging Face)
Since your create_text() itself returns {"text": ...}, you end up with nested dicts, and then the EOS appender hits .endswith() on a dict. (GitHub)
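You can reproduce the crash in isolation: `.endswith` exists on str but not on dict, which is exactly what the TRL stack trace shows. A minimal sketch (the dict shape below mimics tokenized `apply_chat_template` output and is an assumption for illustration):

```python
# A tokenized chat-template result is dict-like, not a string (assumed shape):
example = {"text": {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]}}

# This is effectively what TRL's add_eos preprocessing step does:
try:
    example["text"].endswith("</s>")
except AttributeError as err:
    print(err)  # -> 'dict' object has no attribute 'endswith'

# With tokenize=False the column is a plain string and the same check is fine:
example = {"text": "Hello</s>"}
assert example["text"].endswith("</s>")
```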
Fixing the EOS error (pick one clean approach)
Option A (minimal change): keep formatting_func, but return a string and don't pre-map
- Remove ds = ds.map(create_text)
- Make formatting_func return a string (use tokenize=False)
- Don't return {"text": ...} from the formatting function
def formatting_func(example):
    msgs = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    # IMPORTANT: tokenize=False -> returns a STRING
    return tokenizer.apply_chat_template(msgs, tokenize=False)
trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,  # important for your packing issue; see below
    peft_config=peft_config,
    args=cfg,
    formatting_func=formatting_func,
)
Why this works:
- TRL expects formatting_func: Callable[[dict], str]. (Hugging Face)
- tokenize=False returns a string. (Hugging Face)
Option B: precompute a "text" column (string) and don't pass formatting_func
def create_text(example):
    msgs = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

ds = ds.map(create_text)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,
    peft_config=peft_config,
    args=cfg,
    # no formatting_func
)
Option C (often best for chat SFT): use a conversational "messages" column
This avoids the "append EOS to raw text" path and lets TRL treat the dataset as conversational.
def to_messages(example):
    return {
        "messages": [
            {"role": "system", "content": example["system"]},
            {"role": "user", "content": example["user"]},
            {"role": "assistant", "content": example["assistant"]},
        ]
    }

ds = ds.map(to_messages)

cfg = SFTConfig(
    output_dir="ministral3_domain_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    max_length=2048,
    packing=False,  # or True after fixing VLM detection (next section)
    assistant_only_loss=True,  # optional, but common for chat SFT
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,
    peft_config=peft_config,
    args=cfg,
)
TRL explicitly documents "standard" (text) vs "conversational" (messages) dataset formats, and the trainer signature shows formatting_func returns a str. (Hugging Face)
Why packing=True raises "not supported for vision-language models"
Your checkpoint is a vision-language model (VLM): it includes vision capability even if you only use text. (Hugging Face)
In TRL, if you don't pass processing_class=..., it calls AutoProcessor.from_pretrained(...). If that returns a ProcessorMixin, TRL sets _is_vlm = True, and then it hard-errors on packing=True:
"Packing is not supported for vision-language models." (GitHub)
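The branch can be illustrated with stand-in classes. This is only a sketch of the type-based decision, not TRL's actual code; the real check is against transformers' ProcessorMixin and PreTrainedTokenizerBase, which the empty classes below merely stand in for.

```python
# Stand-ins for the transformers base classes TRL checks against:
class PreTrainedTokenizerBase:
    pass

class ProcessorMixin:
    pass

def looks_like_vlm(processing_class):
    # TRL flags the run as VLM when the processing class is a full
    # processor (image + text) rather than a plain tokenizer.
    return isinstance(processing_class, ProcessorMixin)

# AutoProcessor.from_pretrained on a Ministral 3 repo returns a processor,
# so the default path trips the VLM branch and blocks packing:
print(looks_like_vlm(ProcessorMixin()))           # -> True
# Passing the tokenizer explicitly keeps you on the text-only path:
print(looks_like_vlm(PreTrainedTokenizerBase()))  # -> False
```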
How to enable packing for text-only training
If your dataset truly has no images and you want packing, you can usually force TRL into text-only mode by passing the tokenizer as processing_class:
trainer = SFTTrainer(
    model=model,
    train_dataset=ds,
    processing_class=tokenizer,  # <- makes TRL treat it as text (PreTrainedTokenizerBase)
    peft_config=peft_config,
    args=SFTConfig(packing=True, ...),
)
This works because TRL decides "VLM vs LLM" primarily from the type of processing_class (ProcessorMixin => VLM; TokenizerBase => not VLM). (GitHub)
Script audit: other issues / improvements
1) Your comment says "create string without tokenizing" but you used tokenize=True
That returns tokenized outputs (dict-like), not a string. Use tokenize=False for strings. (Hugging Face)
2) Don't use continue_final_message=True for standard supervised chat examples
continue_final_message=True explicitly formats the last message as open-ended, without EOS tokens (it is meant for prefilling a partial assistant response). For typical SFT rows where the assistant answer is complete, it's better left off. (Hugging Face)
3) You imported the wrong collator and didn't import the one you use
You imported DataCollatorForSeq2Seq but instantiate DataCollatorForLanguageModeling. In most TRL SFT cases you can simply omit data_collator and let TRL choose (it picks an LM collator vs VLM collator automatically). (Hugging Face)
4) Manual freezing is usually unnecessary when using PEFT via peft_config=...
SFTTrainer will wrap your model with get_peft_model() when peft_config is provided. (GitHub)
If you do want "only LoRA trains", verify trainable params after trainer init (PEFT provides helpers), but avoid relying on name-prefix heuristics unless you're sure about module names.
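A small helper for that check; it works on any PyTorch-style model exposing `named_parameters()`, including the PEFT-wrapped model you get as `trainer.model` after `SFTTrainer` init. This is a sketch, equivalent in spirit to PEFT's `print_trainable_parameters()`.

```python
def summarize_trainable(model):
    """Count trainable vs total parameters for a PyTorch-style model.

    Works with anything exposing named_parameters() yielding (name, param)
    pairs where param has .requires_grad and .numel().
    """
    trainable = total = 0
    for _, p in model.named_parameters():
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total

# After trainer init you would check, e.g.:
# trainable, total = summarize_trainable(trainer.model)
# print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

If LoRA is the only thing training, the trainable fraction should be well under a percent for a 14B model.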
5) Prefer processing_class=tokenizer for this family right now
There are known cases where AutoProcessor / AutoTokenizer paths behave differently from MistralCommonBackend for Ministral tokenization; people have reported cleaner behavior using MistralCommonBackend directly. (GitHub)
Similar/related issues and good references (with context)
- TRL SFTTrainer docs (dataset formats, packing, assistant-only loss, signature of formatting_func) (Hugging Face)
  Use this to align your dataset shape ("text" vs "messages" vs "prompt/completion") and understand when packing is appropriate.
- TRL implementation showing why your two errors occur (VLM detection, packing restriction, and the .endswith(eos) step) (GitHub)
  This is the exact logic behind both stack traces.
- Transformers tokenizer apply_chat_template contract (string vs tokenized output; continue_final_message meaning) (Hugging Face)
  This explains why tokenize=True gave you a dict-like object.
- Transformers issue involving Ministral tokenization/processor differences (AutoProcessor/AutoTokenizer vs MistralCommonBackend) (GitHub)
  Relevant because TRL defaults to AutoProcessor unless you override processing_class.
- TRL issue about VLM handling (context: VLMs and trainer expectations differ from pure CausalLM workflows) (GitHub)
- Model background: Mistral 3 family adds vision (Hugging Face)
  Explains why tooling often classifies these checkpoints as VLMs even when you feed text only.
If you apply Option A/B/C and also pass processing_class=tokenizer, you should eliminate both the EOS crash and the packing error (assuming your dataset has no image columns).