Short answer: if you can use a nightly build of vLLM (paired with a new-enough Transformers), you can likely bypass the restriction. Details below.
1) What your endpoint error actually means
Your failure happens before any real “GPU sizing / memory” problems come into play.
vLLM is trying to build a ModelConfig by asking Transformers to interpret the model’s config.json. That config declares a new architecture identifier:
model_type: "qwen3_5" (in the official checkpoint’s config)
If the Transformers version inside the Inference Endpoint container doesn’t know what qwen3_5 is, AutoConfig can’t map it to a model class, and vLLM aborts with exactly the validation error you see.
That’s consistent with Hugging Face’s own Inference Endpoints runtime docs showing Transformers 4.48.0 in at least some images—well before Qwen3.5 support existed upstream. (Hugging Face)
So yes: your hypothesis is very plausible—the container image you’re on is shipping a Transformers build that predates Qwen3.5 support.
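You can verify this hypothesis directly inside the container before touching anything else. A minimal sketch, assuming only that some version of Transformers is importable; CONFIG_MAPPING is the internal registry AutoConfig consults to map model_type to a config class, so treat this as a diagnostic probe, not a public API:

```python
# Diagnostic sketch: does the installed Transformers build know a given
# model_type? "qwen3_5" is the architecture id from the checkpoint's
# config.json as discussed above.
def supports_model_type(model_type: str) -> bool:
    try:
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING
    except ImportError:
        # Transformers is not installed in this environment at all.
        return False
    return model_type in CONFIG_MAPPING

# On an image with a pre-Qwen3.5 Transformers, this prints False, which
# explains the ModelConfig validation failure before GPU sizing matters.
print(supports_model_type("qwen3_5"))
```

If this prints False, no amount of GPU resizing will help; the image's Transformers simply cannot resolve the architecture.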
2) Why this is happening “now”: Qwen3.5 support landed extremely recently
Two key timeline facts:
- The Transformers PR that adds Qwen3.5 support (“Adding Support for Qwen3.5”) was merged on Feb 9, 2026. (GitHub)
- The Transformers release notes show Qwen3.5 support called out in v5.2.0 (Feb 2026) and mention installing with
pip install transformers --pre for the v5 release candidates. (GitHub)
In other words, Qwen3.5 is in the awkward window where:
- the official weights are published, but
- many serving images are still pinned to older Transformers builds, and
- even if you upgrade Transformers, your serving stack may have version constraints (see below).
3) Why “just upgrade Transformers” is not always trivial with vLLM (especially in managed containers)
In managed environments, you typically inherit whatever the image pins. vLLM has historically pinned Transformers in ways that can lag brand-new architectures; there are recent vLLM issues where models require Transformers versions that aren’t compatible with vLLM’s current constraints. (GitHub)
Also, vLLM itself is in flux for Qwen3.5: the vLLM team’s own Qwen3.5 recipe says to use vLLM nightly “until 0.17.0 is released,” which is a strong signal that stable releases may not yet cover all Qwen3.5 edges. (vLLM)
So on Inference Endpoints, unless you can:
- switch to an image that already includes the needed Transformers commit, or
- install Transformers-from-source inside the container, or
- bring a custom container,
…you can get stuck exactly where you are.
4) Is Unsloth’s GGUF a viable alternative for your use case (post-editing / rewriting)? Yes—with specific caveats
4.1 What is actually different between “official” and “Unsloth GGUF”
For your text-only post-editing workflow, the meaningful differences are usually:
- Weight format + quantization
- Official repo: typically BF16/FP16 weights loaded via Transformers/vLLM.
- GGUF repo: weights converted for llama.cpp, usually quantized (Q4/Q5/Q6/Q8, plus “UD-” variants).
Unsloth’s Qwen3.5-27B-GGUF repo explicitly provides multiple quantizations (e.g., Q4_K_M, Q5_K_M, Q6_K, Q8_0, plus UD variants). (Hugging Face)
- Inference engine
- Official on vLLM: GPU-first serving, high throughput under concurrency.
- GGUF on llama.cpp: optimized for portability and efficiency, often excellent on single node / smaller GPUs / CPU-offload.
- Multimodal handling (only matters if you use vision)
Unsloth includes an mmproj file (projection weights for multimodal in llama.cpp) alongside the GGUFs. (Hugging Face)
If you’re purely doing text post-editing, you can ignore multimodal.
4.2 The caveat that matters most for post-editing: quantization can change “style obedience”
Your task (“rewrite the given translation into a specified style, obey vocabulary/glossary rules”) is sensitive to small model-quality regressions. Quantization can:
- slightly reduce instruction fidelity,
- increase minor wording drift,
- weaken consistency on strict terminology.
Practical implication: if you stay on GGUF, prefer higher-quality quants:
- Q8_0 (highest fidelity, largest)
- Q6_K (often a strong quality/size trade)
- be cautious with Q4 variants if your style guide is strict.
(You don’t need to guess—run an A/B test on your real post-edit set; see §7.)
4.3 “Thinking mode” / verbosity differences can bite text-editing pipelines
Some Qwen3.5 builds expose “thinking vs non-thinking” behavior. If your pipeline expects only the final rewritten text, you must ensure the runtime isn’t emitting internal reasoning or long “thinking” blocks.
Unsloth’s llama.cpp instructions show using --chat-template-kwargs "{\"enable_thinking\": false}" for Qwen3.5. (Unsloth)
There are also community reports of “still thinking” behavior in some setups, so validate your exact llama.cpp build + template behavior early. (Hugging Face)
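A cheap guard for that validation: assert on every response that no reasoning block leaked. A sketch, assuming the Qwen-family convention of wrapping reasoning in <think>…</think> tags; adjust the pattern to whatever your llama.cpp build and template actually emit:

```python
import re

# Fail fast if internal reasoning leaks into the final text. The
# <think>...</think> tag pair is an assumption based on Qwen-family chat
# templates; verify against your runtime's real output first.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def assert_clean_output(text: str) -> str:
    """Raise if a reasoning block (or a dangling open tag) is present."""
    if THINK_BLOCK.search(text) or "<think>" in text:
        raise ValueError("thinking block leaked into output")
    return text.strip()

print(assert_clean_output("Final rewritten sentence."))
```

Wire this into a one-request unit test that runs at deploy time, so a template regression is caught before it corrupts a batch.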
5) vLLM vs llama.cpp for post-editing: what differs in practice
Here’s the decision in the dimensions that matter for translation post-editing.
| Dimension | vLLM (Transformers weights) | llama.cpp (GGUF) |
| --- | --- | --- |
| Output quality ceiling | Highest (BF16/FP16, minimal approximation) | Depends on quant (Q8 ≈ close; Q4 can be noticeable) (Hugging Face) |
| Throughput under concurrency | Typically excellent (batched serving) | Often less optimized for many concurrent users (still workable) (Red Hat Developer) |
| Operational friction today | Blocked for you due to missing qwen3_5 support in the image | Works now (you already deployed) (Hugging Face) |
| Long-context performance | Often better once fully supported | Can be limited/slow depending on arch support; Qwen3.5 uses hybrid/linear-attention mechanisms that may have performance cliffs (qwen.readthedocs.io) |
| Best use case | Production, high volume, multi-tenant, GPU-rich | "Get it running" deployments, smaller GPU budgets, portable inference, quick iteration (Red Hat Developer) |
Two extra notes:
- Qwen’s own materials explicitly mention using frameworks like vLLM for deployment and also note llama.cpp supports Qwen3.5 (text & vision) and to look for GGUF models. (GitHub)
- vLLM’s own Qwen3.5 recipe currently recommends nightly builds, indicating rapid stabilization. (vLLM)
6) So… should you continue with Unsloth GGUF or wait for official vLLM?
You’re likely OK to continue with Unsloth if:
- you can use Q6_K or Q8_0, and
- you verify that “thinking” is disabled and outputs are clean, and
- your workload is not heavily multi-tenant/high-concurrency right now.
For many post-editing pipelines, that’s enough to be production-viable.
You should push for official (or a custom vLLM image) if you need:
- maximum style fidelity (especially strict terminology/glossary adherence),
- high concurrency throughput,
- very long inputs (large context windows) with predictable speed,
- or multimodal features in a standardized serving stack.
Concrete path (if you control the container):
- Use a vLLM build aligned with the Qwen3.5 guidance (nightly per vLLM’s recipe). (vLLM)
- Use a Transformers build that includes Qwen3.5 support (post-merge PR / v5.2.0+). (GitHub)
In fully managed images, this typically means “switch image” or “custom container,” not just toggling GPU size.
7) How to evaluate “major differences” for your exact post-editing task (fast, reliable)
Run a small bake-off that reflects your real constraints.
7.1 Build a test set (30–200 items)
For each item keep:
- Source text
- Machine translation
- Style guide excerpt + glossary/forbidden terms
- Gold post-edit (if you have it) or at least a human rating rubric
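As a sketch, one test item can be a small record like this; the field names mirror the list above and are illustrative, not tied to any eval framework:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# One bake-off item for the post-editing test set. Field names are
# illustrative assumptions, not an existing schema.
@dataclass
class PostEditItem:
    source: str                      # original source text
    mt: str                          # machine translation to be post-edited
    style_excerpt: str               # relevant style-guide rules for this item
    glossary: Dict[str, str] = field(default_factory=dict)  # source term -> required target term
    forbidden: List[str] = field(default_factory=list)      # terms that must never appear
    gold: Optional[str] = None       # gold post-edit, if available
```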
7.2 Compare these conditions
- Official (when possible): BF16/FP16 on vLLM
- GGUF: Q8_0 and Q6_K (optionally Q5_K_M)
7.3 Measure what matters for post-editing
- Terminology accuracy (glossary terms always used, forbidden terms never used)
- Meaning preservation (human check or targeted heuristics)
- Style compliance (human rubric, or pattern checks if style is formalized)
- Editing stability (does it rewrite only what’s needed vs over-edit?)
If Q6/Q8 GGUF matches your thresholds, you have a defensible “good enough now” solution.
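Of the metrics above, the terminology check is the easiest to automate. A deliberately naive sketch (lowercase substring matching only; adapt tokenization and casing rules to your language pair):

```python
# Count glossary hits and forbidden-term violations in one candidate
# post-edit. Substring matching is naive by design: a real harness should
# handle inflection, word boundaries, and casing for the target language.
def terminology_score(candidate: str, required: list, forbidden: list) -> dict:
    text = candidate.lower()
    missing = [t for t in required if t.lower() not in text]
    violations = [t for t in forbidden if t.lower() in text]
    return {
        "required_coverage": 1.0 - len(missing) / max(len(required), 1),
        "missing": missing,
        "violations": violations,
        "passes": not missing and not violations,
    }

print(terminology_score("The API gateway routes requests.",
                        required=["API gateway"], forbidden=["proxy"]))
```

Run this per condition (BF16 vs Q8_0 vs Q6_K) over the whole set and compare pass rates; that is usually enough to decide whether a quant clears your bar.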
8) Practical recommendations if you stick with GGUF (llama.cpp)
- Prefer Q6_K / Q8_0 for style-sensitive rewriting. (Hugging Face)
- Force non-thinking outputs via the template kwargs and validate with a unit test (one request, assert no hidden reasoning text). (Unsloth)
- Use conservative decoding for post-editing:
- temperature ~0–0.3 (or equivalent), avoid overly creative sampling
- keep repetition penalties modest (post-editing often needs repeated terminology)
- Keep prompts “post-editor shaped”:
- explicitly: “Rewrite ONLY the provided translation; do not add content; preserve meaning; follow glossary; output final text only.”
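Put together, a request reflecting those settings might look like this. It assumes llama.cpp's llama-server with its OpenAI-compatible /v1/chat/completions endpoint; the prompt wording and exact parameter values are illustrative, not a tested recipe:

```python
# Build a conservative OpenAI-style chat payload for post-editing.
# Values are illustrative assumptions per the guidance above.
def build_postedit_request(model: str, translation: str, glossary_note: str) -> dict:
    system = (
        "Rewrite ONLY the provided translation; do not add content; "
        "preserve meaning; follow the glossary; output final text only.\n"
        + glossary_note
    )
    return {
        "model": model,
        "temperature": 0.2,        # near-deterministic: rewriting, not creative sampling
        "top_p": 0.9,
        "frequency_penalty": 0.0,  # no repetition penalty: terminology repeats legitimately
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": translation},
        ],
    }
```

POST this dict as JSON to your server's /v1/chat/completions and feed the response through your thinking-block guard before accepting it.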
Bottom line
- Your Inference Endpoints failure is consistent with a Transformers-too-old situation for the new qwen3_5 architecture. (Hugging Face)
- Unsloth GGUF + llama.cpp is a viable path for machine-translation post-editing, with the main tradeoff being quantization + template/runtime behavior, not a fundamentally different model. (Hugging Face)
- If you need maximum fidelity + high-throughput serving, aim for a vLLM stack aligned with the now-landed upstream support (Transformers v5.2.0+/post-merge and vLLM builds that match Qwen3.5 guidance). (GitHub)