Seeking Advice: Qwen3.5-27B failing on Inference Endpoints — is Unsloth GGUF a viable alternative for text editing?

Hi Everyone,
I’m interested in the Qwen3.5-27B model for machine translation post-editing, which would basically involve rewriting pre-provided machine translations into a specified style, following specific style and vocabulary guidelines. I tried an online deployment on Hugging Face Inference Endpoints using the official Qwen3.5-27B repo and a number of different GPU configurations, including the recommended A100 2xGPU 160GB setup. No matter what configuration I tried, I hit the same error:

Endpoint failed to start | Check Logs

Exit code: 1. Reason:

  vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1369, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1223, in create_model_config
    return ModelConfig(
  File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
Value error, The checkpoint you are trying to load has model type `qwen3_5` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git` [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
For further information visit https://errors.pydantic.dev/2.12/v/value_error

Based on the error, could it be that the vLLM container image used by Inference Endpoints ships with a version of Transformers that doesn’t yet support qwen3_5? While looking for alternatives, I came across the Unsloth Qwen3.5-27B-GGUF repo, which runs on llama.cpp and deployed just fine.

My question is whether there are any major differences between the official repo and the Unsloth repo for my use case, or whether there’s any significant difference between vLLM and llama.cpp here. I’m trying to work out whether I’d be OK to continue with the Unsloth repo or whether I should wait until the official repo can deploy.


If you can use the nightly build of vLLM, you might be able to get around this restriction.


1) What your endpoint error actually means

Your failure happens before any real “GPU sizing / memory” problems come into play.

vLLM is trying to build a ModelConfig by asking Hugging Face Transformers to interpret the model’s config.json. That config declares a new architecture identifier:

  • model_type: "qwen3_5" (in the official checkpoint’s config)

If the Transformers version inside the Inference Endpoint container doesn’t know what qwen3_5 is, AutoConfig can’t map it to a model class, and vLLM aborts with exactly the validation error you see.

That’s consistent with Hugging Face’s own Inference Endpoints runtime docs showing Transformers 4.48.0 in at least some images—well before Qwen3.5 support existed upstream. (Hugging Face)

So yes: your hypothesis is very plausible—the container image you’re on is shipping a Transformers build that predates Qwen3.5 support.
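You can confirm this from inside the container without loading any weights: a minimal sketch that just compares the installed Transformers version against a threshold (assuming, per the release notes, that v5.2.0 is the first stable release recognizing `qwen3_5`; the helper names are my own):

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: v5.2.0 is the first stable Transformers release that
# recognizes model_type "qwen3_5" (per the Transformers release notes).
MIN_QWEN3_5 = (5, 2, 0)

def parse_version(v: str) -> tuple:
    """Parse the leading numeric parts of a version string into a comparable tuple."""
    parts = []
    for p in v.split(".")[:3]:
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def transformers_supports_qwen3_5() -> bool:
    """True if the installed Transformers build is new enough for qwen3_5."""
    try:
        return parse_version(version("transformers")) >= MIN_QWEN3_5
    except PackageNotFoundError:
        return False

# The 4.48.0 build documented for some Inference Endpoints images fails this check:
print(parse_version("4.48.0") >= MIN_QWEN3_5)  # → False
```

If this prints False in your deployment environment, the ValidationError above is expected regardless of GPU configuration.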


2) Why this is happening “now”: Qwen3.5 support landed extremely recently

Two key timeline facts:

  • The Transformers PR that adds Qwen3.5 support (“Adding Support for Qwen3.5”) was merged on Feb 9, 2026. (GitHub)
  • The Transformers release notes show Qwen3.5 support called out in v5.2.0 (Feb 2026) and mention installing with pip install transformers --pre for the v5 release candidates. (GitHub)

In other words, Qwen3.5 is in the awkward window where:

  • the official weights are published, but
  • many serving images are still pinned to older Transformers builds, and
  • even if you upgrade Transformers, your serving stack may have version constraints (see below).

3) Why “just upgrade Transformers” is not always trivial with vLLM (especially in managed containers)

In managed environments, you typically inherit whatever the image pins. vLLM has historically pinned Transformers in ways that can lag brand-new architectures; there are recent vLLM issues where models require Transformers versions that aren’t compatible with vLLM’s current constraints. (GitHub)

Also, vLLM itself is in flux for Qwen3.5: the vLLM team’s own Qwen3.5 recipe says to use vLLM nightly “until 0.17.0 is released,” which is a strong signal that stable releases may not yet cover all Qwen3.5 edges. (vLLM)

So on Inference Endpoints, unless you can:

  • switch to an image that already includes the needed Transformers commit, or
  • install Transformers-from-source inside the container, or
  • bring a custom container,

…you can get stuck exactly where you are.
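If you do control the image, the unblock is usually just two installs. A sketch only, assuming a custom container you can modify (the nightly wheel index is the one vLLM documents; verify both commands against the current vLLM Qwen3.5 recipe before relying on them):

```shell
# Custom-container sketch: bring Transformers and vLLM forward to builds
# that know about qwen3_5. Pin exact versions once a working pair is found.
pip install --upgrade --pre transformers
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
```

On fully managed Inference Endpoints images, neither command is available to you, which is the crux of the problem.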


4) Is Unsloth’s GGUF a viable alternative for your use case (post-editing / rewriting)? Yes—with specific caveats

4.1 What is actually different between “official” and “Unsloth GGUF”

For your text-only post-editing workflow, the meaningful differences are usually:

  1. Weight format + quantization
     • Official repo: typically BF16/FP16 weights loaded via Transformers/vLLM.
     • GGUF repo: weights converted for llama.cpp, usually quantized (Q4/Q5/Q6/Q8, plus “UD-” variants).

Unsloth’s Qwen3.5-27B-GGUF repo explicitly provides multiple quantizations (e.g., Q4_K_M, Q5_K_M, Q6_K, Q8_0, plus UD variants). (Hugging Face)

  2. Inference engine
     • Official on vLLM: GPU-first serving, high throughput under concurrency.
     • GGUF on llama.cpp: optimized for portability and efficiency, often excellent on a single node, smaller GPUs, or with CPU offload.

  3. Multimodal handling (only matters if you use vision)
     Unsloth includes an mmproj file (projection weights for multimodal in llama.cpp) alongside the GGUFs. (Hugging Face)
     If you’re purely doing text post-editing, you can ignore multimodal.

4.2 The caveat that matters most for post-editing: quantization can change “style obedience”

Your task (“rewrite the given translation into a specified style, obey vocabulary/glossary rules”) is sensitive to small model-quality regressions. Quantization can:

  • slightly reduce instruction fidelity,
  • increase minor wording drift,
  • weaken consistency on strict terminology.

Practical implication: if you stay on GGUF, prefer higher-quality quants:

  • Q8_0 (highest fidelity, largest)
  • Q6_K (often a strong quality/size trade)
  • be cautious with Q4 variants if your style guide is strict.

(You don’t need to guess—run an A/B test on your real post-edit set; see §7.)

4.3 “Thinking mode” / verbosity differences can bite text-editing pipelines

Some Qwen3.5 builds expose “thinking vs non-thinking” behavior. If your pipeline expects only the final rewritten text, you must ensure the runtime isn’t emitting internal reasoning or long “thinking” blocks.

Unsloth’s llama.cpp instructions show using --chat-template-kwargs "{\"enable_thinking\": false}" for Qwen3.5. (Unsloth)
There are also community reports of “still thinking” behavior in some setups, so validate your exact llama.cpp build + template behavior early. (Hugging Face)
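As a concrete starting point, a llama-server launch along these lines disables thinking at the template level. A sketch only: the model filename, context size, and port are illustrative, while `--chat-template-kwargs` is the flag shown in Unsloth’s instructions:

```shell
# Illustrative launch; substitute your actual GGUF path and sizing.
./llama-server -m Qwen3.5-27B-Q6_K.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -c 16384 --port 8080
```

Whatever invocation you settle on, add a smoke test that sends one request and asserts the response contains no thinking/reasoning block before wiring the endpoint into your pipeline.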


5) vLLM vs llama.cpp for post-editing: what differs in practice

Here’s the decision in the dimensions that matter for translation post-editing.

| Dimension | vLLM (Transformers weights) | llama.cpp (GGUF) |
| --- | --- | --- |
| Output quality ceiling | Highest (BF16/FP16, minimal approximation) | Depends on quant (Q8 ≈ close; Q4 can be noticeable) (Hugging Face) |
| Throughput under concurrency | Typically excellent (batched serving) | Often less optimized for many concurrent users (still workable) (Red Hat Developer) |
| Operational friction today | Blocked for you due to missing qwen3_5 support in the image | Works now (you already deployed) (Hugging Face) |
| Long-context performance | Often better once fully supported | Can be limited/slow depending on arch support; Qwen3.5 uses hybrid/linear-attention mechanisms that may have performance cliffs (qwen.readthedocs.io) |
| Best use case | Production, high volume, multi-tenant, GPU-rich | “Get it running” deployments, smaller GPU budgets, portable inference, quick iteration (Red Hat Developer) |

Two extra notes:

  • Qwen’s own materials explicitly mention using frameworks like vLLM for deployment and also note llama.cpp supports Qwen3.5 (text & vision) and to look for GGUF models. (GitHub)
  • vLLM’s own Qwen3.5 recipe currently recommends nightly builds, indicating rapid stabilization. (vLLM)

6) So… should you continue with Unsloth GGUF or wait for official vLLM?

You’re likely OK to continue with Unsloth if:

  • you can use Q6_K or Q8_0, and
  • you verify that “thinking” is disabled and outputs are clean, and
  • your workload is not heavily multi-tenant/high-concurrency right now.

For many post-editing pipelines, that’s enough to be production-viable.

You should push for official (or a custom vLLM image) if you need:

  • maximum style fidelity (especially strict terminology/glossary adherence),
  • high concurrency throughput,
  • very long inputs (large context windows) with predictable speed,
  • or multimodal features in a standardized serving stack.

Concrete path (if you control the container):

  • Use a vLLM build aligned with the Qwen3.5 guidance (nightly per vLLM’s recipe). (vLLM)
  • Use a Transformers build that includes Qwen3.5 support (post-merge PR / v5.2.0+). (GitHub)
    In fully managed images, this typically means “switch image” or “custom container,” not just toggling GPU size.

7) How to evaluate “major differences” for your exact post-editing task (fast, reliable)

Run a small bake-off that reflects your real constraints.

7.1 Build a test set (30–200 items)

For each item keep:

  • Source text
  • Machine translation
  • Style guide excerpt + glossary/forbidden terms
  • Gold post-edit (if you have it) or at least a human rating rubric
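The item structure above can be sketched as a small record type (all field names here are my own, purely illustrative):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PostEditItem:
    """One evaluation item for the post-editing bake-off (illustrative schema)."""
    source: str                       # source text
    machine_translation: str          # MT output to be post-edited
    style_excerpt: str                # relevant style-guide excerpt
    glossary: list = field(default_factory=list)         # required terms
    forbidden_terms: list = field(default_factory=list)  # banned terms
    gold_postedit: Optional[str] = None                  # gold reference, if available

item = PostEditItem(
    source="Der Motor läuft.",
    machine_translation="The engine is running.",
    style_excerpt="Use present simple for status descriptions.",
    glossary=["engine"],
)
```

Serializing a list of these to JSONL makes it easy to replay the same set against each deployment condition.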

7.2 Compare these conditions

  • Official (when possible): BF16/FP16 on vLLM
  • GGUF: Q8_0 and Q6_K (optionally Q5_K_M)

7.3 Measure what matters for post-editing

  • Terminology accuracy (glossary terms always used, forbidden terms never used)
  • Meaning preservation (human check or targeted heuristics)
  • Style compliance (human rubric, or pattern checks if style is formalized)
  • Editing stability (does it rewrite only what’s needed vs over-edit?)

If Q6/Q8 GGUF matches your thresholds, you have a defensible “good enough now” solution.
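The terminology check in particular is easy to automate. A minimal sketch using naive case-insensitive substring matching (real glossaries may need word-boundary regexes or lemmatization, especially for morphologically rich target languages):

```python
def check_terms(output: str, required: list, forbidden: list) -> dict:
    """Flag missing glossary terms and forbidden-term hits in one model output."""
    low = output.lower()
    missing = [t for t in required if t.lower() not in low]
    hits = [t for t in forbidden if t.lower() in low]
    return {"missing": missing, "forbidden_hits": hits,
            "pass": not missing and not hits}

result = check_terms(
    "The gearbox housing is sealed.",
    required=["gearbox"],
    forbidden=["transmission"],
)
print(result["pass"])  # → True
```

Aggregating the pass rate per condition (BF16 vs Q8_0 vs Q6_K) gives you a single number to compare against your quality threshold.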


8) Practical recommendations if you stick with GGUF (llama.cpp)

  1. Prefer Q6_K / Q8_0 for style-sensitive rewriting. (Hugging Face)
  2. Force non-thinking outputs via the template kwargs and validate with a unit test (one request, assert no hidden reasoning text). (Unsloth)
  3. Use conservative decoding for post-editing:
     • temperature ~0–0.3 (or equivalent); avoid overly creative sampling
     • keep repetition penalties modest (post-editing often needs repeated terminology)
  4. Keep prompts “post-editor shaped”:
     • explicitly: “Rewrite ONLY the provided translation; do not add content; preserve meaning; follow glossary; output final text only.”
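Putting the decoding and prompt advice together, a request builder for llama-server’s OpenAI-compatible /v1/chat/completions endpoint might look like this. A sketch under assumptions: the prompt wording is illustrative, and `repeat_penalty` is a llama.cpp extension field, so check which sampling parameters your server build actually accepts:

```python
def build_postedit_request(source: str, mt: str, glossary: list, style_rules: str) -> dict:
    """Assemble a chat-completions payload for translation post-editing."""
    system = (
        "You are a translation post-editor. Rewrite ONLY the provided "
        "translation; do not add content; preserve meaning; follow the "
        "glossary; output the final text only."
    )
    user = (
        f"Style rules:\n{style_rules}\n\n"
        f"Glossary (must use): {', '.join(glossary)}\n\n"
        f"Source:\n{source}\n\n"
        f"Machine translation to post-edit:\n{mt}"
    )
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.2,      # conservative: faithful rewriting, not creativity
        "repeat_penalty": 1.05,  # modest: terminology legitimately repeats
    }

payload = build_postedit_request(
    "Der Motor läuft.", "The engine are running.",
    glossary=["engine"], style_rules="Use present simple.",
)
```

POST this payload as JSON to your server and the same builder works unchanged if you later move to a vLLM OpenAI-compatible endpoint (minus the llama.cpp-specific field).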

Bottom line

  • Your Inference Endpoints failure is consistent with a Transformers-too-old situation for the new qwen3_5 architecture. (Hugging Face)
  • Unsloth GGUF + llama.cpp is a viable path for machine-translation post-editing, with the main tradeoff being quantization + template/runtime behavior, not a fundamentally different model. (Hugging Face)
  • If you need maximum fidelity + high-throughput serving, aim for a vLLM stack aligned with the now-landed upstream support (Transformers v5.2.0+/post-merge and vLLM builds that match Qwen3.5 guidance). (GitHub)

Thanks for the advice. I tried the nightly build on a dedicated Inference Endpoint using 1x NVIDIA H200 141GB. It seemed to be working, though I gave up after an hour of the endpoint sitting at Initializing with a steady 131GB of VRAM consumption. It seems like sticking with llama.cpp and GGUF might be the best option for my use case, plus a simpler setup, at least until Transformers is updated in Inference Endpoints.
