I think you're mismatching the model format you're loading with the software. Basically, just think of GGUF as for llama.cpp or Ollama, and everything else as for Transformers.
While it is indeed possible to load GGUF with Transformers, this is only for extremely specialized use cases and offers no particular benefit for general use…
Use non-GGUF model weights with Transformers, and if VRAM is insufficient, employ quantization like bitsandbytes or TorchAO.
What model_type is (and why you're seeing this error)
In Transformers, AutoModel.from_pretrained(...) first loads a config.json, then uses config["model_type"] to choose which model class to instantiate (e.g., LlamaForCausalLM, Qwen2ForCausalLM, etc.). If there is no config (or no model_type), you get the exact error you saw. (Hugging Face)
So you generally do not "specify model_type in Python code"; you provide it via the model's config.json.
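To make the dispatch concrete, here is a simplified stand-in for what AutoModel does internally (the mapping dict and error messages below are illustrative, not Transformers' actual implementation):

```python
import json
import os
import tempfile

# Illustrative stand-in for Transformers' config-based dispatch:
# AutoModel reads config.json and maps model_type -> a model class.
MODEL_TYPE_TO_CLASS = {
    "llama": "LlamaForCausalLM",
    "qwen2": "Qwen2ForCausalLM",
    "qwen3_vl": "Qwen3VLForConditionalGeneration",
}

def resolve_model_class(model_dir: str) -> str:
    config_path = os.path.join(model_dir, "config.json")
    if not os.path.exists(config_path):
        # This is the situation with a GGUF-only repo: no config.json at all.
        raise ValueError("no config.json found in checkpoint")
    with open(config_path) as f:
        config = json.load(f)
    model_type = config.get("model_type")
    if model_type not in MODEL_TYPE_TO_CLASS:
        raise ValueError(f"unrecognized model_type: {model_type!r}")
    return MODEL_TYPE_TO_CLASS[model_type]

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "config.json"), "w") as f:
        json.dump({"model_type": "qwen3_vl"}, f)
    print(resolve_model_class(d))  # -> Qwen3VLForConditionalGeneration
```

The point: the class choice comes from the file on the Hub, not from any argument you pass in your script.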
Why Qwen/Qwen3-VL-8B-Instruct-GGUF fails with AutoModel
That repository is a GGUF distribution, not a standard Transformers checkpoint repo.
The model card says it contains GGUF weights split into two components:
- Language model (LLM) GGUF
- Vision encoder (mmproj) GGUF
…and it's intended for llama.cpp / other GGUF-based tools. (Hugging Face)
Because of that layout, Transformers' normal AutoModel path (expecting a Transformers-style config + weights) doesn't apply.
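You can see the difference just by looking at the repo's file listing. A rough sketch of the distinction (the filenames below are illustrative, not the exact files in either repo):

```python
# Illustrative only: typical filename patterns, not an exhaustive check.
def detect_repo_layout(filenames):
    """Guess whether a repo is a Transformers checkpoint or a GGUF package."""
    has_config = "config.json" in filenames
    has_weights = any(f.endswith((".safetensors", ".bin")) for f in filenames)
    has_gguf = any(f.endswith(".gguf") for f in filenames)
    if has_config and has_weights:
        return "transformers"
    if has_gguf:
        return "gguf"
    return "unknown"

# A GGUF distribution ships .gguf files (LLM + mmproj) and no
# Transformers-style config.json + safetensors pair:
print(detect_repo_layout(
    ["Qwen3VL-8B-Instruct-Q8_0.gguf", "mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf"]
))  # -> gguf
print(detect_repo_layout(
    ["config.json", "model-00001-of-00004.safetensors"]
))  # -> transformers
```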
Even if you "add a config.json", it still won't do what you want
Transformers does support loading some GGUF models, but:
- Transformers GGUF loading is for training/finetuning workflows: the docs explicitly state that the GGUF checkpoint is dequantized to fp32, so it becomes standard PyTorch weights. (Hugging Face)
- It only works for a specific set of architectures (examples listed include Llama, Mistral, Qwen2, etc.), and you must pass gguf_file=... to from_pretrained(). (Hugging Face)
- Qwen3-VL GGUF commonly hits "architecture … not supported yet" in Transformers/vLLM paths (end-user reports exist for qwen3vl and qwen3vlmoe). (Hugging Face)
Also, Transformers' GGUF loader expects a single GGUF file in the usual text-model sense; Qwen3-VL GGUF needs both the LLM and mmproj files to function as a multimodal model. (Hugging Face)
So: creating or overriding model_type won't turn this GGUF repo into a Transformers-loadable VL model.
What the model_type would be for Qwen3-VL (for reference)
In the Transformers-native Qwen3-VL repo (Qwen/Qwen3-VL-8B-Instruct), the config contains:
"model_type": "qwen3_vl"
"architectures": ["Qwen3VLForConditionalGeneration"] (Hugging Face)
That's the correct model_type for Qwen3-VL in Transformers.
But again: that's for the non-GGUF checkpoint repo, not the GGUF repo.
Correct solutions (pick the one that matches your goal)
Option A) Use Qwen3-VL in Transformers (recommended if you want Transformers APIs)
Use the non-GGUF weights repo and the Qwen3-VL class. The model card explicitly recommends installing Transformers from source for Qwen3-VL support and shows the working code. (Hugging Face)
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
If your reason for GGUF was low VRAM, do that on the Transformers-native checkpoint using Transformers quantization (bitsandbytes/AWQ/GPTQ) + flash_attention_2, not GGUF-in-Transformers. (Qwen's model card explicitly calls out flash_attention_2 for memory saving.) (Hugging Face)
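Rough back-of-envelope arithmetic shows why quantizing the Transformers-native checkpoint achieves the same goal as GGUF (weights only; activations and KV cache add more on top):

```python
# Approximate weight memory for an 8B-parameter model at common precisions.
# bitsandbytes int8 stores ~1 byte/param; its nf4 4-bit format ~0.5 byte/param.
params = 8e9

bytes_per_param = {
    "fp16/bf16": 2.0,
    "int8 (bitsandbytes)": 1.0,
    "nf4 (bitsandbytes 4-bit)": 0.5,
}

for name, b in bytes_per_param.items():
    print(f"{name}: ~{params * b / 1e9:.0f} GB")
# fp16/bf16: ~16 GB
# int8 (bitsandbytes): ~8 GB
# nf4 (bitsandbytes 4-bit): ~4 GB
```

So 8-bit or 4-bit quantization in Transformers gets you into the same VRAM ballpark as a Q8_0 or Q4 GGUF, while keeping the standard Transformers API.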
Option B) Use the GGUF Q8_0 you already have (recommended if you specifically want Q8_0.gguf)
Run it with llama.cpp (or compatible tools), providing both the LLM GGUF and the mmproj GGUF. The GGUF model card provides the intended usage and explains the split. (Hugging Face)
This is the GGUF-native approach that actually preserves GGUF's practical benefits.
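For reference, the invocation looks roughly like this (a sketch assuming llama.cpp's multimodal CLI, llama-mtmd-cli, is built; the GGUF filenames are illustrative placeholders for the files you actually downloaded):

```shell
# Multimodal inference with llama.cpp: both GGUFs are required.
# -m       -> the language-model GGUF
# --mmproj -> the vision-encoder (mmproj) GGUF
llama-mtmd-cli \
  -m Qwen3-VL-8B-Instruct-Q8_0.gguf \
  --mmproj mmproj-Qwen3-VL-8B-Instruct-Q8_0.gguf \
  --image photo.jpg \
  -p "Describe this image."
```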
Option C) If you want to learn Transformers GGUF loading (but not for Qwen3-VL)
Transformers GGUF loading is done via gguf_file=..., and only for supported model families. The official docs include a working example with AutoModelForCausalLM + gguf_file. (Hugging Face)
This is useful for supported text GGUF models, but it does not solve Qwen3-VL GGUF today for the reasons above.
Direct answer to your question
- You cannot "specify model_type in the Python script" in a way that makes AutoModel recognize Qwen/Qwen3-VL-8B-Instruct-GGUF. model_type is read from config.json, and this repo is a GGUF split-package intended for llama.cpp. (Hugging Face)
- The correct model_type for Qwen3-VL in Transformers is qwen3_vl, but you should use it via the Transformers-native repo (Qwen/Qwen3-VL-8B-Instruct) and the proper model class. (Hugging Face)