Hi everyone, we are setting up a distributed inference pipeline using a prefill-decode disaggregation topology to work around the memory-I/O bound on the local device. Prefill runs on a remote high-compute node, and decoding runs on a local edge node. If we deploy an INT4-quantized model (e.g., AWQ or GPTQ) on the edge node, does the incoming KV cache from the remote prefill node strictly need to be quantized into the same format before transmission? Or can the quantized attention layers on the decode node natively accept an FP16 KV-cache tensor transferred via RPC without significant overhead? Any insights on managing this quantization mismatch in split inference would be much appreciated.
Hmm… It probably depends on the backend, but there’s usually no need to match the KV cache format to the model weight format.
No. In the usual AWQ and GPTQ deployments, the KV cache does not need to be quantized into the same format as the model weights. The important distinction is that AWQ/GPTQ are primarily weight-quantization schemes, while KV-cache precision is a separate storage/runtime choice. Hugging Face documents AWQ and GPTQ as model quantization algorithms, and TensorRT-LLM’s quantization code treats W4A16_AWQ and W4A16_GPTQ as weight-only modes with quantize_activations=False, while kv_cache_quant_algo is a separate setting. (Hugging Face)
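To make that separation concrete, here is roughly what it looks like in TensorRT-LLM's Python quantization config (a sketch; exact import paths and enum members shift between releases):

```python
# Sketch: TensorRT-LLM keeps weight quantization and KV-cache quantization
# as independent knobs. Import paths/enum members may differ by version.
from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models.modeling_utils import QuantConfig

# W4A16 AWQ: INT4 weights, FP16 activations -- a weight-only mode.
# kv_cache_quant_algo is a separate setting; leaving it as None keeps
# the KV cache in the model's activation dtype (FP16/BF16).
quant_config = QuantConfig(
    quant_algo=QuantAlgo.W4A16_AWQ,
    kv_cache_quant_algo=None,  # e.g. QuantAlgo.FP8 only if you opt in to quantized KV
)
```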
Why this matters in prefill-decode disaggregation
Prefill-decode disaggregation exists because the two phases stress the hardware differently. DistServe and TensorRT-LLM both describe prefill as the prompt-processing phase that computes KV once, while decode is the token-by-token phase that repeatedly reads KV and is more sensitive to memory behavior and interference. vLLM’s disaggregated-prefill docs make the same split explicit: one instance does prefill, another does decode, and a connector transfers the prefill KV cache between them. (arXiv)
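In vLLM terms, the two roles look roughly like this (a sketch based on vLLM's disaggregated-prefill example; connector names and config fields vary across versions, and each role would normally run in its own process or node):

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# One vLLM instance per role; a KV connector moves the prefill KV cache
# between them. Shown in one file for clarity only.
prefill_node = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # same artifact on both nodes
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector", kv_role="kv_producer",
        kv_rank=0, kv_parallel_size=2,
    ),
)
decode_node = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector", kv_role="kv_consumer",
        kv_rank=1, kv_parallel_size=2,
    ),
)
```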
What the decode node actually cares about
A quantized decode node does not usually care whether the KV cache was “AWQ-formatted” or “GPTQ-formatted,” because those are not the normal cache formats. What it cares about is whether the incoming KV matches the cache contract that its runtime expects (a minimal sketch of such a contract follows the list):
- cache dtype such as FP16, BF16, FP8, or INT8
- cache layout and paging/block layout
- any scaling metadata required by quantized KV
- the same positional encoding and model semantics. (vLLM)
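The class and field names below are made up purely to illustrate that contract; they are not taken from any runtime:

```python
from dataclasses import dataclass

# Hypothetical "cache contract" between a prefill producer and a decode
# consumer. A real runtime encodes this implicitly in its cache engine.
@dataclass(frozen=True)
class KVCacheContract:
    dtype: str          # "fp16" | "bf16" | "fp8_e4m3" | "int8"
    block_size: int     # paged-attention block/page size, in tokens
    layout: str         # e.g. "(num_blocks, heads, head_dim, block_size)"
    has_scales: bool    # quantized KV needs scaling metadata alongside it
    rope_theta: float   # positional-encoding parameters must match

def compatible(prefill: KVCacheContract, decode: KVCacheContract) -> bool:
    # The decode node consumes KV natively only if every field matches.
    return prefill == decode
```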
Can an INT4 AWQ/GPTQ decode worker accept FP16 KV directly
Usually, yes. In vLLM, --dtype half is recommended for AWQ, and --kv-cache-dtype auto uses the model dtype. That means an AWQ model commonly runs with FP16 activations and FP16 KV unless you explicitly choose a different cache dtype. SGLang documents the same pattern: --dtype half is recommended for AWQ, and --kv-cache-dtype auto uses the model data type. That is why an INT4 weight-only decode model can normally consume FP16 KV natively. (vLLM)
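For example, a decode worker along these lines (the model name is just an illustrative AWQ checkpoint) serves INT4 weights while keeping the KV cache in FP16:

```python
from vllm import LLM

# INT4 AWQ weights with FP16 activations and an FP16 KV cache:
# kv_cache_dtype="auto" follows the model dtype, so nothing about the
# weight quantizer forces a quantized cache.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",           # recommended for AWQ per the vLLM docs
    kv_cache_dtype="auto",  # KV stays FP16, matching the model dtype
)
```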
So does the KV need to be quantized before transmission
Only if the decode-side KV cache format requires it. If the decode worker is configured for FP16 or BF16 KV, then sending FP16 or BF16 KV from prefill is the natural path. If the decode worker is configured for FP8 KV, then incoming FP16 KV is no longer the native format. In that case, either the prefill side should emit FP8 KV directly, or the decode side must quantize on ingest. TensorRT-LLM’s build docs explicitly expose KV-cache quantization separately, and both vLLM and SGLang document FP8 KV as a separate feature with separate scale handling. (NVIDIA Docs)
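If you do end up quantizing on ingest, the core operation is a scaled cast, sketched below with a single per-tensor scale for simplicity (real runtimes use calibrated per-layer or per-head scales):

```python
import torch

# Sketch of decode-side "quantize on ingest": FP16 KV arrives over the
# wire, the decode worker stores FP8 KV plus the scale used.
def ingest_fp16_kv_as_fp8(kv_fp16: torch.Tensor, scale: float) -> torch.Tensor:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448 for e4m3
    scaled = (kv_fp16.float() / scale).clamp(-fp8_max, fp8_max)
    return scaled.to(torch.float8_e4m3fn)  # dequantize later as fp8 * scale
```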
Where the real overhead lives
If the decode runtime is already using FP16 KV, there is usually little extra compute-path overhead from accepting FP16 KV. The bigger cost is transport and storage. NVIDIA’s inference optimization guide gives the standard KV-size formula and shows that for a Llama-2-7B-style model at 4096 tokens in FP16, batch size 1, the KV cache is about 2 GB. That means FP16 KV is often fine for correctness and native consumption, but it can be expensive to move over RPC and expensive to keep on the edge device. (NVIDIA Developer)
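The arithmetic is worth redoing for your own model; for the Llama-2-7B numbers cited above:

```python
# NVIDIA's KV-size formula:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem
#           * seq_len * batch
# Llama-2-7B: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes/elem).
layers, kv_heads, head_dim, elem_bytes = 32, 32, 128, 2
seq_len, batch = 4096, 1

kv_bytes = 2 * layers * kv_heads * head_dim * elem_bytes * seq_len * batch
print(kv_bytes / 2**30)  # 2.0 GiB -- the figure cited above
# Dropping to FP8 (elem_bytes = 1) halves both RPC bytes and edge memory.
```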
Why people still quantize KV
KV quantization exists because KV is persistent and large. Hugging Face’s cache docs say quantizing KV can significantly reduce memory requirements, but it comes at a speed cost. vLLM says FP8 KV can significantly reduce memory footprint, store more tokens, improve throughput, and support longer context windows. So KV quantization is usually a memory/bandwidth optimization, not a requirement imposed by AWQ or GPTQ. (Hugging Face)
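On the Hugging Face side, that tradeoff is a one-line opt-in at generation time, roughly as follows (a sketch; it assumes a quantization backend such as quanto is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of Hugging Face's quantized KV cache: less cache memory,
# at some decode-speed cost, independent of how weights are quantized.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tok("Hello", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=32,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
```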
The hidden trap: same architecture is not the same as same artifact
There are two different scenarios:
- Same model artifact on prefill and decode, different hardware roles. This is the normal and safest setup, and the main problem is then just cache-transfer compatibility. (vLLM)
- Full-precision model on prefill, a separately quantized AWQ/GPTQ model on decode. This is more than a dtype mismatch; it starts to look like cross-model KV reuse. Recent research exists precisely because sharing KV across different model realizations is not something to assume for free: DroidSpeak studies KV reuse across different LLMs with the same architecture, and PrefillShare explicitly factorizes a shared prefill module and tuned decode modules so the shared KV remains usable. (arXiv)
That does not mean your mixed-artifact setup cannot work. It means you should treat it as an empirical compatibility question, not as a guaranteed property of “same architecture + same tensor shape.” (arXiv)
Practical guidance for your setup
For a first deployment, the lowest-risk design is the following (a small parity check is sketched after the list):
- use the same AWQ or GPTQ artifact on both prefill and decode
- keep KV as FP16 or BF16 end to end
- verify correctness and latency first
- only move KV to FP8 if network bytes or edge KV memory become the bottleneck. (vLLM)
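A cheap way to enforce the first three points is a pre-flight parity check at deployment time; the helper below is hypothetical, just to show the idea:

```python
# Hypothetical pre-flight check for the lowest-risk design above: same
# artifact, same dtype, same KV dtype on both roles. Names are made up.
def check_parity(prefill_cfg: dict, decode_cfg: dict) -> None:
    for key in ("model", "dtype", "kv_cache_dtype"):
        assert prefill_cfg[key] == decode_cfg[key], f"mismatch on {key!r}"

prefill_cfg = {"model": "TheBloke/Llama-2-7B-AWQ", "dtype": "half",
               "kv_cache_dtype": "auto"}
decode_cfg = dict(prefill_cfg)  # deploy both roles from one source of truth
check_parity(prefill_cfg, decode_cfg)
```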
If you later switch to FP8 KV, do it deliberately. vLLM documents three scale strategies for FP8 KV, with dataset calibration recommended for best accuracy. SGLang says the KV scaling-factor file should generally be supplied for FP8 KV, otherwise scales default to 1.0 and accuracy may suffer. (vLLM)
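In miniature, dataset calibration just means deriving scales from the KV values that representative prompts actually produce, instead of defaulting to 1.0. Something like the sketch below, which real tooling performs per layer and per head before exporting the scales for the serving engine:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448 for e4m3

def calibrate_scale(kv_samples: list[torch.Tensor]) -> float:
    # Track the observed KV amax over calibration prompts and derive a
    # scale so the largest value maps to the FP8 range edge.
    amax = max(s.abs().max().item() for s in kv_samples)
    return max(amax, 1e-8) / FP8_MAX  # dequant: fp8_value * scale
```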
What “without significant overhead” means in practice
If the decode worker is configured for FP16 KV, then native consumption overhead is usually small. The large overhead is often networking, not attention math. PyTorch’s September 2025 writeup on disaggregated inference with vLLM says their KV connector transfers KV in parallel with model execution, using separate streams and threads to avoid GPU-op contention, but it also reports that the network can become the bottleneck over TCP under heavier load and that multi-stream transfer was needed to saturate bandwidth. (PyTorch)
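The overlap trick itself is simple to sketch with PyTorch streams, even though the production connector is far more involved:

```python
import torch

# Sketch of the overlap idea: stage incoming KV on a side stream so the
# copy runs concurrently with decode kernels on the default stream.
copy_stream = torch.cuda.Stream()

def stage_kv(kv_host: torch.Tensor, kv_device: torch.Tensor) -> torch.cuda.Event:
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        # Async H2D copy; kv_host should live in pinned memory.
        kv_device.copy_(kv_host, non_blocking=True)
        done.record()
    # Decode waits on this event only when it actually needs the block.
    return done
```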
Bottom line
Your local INT4 AWQ/GPTQ decode node can usually accept an FP16 KV cache directly. The incoming KV cache does not need to be quantized into the same AWQ/GPTQ format as the weights. The real compatibility boundary is the decode runtime’s KV-cache format and layout, not the weight quantizer. The reason to quantize KV before transmission is bandwidth or memory pressure, not because AWQ/GPTQ demand it. The one major caveat is that if your remote prefill uses a different model artifact from the local decode node, then you are no longer dealing with a simple precision mismatch. You are testing a form of cross-model KV reuse, which should be validated carefully. (NVIDIA GitHub)