We’ve been experimenting with reducing cold start latency for large LLM inference without relying on warm pools.
The usual pattern for “serverless” GPU inference is either:
- Keep the model resident 24/7 (pay idle cost), or
- Tear everything down and fully reinitialize on each scale-up event (slow cold start)
For 70B-class models, full initialization can take tens of seconds depending on disk, graph compilation, allocator stabilization, etc.
Instead of snapshotting at the container level, we’ve been testing snapshotting at the GPU runtime level after full initialization.
What we snapshot
After the model is fully initialized and stable, we snapshot:
- Model weights already mapped into GPU memory
- CUDA graphs compiled and ready
- Allocator state stabilized
- Kernel warmup completed
This is effectively capturing a “ready-to-serve” runtime state.
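In toy form, the lifecycle looks like this. None of these names come from a real snapshot API; the class below is just a state machine sketching the pattern, with the expensive GPU work stubbed out:

```python
from enum import Enum, auto

class RuntimeState(Enum):
    COLD = auto()
    READY = auto()        # fully initialized: weights mapped, graphs compiled
    SNAPSHOTTED = auto()  # runtime state captured, instance torn down

class InferenceRuntime:
    """Toy model of the snapshot-after-init pattern. The step names
    stand in for the real (and much slower) initialization work;
    nothing here touches a GPU."""

    INIT_STEPS = ("load_weights", "compile_cuda_graphs",
                  "stabilize_allocator", "kernel_warmup")

    def __init__(self):
        self.state = RuntimeState.COLD
        self.completed_steps = []

    def initialize(self):
        # Full cold init: the expensive path we want to pay only once.
        for step in self.INIT_STEPS:
            self.completed_steps.append(step)
        self.state = RuntimeState.READY

    def snapshot(self):
        # Capture the ready-to-serve baseline. Per-request state
        # (KV cache, in-flight tokens) is deliberately excluded.
        assert self.state is RuntimeState.READY
        image = {"steps": list(self.completed_steps)}
        self.state = RuntimeState.SNAPSHOTTED
        return image

    def restore(self, image):
        # Resume from the baseline instead of re-running initialize().
        self.completed_steps = list(image["steps"])
        self.state = RuntimeState.READY
```

The point of the shape: `initialize()` runs once per model version, `restore()` runs on every scale-up, and the snapshot image never contains request-scoped state.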
What we do not snapshot
- Active request state
- KV cache
- In-flight tokens
Each request still runs cleanly from a restored runtime baseline.
Result
On H100, we’re seeing restore times of roughly 2 seconds for a 70B model.
That’s restore of the initialized runtime, not full weight reload + graph rebuild.
For comparison, full cold initialization from scratch can take significantly longer due to:
- CUDA context creation
- Graph compilation
- Memory allocation churn
- Weight load from disk
- Fragmentation effects
Why this is interesting
This allows a scale-to-zero style pattern for large models without:
- Keeping warm pools alive
- Paying continuous idle GPU cost
- Accepting 30–60s cold starts
It also changes how you think about multi-model scheduling on a single GPU, since restoring becomes closer to a resume than a full boot.
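One way the resume-not-boot framing changes scheduling: if restore is cheap, a single GPU can treat resident models like an LRU cache and page the rest in from snapshots on demand. A toy sketch (the scheduler and its names are hypothetical, not from any real serving stack):

```python
import time
from collections import OrderedDict

class SnapshotScheduler:
    """Toy single-GPU, multi-model scheduler: at most `capacity`
    models resident at once; everything else exists only as a
    snapshot. `restore_fn(model)` stands in for the runtime restore
    (the ~2 s path), and eviction is assumed to be cheap because a
    snapshot already exists for any initialized model."""

    def __init__(self, capacity, restore_fn):
        self.capacity = capacity
        self.restore_fn = restore_fn
        self.resident = OrderedDict()  # model -> last-used time, in LRU order

    def acquire(self, model):
        if model in self.resident:
            self.resident.move_to_end(model)     # LRU hit: no restore needed
            self.resident[model] = time.time()
            return "hit"
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)    # evict LRU; snapshot persists
        self.restore_fn(model)                   # resume, not full boot
        self.resident[model] = time.time()
        return "restored"
```

With second-scale restores, the miss penalty here looks more like a cache fill than a cold boot, which is what makes packing several large models onto one GPU plausible.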
Tradeoffs we’re exploring
- Snapshot size vs restore latency
- Storage overhead
- Interaction with vLLM vs llama.cpp style runtimes
- Behavior under frequent scale-down / scale-up cycles
- Multi-tenant isolation considerations
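For the snapshot-size vs restore-latency tradeoff, a back-of-envelope model is useful. The bandwidth figures below are illustrative assumptions, not measurements from our setup; the only grounded number is that 70B parameters at fp16 is about 140 GB of weights:

```python
def restore_latency_s(snapshot_bytes, storage_gbps, copy_gbps):
    """Lower bound on restore time: read the snapshot from storage,
    then copy it into GPU memory, with no overlap between phases.
    Pipelining the read and the copy would push the total toward
    the max of the two terms rather than their sum."""
    read_s = snapshot_bytes / (storage_gbps * 1e9)
    copy_s = snapshot_bytes / (copy_gbps * 1e9)
    return read_s + copy_s

# Illustrative only: ~140 GB snapshot (70e9 params * 2 bytes, plus
# whatever graph/allocator state the image adds is ignored here),
# 50 GB/s storage read, 2 TB/s on-device copy.
estimate = restore_latency_s(140e9, storage_gbps=50, copy_gbps=2000)
```

The takeaway from the model is that storage read bandwidth dominates, so snapshot size and storage tier largely set the floor on restore latency.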
We’re still testing this under different traffic patterns and would be interested in feedback from others experimenting with GPU-level snapshotting or similar techniques.
Happy to share more details or compare notes with anyone exploring similar approaches.