Hello,
With 4 NVIDIA GeForce RTX 5060 Ti graphics cards with 8GB VRAM and Wan2.2 model, how many seconds of video can I produce from a photo?
Thank you.
It’s best to assume that the length of a video generated in a single run will be within 3 to 5 seconds.
On a PC with 4 × NVIDIA GeForce RTX 5060 Ti (8 GB VRAM each) and Wan2.2 series models, running locally (ComfyUI / Diffusers):
For a single image→video clip at “normal” quality:
Adding more GPUs (4 cards instead of 1) does not increase the length of a single clip.
So, per one generation “from a photo”, your realistic target is:
One Wan2.2 clip = about 3–5 seconds of video.
Four GPUs = more of these 3–5s clips at once, not one huge 20–30s clip.
Wan2.2 is designed around short clips. The key references all line up:
The official Wan2.2 TI2V-5B card states that this model generates 720p video at 24 fps, and benchmarks it as producing a 5-second 720p clip on a single consumer GPU. (modelscope.cn)
Wiro and other frontends describe the same 5B model as supporting both text-to-video and image-to-video at 720p/24fps. (wiro.ai)
Scenario’s official Wan2.2 help page says plainly:
InstaSD’s Wan2.2 guide reinforces this:
Fal’s Wan 2.2 API guide benchmarks specifically 5-second, 720p @ 24 fps clips (TI2V-5B ≈ 9 minutes per 5 s on an RTX 4090; 14B models on 8-GPU clusters). (blog.fal.ai)
In other words:
The architecture and training regime of Wan2.2 are tuned around:
Duration is simply:
\[
\text{seconds} = \frac{\text{frame count}}{\text{fps}}
\]
Examples:
Going much beyond this (more frames) is possible but leaves the “recommended” zone and tends to degrade quality or stability.
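As a quick check of the duration formula, here is a minimal Python sketch. The function names are illustrative only, and the frame counts (81 and 120 at 24 fps) are the ones mentioned elsewhere in this answer, not values required by Wan2.2.

```python
def clip_seconds(frame_count: int, fps: float) -> float:
    """Duration in seconds = frame count / fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def frames_for(seconds: float, fps: float) -> int:
    """Inverse helper: how many frames a target duration needs (rounded)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(seconds * fps)


if __name__ == "__main__":
    # Frame counts taken from the 3-5 second range discussed in this answer.
    for frames in (81, 120):
        print(f"{frames} frames @ 24 fps = {clip_seconds(frames, 24)} s")
```

Doubling the frame count at the same fps doubles the clip length, which is why longer clips cost proportionally more compute and memory.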
You have 4 × 5060 Ti 8 GB. Important facts:
Each GPU has its own 8 GB VRAM
VRAM does not simply add up to “32 GB” for a single Wan2.2 job. Out of the box, ComfyUI / Diffusers run each workflow on one GPU at a time.
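One common way to use several cards under that constraint is to start one process per GPU and pin each process to a single card with the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below only prepares the per-GPU environments and splits the input photos across the cards. The launch script name `generate_clip.py` in the comment is hypothetical and stands in for whatever ComfyUI or Diffusers entry point you actually run.

```python
import os


def per_gpu_envs(num_gpus: int) -> list[dict[str, str]]:
    """Build one environment per GPU so each process sees exactly one card."""
    envs = []
    for gpu_index in range(num_gpus):
        env = dict(os.environ)
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
        envs.append(env)
    return envs


def assign_round_robin(photos: list[str], num_gpus: int) -> list[list[str]]:
    """Distribute input photos across GPUs, one independent queue per card."""
    lanes: list[list[str]] = [[] for _ in range(num_gpus)]
    for i, photo in enumerate(photos):
        lanes[i % num_gpus].append(photo)
    return lanes


if __name__ == "__main__":
    envs = per_gpu_envs(4)
    lanes = assign_round_robin(["a.png", "b.png", "c.png", "d.png", "e.png"], 4)
    for gpu_index, lane in enumerate(lanes):
        # A real launcher could call, for example:
        #   subprocess.Popen(["python", "generate_clip.py", *lane], env=envs[gpu_index])
        # where generate_clip.py is a hypothetical script, not a Wan2.2 tool.
        print(f"GPU {gpu_index}: {lane}")
```

Each process still has only its own 8 GB to work with. The four cards raise throughput, not the memory available to any one job.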
Wan2.2 TI2V-5B is already optimized for 8 GB
The official ComfyUI Wan2.2 tutorial states:
The ModelScope card notes that TI2V-5B can generate a 5-second 720p video on a consumer GPU without special optimization, implying that 8–12 GB cards can run it with offload. (modelscope.cn)
Chimolog’s Wan2.2 GPU benchmarks focus on 5-second clips and show:
Tests in ComfyUI using Wan2.2 360p / 480p / 720p with real workflows (including the popular Kijai/EasyWan22 pipeline). (It’s a little tight.)
For 720p 5-second clips with that Kijai workflow, cards with ≤12 GB VRAM generally failed (OOM), and they conclude:
This tells you:
8 GB cards can absolutely run Wan2.2, especially at 480p or lighter settings.
For full 5s @ 720p using heavier Kijai/EasyWan22 workflows, 8 GB is borderline; you may need to:
Four GPUs = four lanes
In practice, on your PC:
Advanced multi-GPU sharding (FSDP / DeepSpeed or ComfyUI-MultiGPU / DisTorch) can spread Wan2.2/14B across multiple GPUs + RAM, but this mainly lets you:
On your 8 GB 5060 Ti:
Safe, everyday settings:
At 720p, with careful offload (Comfy native):
Still target ~3–5 seconds (e.g., 81–120 frames @ 24 fps), but:
Per clip, you do not exceed ~5 seconds comfortably; for longer content you chain clips.
Specs from the 14B model cards:
On your hardware:
The per-clip duration is still ~3–5 seconds:
Question:
“With 4 × RTX 5060 Ti (8 GB VRAM) and Wan2.2 series, how many seconds of video can I produce from a photo?”
Answer, in practical terms:
Per single Wan2.2 clip (local, from one photo):
Realistic, recommended range: about 3–5 seconds of video
Per GPU:
With 4 GPUs together:
You still get 3–5 seconds per clip, but you can generate:
To make a longer video (for example, 60 seconds), you:
So the clear, easy planning rule for your setup is:
Think: 3–5 seconds of video per Wan2.2 clip from a photo, per GPU.
Your 4 GPUs multiply how many of those 3–5s clips you can produce,
but they don’t extend a single clip beyond the model’s short-clip design.
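The planning rule can be turned into rough arithmetic. This sketch assumes every clip has the same length and that each GPU renders one clip at a time. It counts clips and parallel rounds only, and does not estimate render time or allow for overlap between chained clips.

```python
import math


def plan_clips(target_seconds: float, clip_seconds: float = 5.0, num_gpus: int = 4) -> tuple[int, int]:
    """Return (clips needed, parallel rounds) for a target video length."""
    if clip_seconds <= 0 or num_gpus <= 0:
        raise ValueError("clip_seconds and num_gpus must be positive")
    clips = math.ceil(target_seconds / clip_seconds)
    rounds = math.ceil(clips / num_gpus)
    return clips, rounds


if __name__ == "__main__":
    for clip_len in (3.0, 5.0):
        clips, rounds = plan_clips(60, clip_seconds=clip_len, num_gpus=4)
        print(f"60 s from {clip_len} s clips: {clips} clips, {rounds} rounds on 4 GPUs")
```

For a 60-second video, that is 12 clips at 5 seconds each (3 rounds on 4 GPUs) or 20 clips at 3 seconds each (5 rounds).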