NVIDIA GeForce RTX 5060 Ti and Wan2.2 model

Hello,
With 4 NVIDIA GeForce RTX 5060 Ti graphics cards with 8GB VRAM and Wan2.2 model, how many seconds of video can I produce from a photo?

Thank you.


It’s best to assume that a single generation run will produce about 3 to 5 seconds of video.


1. Direct answer (what you can realistically expect)

On a PC with 4 × NVIDIA GeForce RTX 5060 Ti (8 GB VRAM each) and Wan2.2 series models, running locally (ComfyUI / Diffusers):

  • For a single image→video clip at “normal” quality:

    • Plan on about 3–5 seconds of video per clip.
    • This holds for the main Wan2.2 video models (TI2V-5B, T2V-A14B, I2V-A14B) at reasonable resolutions (480p–720p).
  • Adding more GPUs (4 cards instead of 1) does not increase the length of a single clip.

    • Instead, you can render several 3–5 second clips in parallel (up to 4 at once).
    • Longer final videos (30–60+ seconds) are made by stitching many short clips in an editor.

So, per one generation “from a photo”, your realistic target is:

One Wan2.2 clip = about 3–5 seconds of video.
Four GPUs = more of these 3–5s clips at once, not one huge 20–30s clip.


2. Why the limit is 3–5 seconds per clip (model-level constraints)

Wan2.2 is designed around short clips. The key references all line up:

  • The official Wan2.2 TI2V-5B card states that this model generates 720p video at 24 fps, and benchmarks it as producing a 5-second 720p clip on a single consumer GPU. (modelscope.cn)

  • Wiro and other frontends describe the same 5B model as supporting both text-to-video and image-to-video at 720p/24fps. (wiro.ai)

  • Scenario’s official Wan2.2 help page says plainly:

    • “Models work best with clips under 5 seconds in length, with optimal results using 120 frames or fewer” at 480p or 720p. (Scenario)
  • InstaSD’s Wan2.2 guide reinforces this:

    • “Wan2.2 works best with clips no longer than 5 seconds. Frame count ≤ 120 works well; 24 fps for cinematic, 16 fps for tests.” (instasd.com)
  • Fal’s Wan 2.2 API guide benchmarks specifically 5-second, 720p @ 24 fps clips (TI2V-5B ≈ 9 minutes per 5 s on an RTX 4090; 14B models on 8-GPU clusters). (blog.fal.ai)

In other words:

  • The architecture and training regime of Wan2.2 are tuned around:

    • Resolution: 480p–720p
    • Frame rate: ~24 fps (16 fps for cheaper tests)
    • Frame count: up to ~120 frames
  • Duration is simply:

seconds = frame count ÷ fps

Examples:

  • 81 frames @ 24 fps ≈ 3.4 seconds
  • 120 frames @ 24 fps = 5.0 seconds
  • 80 frames @ 16 fps = 5.0 seconds

Going much beyond this (more frames) is possible but leaves the “recommended” zone and tends to degrade quality or stability.
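That arithmetic is easy to script. A minimal helper, assuming the ~120-frame recommended ceiling from the guides cited above (the function names are ours, not from any Wan2.2 tooling):

```python
# Clip-length arithmetic for Wan2.2: seconds = frames / fps.

def clip_seconds(frames: int, fps: int) -> float:
    """Duration of a clip given its frame count and frame rate."""
    return frames / fps

def frames_for(seconds: float, fps: int, max_frames: int = 120) -> int:
    """Frame count for a target duration, capped at the ~120-frame
    recommended ceiling mentioned in the guides above."""
    return min(round(seconds * fps), max_frames)

print(clip_seconds(81, 24))   # 3.375 s, the common 81-frame setting
print(clip_seconds(120, 24))  # 5.0 s, the recommended maximum
print(frames_for(5, 16))      # 80 frames for a 5 s test render
```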


3. What your GPUs actually change (and what they don’t)

You have 4 × 5060 Ti 8 GB. Important facts:

  1. Each GPU has its own 8 GB VRAM
    VRAM does not simply add up to “32 GB” for a single Wan2.2 job. Out of the box, ComfyUI / Diffusers run each workflow on one GPU at a time.

  2. Wan2.2 TI2V-5B is already optimized for 8 GB

    • The official ComfyUI Wan2.2 tutorial states:

      • “The Wan2.2 5B version should fit well on 8 GB VRAM with the ComfyUI native offloading.” (Comfy Docs)
    • The ModelScope card notes that TI2V-5B can generate a 5-second 720p video on a consumer GPU without special optimization, implying that 8–12 GB cards can run it with offload. (modelscope.cn)

  3. Chimolog’s Wan2.2 GPU benchmarks focus on 5-second clips and show:

    • Tests in ComfyUI using Wan2.2 360p / 480p / 720p with real workflows (including the popular Kijai/EasyWan22 pipeline). (Chimolog)

    • For 720p 5-second clips with that Kijai workflow, cards with ≤12 GB VRAM generally failed (OOM), and they conclude:

      • To stably generate 5 seconds at 720p in that specific workflow, you realistically need at least an RTX 5060 Ti 16 GB or similar. (Chimolog)

    This tells you:

    • 8 GB cards can absolutely run Wan2.2, especially at 480p or lighter settings.

    • For full 5s @ 720p using heavier Kijai/EasyWan22 workflows, 8 GB is borderline; you may need to:

      • Lower resolution, or
      • Shorten clips, or
      • Use the heavier Comfy “native” offload mode that prioritizes fitting over speed.
  4. Four GPUs = four lanes
    In practice, on your PC:

    • Per clip: Wan2.2 still behaves like it’s on an 8 GB card → 3–5 seconds.
    • Per machine: you can run four such clips in parallel (one per GPU), or run them sequentially to build longer videos.
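One common way to use the four cards as independent lanes is to pin one worker process per GPU via `CUDA_VISIBLE_DEVICES`. A sketch under that assumption (`render_clip.py` is a hypothetical stand-in for your actual ComfyUI or Diffusers workflow script):

```python
import os

def lane_commands(num_gpus: int, script: str = "render_clip.py"):
    """Build one (command, environment) pair per GPU, so each 8 GB card
    renders its own 3-5 s clip independently of the others."""
    lanes = []
    for gpu in range(num_gpus):
        # Restrict each worker process to a single GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        lanes.append((["python", script, "--seed", str(gpu)], env))
    return lanes

# To actually launch the four workers in parallel:
# import subprocess
# procs = [subprocess.Popen(cmd, env=env) for cmd, env in lane_commands(4)]
# for p in procs:
#     p.wait()
```

Each process sees only one card, so the per-clip limits stay exactly as described above; you just get four clips finishing per batch instead of one.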

Advanced multi-GPU sharding (FSDP / DeepSpeed, or ComfyUI-MultiGPU / DisTorch) can spread the Wan2.2 14B models across multiple GPUs plus system RAM, but this mainly lets you:

  • Run bigger models (14B) or higher resolutions;
  • it does not stretch a single clip much beyond the ~5-second temporal window.

4. How this breaks down by Wan2.2 variant on your 4× 5060 Ti

4.1 Wan2.2 TI2V-5B (the main “from a photo” model)

  • Specs: 5B dense model, 720p @ 24 fps, unified text+image→video. (filtrix.ai)
  • Designed to generate up to ~5-second clips at that resolution on consumer GPUs. (modelscope.cn)

On your 8 GB 5060 Ti:

  • Safe, everyday settings:

    • 480p, 16–24 fps, 49–81 frames → about 3–5 seconds.
    • This matches Chimolog’s analysis, which shows VRAM usage staying reasonable between 49 and 81 frames at “HD-ish” resolution and explicitly calls 3–5 seconds (49–81 frames) the recommended length. (Chimolog)
  • At 720p, with careful offload (Comfy native):

    • Still target ~3–5 seconds (e.g., 81–120 frames @ 24 fps), but:

      • Expect slower render times than on a 4090, and
      • Use aggressive VRAM-saving options (split VAE, model offload, etc.).

Per clip, you do not exceed ~5 seconds comfortably; for longer content you chain clips.

4.2 Wan2.2 T2V-A14B / I2V-A14B (MoE 14B series)

  • Specs from the 14B model cards:

    • 14B active parameters, MoE.
    • Video at 480p and 720p, also used as 5-second benchmark in official docs and Fal’s guide. (Hugging Face)

On your hardware:

  • Running 14B naively on a single 8 GB is not realistic; it wants much more VRAM or multi-GPU. (Hugging Face)
  • With quantization (GGUF) + multi-GPU sharding (ComfyUI-MultiGPU, FSDP/Ulysses), you can make it fit and run at 480p.

The per-clip duration is still ~3–5 seconds:

  • The 14B series is benchmarked on 5-second 720p/480p clips, just like 5B. (blog.fal.ai)
  • The extra capacity (14B vs 5B, or multi-GPU) mainly buys quality, detail, or resolution, not longer per-clip duration.

4.3 Special hosted variants (e.g. Wan2.2-Fun-Control)

  • Some cloud-hosted Wan2.2 variants like Wan2.2-Fun-Control advertise up to 120s at 720p because they run on large multi-GPU servers and use specialized pipelines. (wavespeed.ai)
  • If you call those APIs from your PC, your local GPUs don’t limit clip length—the provider does.
  • For local ComfyUI / Diffusers on your 4× 5060 Ti, you should still think in terms of 3–5 seconds per clip.

5. Putting it all together for your question

Question:
“With 4 × RTX 5060 Ti (8 GB VRAM) and Wan2.2 series, how many seconds of video can I produce from a photo?”

Answer, in practical terms:

  1. Per single Wan2.2 clip (local, from one photo):

    • Realistic, recommended range: about 3–5 seconds of video

      • 480p or 720p,
      • 16–24 fps,
      • ~49–81 (or up to ~120) frames depending on fps and workflow.
  2. Per GPU:

    • Each 5060 Ti 8 GB behaves like one 3–5-second Wan2.2 lane.
    • The clip length is set by model design and frame count, not the number of GPUs.
  3. With 4 GPUs together:

    • You still get 3–5 seconds per clip, but you can generate:

      • 4 clips in parallel, or
      • Many clips one after another.
    • To make a longer video (for example, 60 seconds), you:

      • Generate ~12 clips of 5 seconds each (using all 4 GPUs to speed this up),
      • Then stitch those clips in an editor.
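The stitching step can be done losslessly with ffmpeg's concat demuxer. The sketch below only builds the clip-list file contents (the filenames are hypothetical); the final join command is shown in a comment:

```python
def concat_list(clips):
    """Contents of an ffmpeg concat-demuxer list file:
    one `file '...'` line per clip, in playback order."""
    return "".join(f"file '{c}'\n" for c in clips)

# 12 clips of ~5 s each -> roughly 60 s of final video.
clips = [f"clip_{i:02d}.mp4" for i in range(12)]
print(concat_list(clips))

# with open("clips.txt", "w") as f:
#     f.write(concat_list(clips))
# Then join without re-encoding (all clips must share codec/resolution/fps):
#   ffmpeg -f concat -safe 0 -i clips.txt -c copy final.mp4
```

If the clips come from different settings (mixed fps or resolution), drop `-c copy` and let ffmpeg re-encode instead.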

So the clear, easy planning rule for your setup is:

Think: 3–5 seconds of video per Wan2.2 clip from a photo, per GPU.
Your 4 GPUs multiply how many of those 3–5s clips you can produce,
but they don’t extend a single clip beyond the model’s short-clip design.