SAM2 video streaming – VRAM usage keeps increasing until OOM

Hi,

I’m using the SAM2 model for video streaming (SAM2 Video). With each processed frame, GPU memory usage reported by torch.cuda.memory_allocated() increases steadily until inference eventually fails with a CUDA out-of-memory (OOM) error.

I’ve tried:

  • Setting max_vision_features_cache_size=1

  • Calling reset_tracking_data() and reset_inference_session() periodically

  • Deleting all local tensors after each frame and running gc.collect() + torch.cuda.empty_cache()

  • Loading frames one-by-one from disk (no large RAM usage)

Despite this, allocated memory grows linearly with every frame, suggesting that something in the SAM2 streaming pipeline is keeping GPU tensors alive for all processed frames.
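For context, this is roughly the harness I use to watch the growth. It is a minimal sketch only: stream_with_memory_log, process_frame, and the *.jpg frame layout are placeholders I made up for illustration, not SAM2 API, and the actual SAM2 streaming call is passed in by the caller.

```python
import gc
from pathlib import Path

import torch
from PIL import Image


def stream_with_memory_log(frame_dir, process_frame, log_every=50):
    """Run the SAM2 per-frame call (`process_frame`, supplied by the caller)
    over frames loaded one at a time, applying the cleanup steps listed above
    and logging allocated VRAM so the growth can be charted."""
    for i, path in enumerate(sorted(Path(frame_dir).glob("*.jpg"))):
        frame = Image.open(path).convert("RGB")
        masks = process_frame(frame)

        # Cleanup attempted after every frame: drop local references, force a
        # GC pass, and return cached blocks to the CUDA allocator.
        del frame, masks
        gc.collect()
        torch.cuda.empty_cache()

        if i % log_every == 0:
            # memory_allocated() only counts tensors that are still referenced,
            # so a linear climb here points at the inference session (not the
            # allocator cache) holding on to per-frame state.
            print(f"frame {i}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated")
```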

Has anyone else experienced this? Is there a known workaround to keep VRAM usage stable during long streaming inference without reloading the model each time?

— System Info —
Platform: Linux-6.16.3-76061603-generic-x86_64-with-glibc2.35
Python: 3.11.13 | packaged by conda-forge | (main, Jun 4 2025, 14:48:23) [GCC 13.3.0]

— PyTorch and CUDA Info —
PyTorch Version: 2.7.1+cu128
Is CUDA available: True
CUDA Version: 12.8
cuDNN Version: 90701
GPU Name: NVIDIA GeForce RTX 5060 Ti

— Transformers Info —
Transformers Version: 4.57.0.dev0


It seems to be a known issue specific to SAM2.


Hey, thanks a ton for such a detailed reply! Really appreciate you breaking down the possible causes and sharing concrete steps to try. I’ll give your suggestions a go and report back once I’ve tested them out. 🙌

Thanks again for the help! 🙌


Hey,

Indeed, moving the tensors to the CPU by setting video_storage_device="cpu" and inference_state_device="cpu" helped a lot. I can now segment significantly more video frames before running out of memory.
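For anyone who finds this later, the change on my side was roughly the snippet below. Treat it as a sketch only: the checkpoint id and the Sam2VideoModel / Sam2VideoProcessor / init_video_session names follow the Transformers SAM2 video examples as I understand them, so the exact signature is an assumption; the two device arguments are the ones quoted above.

```python
from transformers import Sam2VideoModel, Sam2VideoProcessor

# Checkpoint id is only an example; use whichever SAM2 checkpoint you already load.
checkpoint = "facebook/sam2.1-hiera-tiny"
device = "cuda"

model = Sam2VideoModel.from_pretrained(checkpoint).to(device)
processor = Sam2VideoProcessor.from_pretrained(checkpoint)

# Keep the model itself on the GPU, but push the cached frames / vision features
# and the tracking state to CPU RAM so VRAM stays roughly flat as the stream grows.
inference_session = processor.init_video_session(
    inference_device=device,
    video_storage_device="cpu",
    inference_state_device="cpu",
)
```

The per-frame prompting and propagation loop stayed the same for me; only the session setup changed.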

However, I’m still facing one more issue: the segmentation occasionally loses accuracy, and the masks start to develop “holes.” This seems to happen in cycles — after a short while, the segmentation stabilizes again, and then the problem repeats. I suspect it might be related to how the tracking state is updated over time, so I’ll experiment with fine-tuning the tracking parameters to see if it helps.

If anyone has experienced similar cyclic accuracy drops in SAM2 video streaming, I’d be very interested to hear what worked for you.

Thanks again for your help!


and the masks start to develop “holes.”

I was able to reproduce it in the Colab environment. It seems fixable.


Hey, I can’t thank you enough for your comprehensive answer and the time you put into it. The explanation and the code you provided are incredibly helpful.

I now fully understand that this is a result of temporal memory drift and greedy memory updates. I will implement the stability-gated memory updates and short-window voting, and I’ll experiment with the other methods you suggested.
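In case it helps anyone else, here is the rough shape of what I plan to implement. It is only a sketch of my reading of those two ideas: the MaskStabilizer class, the window size, and the IoU threshold are placeholders I chose for illustration, not anything from SAM2 itself.

```python
from collections import deque

import numpy as np


def mask_iou(a, b):
    """IoU between two boolean masks; defined as 1.0 when both are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else float(inter) / float(union)


class MaskStabilizer:
    """Short-window voting plus a simple stability gate.

    Keeps the last `window` raw masks, emits a per-pixel majority vote to
    smooth out transient holes, and exposes a `stable` flag that can gate
    whether the current frame should be written into the tracker's memory.
    """

    def __init__(self, window=5, iou_threshold=0.85):
        self.history = deque(maxlen=window)
        self.iou_threshold = iou_threshold

    def update(self, raw_mask):
        raw_mask = np.asarray(raw_mask, dtype=bool)
        # A frame counts as "stable" if its raw mask overlaps well with the
        # previous raw mask; only such frames would update the memory bank.
        stable = bool(self.history) and mask_iou(raw_mask, self.history[-1]) >= self.iou_threshold
        self.history.append(raw_mask)
        # Per-pixel majority vote over the short window.
        voted = np.stack(self.history).mean(axis=0) >= 0.5
        return voted, stable
```

The plan is to show the voted mask downstream and only feed frames flagged as stable back into the tracker’s memory, then tune the window and threshold from there.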

Thank you again for your amazing support. It’s fantastic to see such dedication in this community. 👏
