Hi,
I’m using the SAM2 model for video streaming (SAM2 Video). With each processed frame, GPU memory usage (`torch.cuda.memory_allocated()`) increases steadily until it eventually runs out of memory (CUDA OOM).
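For reference, this is roughly how I’m tracking it per frame (just standard PyTorch memory stats, simplified):

```python
import torch

def log_vram(frame_idx: int) -> None:
    # memory_allocated() only counts tensors PyTorch still holds references to,
    # so steady growth here means something is retaining tensors per frame,
    # not just cached allocator blocks.
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"frame {frame_idx}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")
```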
I’ve tried the following (a simplified sketch of my per-frame loop is below the list):

- Setting `max_vision_features_cache_size=1`
- Calling `reset_tracking_data()` and `reset_inference_session()` periodically
- Deleting all local tensors after each frame and running `gc.collect()` + `torch.cuda.empty_cache()`
- Loading frames one-by-one from disk (no large RAM usage)
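Roughly, the per-frame loop with that cleanup looks like this (simplified; `load_frame` and `run_sam2_on_frame` are placeholders for my actual frame reading and SAM2 video inference calls, and `inference_session` is the SAM2 video inference session object):

```python
import gc
import torch

def stream(inference_session, num_frames: int, reset_every: int = 100):
    for frame_idx in range(num_frames):
        frame = load_frame(frame_idx)  # placeholder: reads one frame from disk
        with torch.inference_mode():
            # placeholder: the actual SAM2 video forward / mask propagation step
            masks = run_sam2_on_frame(inference_session, frame)

        # periodic reset attempt mentioned above (reset_inference_session() similarly)
        if frame_idx and frame_idx % reset_every == 0:
            inference_session.reset_tracking_data()

        # drop local references and release unused cached blocks
        del frame, masks
        gc.collect()
        torch.cuda.empty_cache()

        # still grows linearly despite all of the above
        print(f"frame {frame_idx}: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB allocated")
```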
Despite this, allocated memory grows linearly with every frame, suggesting that something in the SAM2 streaming pipeline is keeping GPU tensors alive for all processed frames.
Has anyone else experienced this? Is there a known workaround to keep VRAM usage stable during long streaming inference without reloading the model each time?
— System Info —
Platform: Linux-6.16.3-76061603-generic-x86_64-with-glibc2.35
Python: 3.11.13 | packaged by conda-forge | (main, Jun 4 2025, 14:48:23) [GCC 13.3.0]
— PyTorch and CUDA Info —
PyTorch Version: 2.7.1+cu128
Is CUDA available: True
CUDA Version: 12.8
cuDNN Version: 90701
GPU Name: NVIDIA GeForce RTX 5060 Ti
— Transformers Info —
Transformers Version: 4.57.0.dev0