CUDA out of memory in SD3

I am working on a cloth-swapping project in which I use GroundedSAM (to fetch cloth masks) and then a Stable Diffusion 3 pipeline with a ControlNet (the variant supported for SD3).

The problem is that when I create the pipeline object (pipe) for the pretrained SD3 model and move it to my available device ("cuda" in my case), it throws a CUDA out of memory error.

I have tried omitting the CUDA mapping and running the next parts, but then it complains that the outputs are on different devices and must be on the same device. I tried the model in Google Colab (pay-as-you-go, 16 GB VRAM) and also in RunPod with 24 GB VRAM. In both cases, I get the same error. I am attaching a small reproducible code sample for this.
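In case the attachment does not come through, the relevant part of my script looks roughly like this (a simplified sketch; the checkpoint names stand in for the ones I actually load):

```python
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline

# SD3-compatible ControlNet (checkpoint name is illustrative)
controlnet = SD3ControlNetModel.from_pretrained(
    "InstantX/SD3-Controlnet-Canny", torch_dtype=torch.float16
)

# SD3 pipeline built around the ControlNet
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)

# This is the line that raises OutOfMemoryError
pipe.to("cuda")
```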

Note: RoboFlow GroundedSAM is also running in the same environment.

I am also attaching a picture of my error (it originates right after the SD3 pipeline is created):

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 2.81 MiB is free. Process 2832761 has 23.63 GiB memory in use. Of the allocated memory 22.72 GiB is allocated by PyTorch, and 456.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am using the RoboFlow GroundedSAM (which can only run in Google Colab). Given this situation, can you suggest some solutions? Should I go for a GPU with more VRAM, like 32 GB on RunPod, or is there some other issue?

I have tried methods like clearing the CUDA cache and tuning the CUDA allocator configuration, but they didn't work.
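Concretely, this is roughly what I tried, following the hint at the end of the error message (the max_split_size_mb value is just one I experimented with):

```python
import gc
import os

# Allocator tuning must be set before the first CUDA allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Drop Python references, then release PyTorch's cached blocks
gc.collect()
torch.cuda.empty_cache()

print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")
print(f"{torch.cuda.memory_reserved() / 1024**3:.2f} GiB reserved")
```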


It seems that there is a mysterious bug specific to SD3.5 Medium.

Yes, but in my case I have GroundedSAM plus Stable Diffusion 3 inpainting with a ControlNet for SD3.


The pipeline is mostly the same, so I think problems that occur in one place are likely to occur elsewhere as well.

That said, SD 3.5 Medium is larger than I expected; the text encoder (T5) in particular is large. With this setup, it might be difficult to stay within the desired VRAM budget without quantization.
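For example, here is a rough sketch of loading the T5 encoder in 8-bit before building the pipeline, assuming a recent diffusers plus a bitsandbytes install (the exact repo ID depends on which SD3 checkpoint you use):

```python
import torch
from diffusers import StableDiffusion3Pipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

model_id = "stabilityai/stable-diffusion-3.5-medium"  # or the SD3 medium repo

# Load only the big T5-XXL text encoder in 8-bit via bitsandbytes
text_encoder_3 = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder_3,
    torch_dtype=torch.float16,
    device_map="balanced",  # 8-bit modules cannot simply be moved with .to("cuda")
)

# Cheaper alternative: drop T5 entirely (with a prompt-adherence trade-off)
# pipe = StableDiffusion3Pipeline.from_pretrained(
#     model_id, text_encoder_3=None, tokenizer_3=None, torch_dtype=torch.float16
# )
# pipe.enable_model_cpu_offload()
```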

I am already setting the torch_dtype argument to torch.float16, I run the pipeline on one image at a time, and I have reduced the image height and width to 640 instead of the optimal 1024.

Despite these customizations, the model still runs out of memory. On top of that, I have GroundedSAM in the same notebook, which by itself requires a minimum of 12-14 GB of VRAM. Using more than 24 GB of VRAM is not an option, as it is not feasible for the clients.
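One thing I could still try is to run the two models strictly one after the other and release GroundedSAM from the GPU before the SD3 pipeline is created, roughly like this (run_grounded_sam and sam_model are hypothetical stand-ins for the objects in my notebook):

```python
import gc
import torch

# 1) Run GroundedSAM and keep only its outputs (the cloth masks);
#    run_grounded_sam / sam_model are hypothetical placeholders
masks = run_grounded_sam(image)

# 2) Drop every reference to the GroundedSAM weights and free the GPU
del sam_model
gc.collect()
torch.cuda.empty_cache()

# 3) Only now build the SD3 pipeline, so it gets the whole GPU to itself
```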

However, can you give me an approximate amount of GPU memory required for the complete process (GroundedSAM + SD3 with ControlNet)? Or can you suggest some other ways of doing the inpainting?
