I tried to duplicate the Space “ameerazam08/FLUX.1-dev-Inpainting-Model-Beta-GPU” on a paid GPU provided by Hugging Face, but every time I get a memory error. When I run the same Space on ZeroGPU it works fine; the issue only arises on the paid GPUs. I tried Nvidia T4 Small, Nvidia T4 Medium, Nvidia 1x L4, Nvidia A10G Small, and Nvidia A10G Large.
That space uses FLUX without quantization, so it needs roughly 35 GB of VRAM or more. A Zero GPU Space provides up to 40 GB of VRAM on the old version and up to 70 GB on the new one, so to reproduce this on a metered GPU Space you’d need at least an L40S.
Alternatively, you could quantize the model to save VRAM, but the code appears to have been written for a version of Diffusers from before quantization support was added.
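For reference, with a recent Diffusers (roughly 0.31 or later, plus bitsandbytes) on-the-fly quantization looks something like this. It’s an untested sketch against the stock FluxPipeline and the standard FLUX.1-dev checkpoint, not against that Space’s custom pipeline code:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

repo = "black-forest-labs/FLUX.1-dev"

# NF4 quantization shrinks the ~24 GB bf16 transformer to roughly 6-7 GB.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    repo,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    repo,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # offload idle components to cut VRAM further
```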
Do you think I should apply quantization to it before uploading it?
Actually, I tried up to the Nvidia 4xL40S:
- 48 vCPU
- 382 GB RAM
- 192 GB VRAM
But I always get this error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 44.40 GiB of which 24.31 MiB is free. Including non-PyTorch memory, this process has 44.37 GiB memory in use. Of the allocated memory 43.94 GiB is allocated by PyTorch, and 19.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.7 documentation)
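As the message says, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can help with fragmentation, though with only ~24 MiB free it probably won’t rescue a genuinely full GPU. If you want to try it anyway, set it before torch initializes CUDA (or as an environment variable in the Space settings):

```python
# Must run before torch allocates any CUDA memory, so put it at the very
# top of app.py, before importing torch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator setting is in place
```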
That’s strange. PyTorch has allocated more than 43 GB of VRAM, yet the program is trying to allocate even more. The model alone should only need about 38 GB of VRAM…
Of course, it may need a little more VRAM during inference.
Also, note that the error only mentions GPU 0 (44.40 GiB total): even on the 4xL40S, everything is being loaded onto a single card, so the other three GPUs go unused. I wonder where it’s consuming so much…
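Since only GPU 0 is being filled, something like one of the following might help. Again an untested sketch using the stock FluxPipeline; the Space’s custom pipeline code would need the same changes:

```python
import torch
from diffusers import FluxPipeline

repo = "black-forest-labs/FLUX.1-dev"

# Option 1: let Diffusers spread the pipeline components across all visible
# GPUs. Recent versions accept "balanced" as a pipeline-level device_map.
pipe = FluxPipeline.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

# Option 2: keep models on the CPU and move each one to the GPU only while
# it is actually running, trading speed for a much lower VRAM peak.
# pipe = FluxPipeline.from_pretrained(repo, torch_dtype=torch.bfloat16)
# pipe.enable_model_cpu_offload()
```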
Do you think I should apply quantization to it before uploading it?
No, you can save VRAM by quantizing on the fly within the code rather than during upload.
However, this program includes several pieces of custom code outside of Diffusers, so quantization isn’t really recommended here, given the effort it would take to rewrite that code.