I tried to duplicate the Space “ameerazam08/FLUX.1-dev-Inpainting-Model-Beta-GPU” on a paid GPU provided by Hugging Face, but every time I get a memory error. When I run the same Space on ZeroGPU it works fine; the issue only arises on the paid GPUs. I tried Nvidia T4 Small, Nvidia T4 Medium, Nvidia 1x L4, Nvidia A10G Small, and Nvidia A10G Large.
That space uses FLUX without quantization, so it needs roughly 35 GB of VRAM or more. A Zero GPU Space provides up to 40 GB of VRAM on the old version and up to 70 GB on the new one, so to reproduce this on a metered GPU Space you’d need at least an L40S.
Alternatively, you could quantize the model to save VRAM, but the code appears to have been written for a version of Diffusers from before quantization support was added.
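For reference, with a recent Diffusers (roughly 0.31 or later, plus bitsandbytes) on-the-fly quantization looks something like this. It’s an untested sketch against the stock FluxPipeline and the standard FLUX.1-dev checkpoint, not against that Space’s custom pipeline code:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

repo = "black-forest-labs/FLUX.1-dev"

# NF4 quantization shrinks the ~24 GB bf16 transformer to roughly 6-7 GB.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    repo,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    repo,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # offload idle components to cut VRAM further
```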
Do you think I should apply quantization to it before uploading it?
Actually, I tried up to the Nvidia 4xL40S:
- 48 vCPU
- 382 GB RAM
- 192 GB VRAM
But I always get this error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 44.40 GiB of which 24.31 MiB is free. Including non-PyTorch memory, this process has 44.37 GiB memory in use. Of the allocated memory 43.94 GiB is allocated by PyTorch, and 19.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics — PyTorch 2.7 documentation)
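As the message says, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can help with fragmentation, though with only ~24 MiB free it probably won’t rescue a genuinely full GPU. If you want to try it anyway, set it before torch initializes CUDA (or as an environment variable in the Space settings):

```python
# Must run before torch allocates any CUDA memory, so put it at the very
# top of app.py, before importing torch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator setting is in place
```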
That’s strange. PyTorch has allocated more than 43 GB of VRAM, yet the program is trying to allocate even more. The model alone should only need about 38 GB of VRAM…
Of course, it may need a little more VRAM during inference.
Also, note that the error only mentions GPU 0 (44.40 GiB total): even on the 4xL40S, everything is being loaded onto a single card, so the other three GPUs go unused. I wonder where it’s consuming so much…
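Since only GPU 0 is being filled, something like one of the following might help. Again an untested sketch using the stock FluxPipeline; the Space’s custom pipeline code would need the same changes:

```python
import torch
from diffusers import FluxPipeline

repo = "black-forest-labs/FLUX.1-dev"

# Option 1: let Diffusers spread the pipeline components across all visible
# GPUs. Recent versions accept "balanced" as a pipeline-level device_map.
pipe = FluxPipeline.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

# Option 2: keep models on the CPU and move each one to the GPU only while
# it is actually running, trading speed for a much lower VRAM peak.
# pipe = FluxPipeline.from_pretrained(repo, torch_dtype=torch.bfloat16)
# pipe.enable_model_cpu_offload()
```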
Do you think I should apply quantization to it before uploading it?
No, you can save VRAM by quantizing on the fly within the code rather than during upload.
However, this program includes several pieces of custom code outside of Diffusers, so quantization isn’t really recommended here, given the effort it would take to rewrite that code.