FluxPipeline loading finishes halfway with no error message

Hi, I’m encountering an issue while trying to use the FluxPipeline from the diffusers library. The pipeline seems to get stuck halfway when loading components, with no error message displayed. Here are the details:

Environment:

  • Windows 10
  • Python 3.12.4
  • NVIDIA GeForce GTX 1650 GPU
  • CUDA version: 11.8

Code:

import torch
from huggingface_hub import login
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name(device))
print(torch.cuda.mem_get_info(device))

login(token="...")
from diffusers import FluxPipeline
print(torch.version.cuda)

pipe_cpu = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float32
)
print('cpu')
pipe = pipe_cpu.to("cuda:0")
print("pip")
prompt = "A homeless cat holding a cardboard sign that says 'Hi Mom!'"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
print("image")
image.save("flux-dev.png")
print("save")

Output:

NVIDIA GeForce GTX 1650
(3485152052, 4294639616)
11.8
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:47<00:00, 23.54s/it]
Loading pipeline components...:  29%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  

The process gets stuck at this point with no further progress and no error message; it just drops back to the command prompt in cmd, as if the process were killed silently.

Any help or guidance would be greatly appreciated. Thank you!

#diffusers #fluxpipeline #cuda


First, since you specified 32-bit precision, loading the full model requires at least about 70GB of VRAM or RAM. Second, since you are trying to move the entire model onto the GPU with .to("cuda:0"), this would fail if there were not enough free VRAM, and a GTX 1650 only has 4GB.
Try rewriting it as follows. I think it will work if you have about 40GB of RAM.

pip install -U accelerate

import torch
from huggingface_hub import login
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.get_device_name(device))
print(torch.cuda.mem_get_info(device))

login(token="...")
from diffusers import FluxPipeline
print(torch.version.cuda)

pipe_cpu = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
print('cpu')
# pipe = pipe_cpu.to("cuda:0")  # don't move the whole model onto the GPU at once
pipe = pipe_cpu
pipe.enable_model_cpu_offload()  # moves each component to the GPU only while it runs
print("pip")
prompt = "A homeless cat holding a cardboard sign that says 'Hi Mom!'"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
print("image")
image.save("flux-dev.png")
print("save")

That will be problematic because I don't have that much computational power. Is there a way to do it with 16GB?


It's difficult. First of all, the FLUX model exceeds 30GB even at 16-bit precision. Diffusers doesn't support anything below 16-bit except through quantization (8-bit or 4-bit), and when quantizing, the entire model must in principle sit on the GPU, so a GPU with about 12GB of VRAM is required. The practical minimum for CPU-only inference is a little over 30GB of RAM.
However, if you don't have enough RAM, the OS will usually fall back to the HDD or SSD as virtual memory, so it may still run.
It will be slow, but I personally think it would probably just about work even with 16GB of RAM. I don't recommend it, though.
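
For reference, the quantized route looks roughly like this. This is a minimal sketch assuming a recent diffusers release with bitsandbytes installed (pip install -U bitsandbytes); the 4-bit NF4 settings are just illustrative defaults:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, BitsAndBytesConfig

# Quantize the large transformer to 4-bit NF4 so it fits in roughly 12GB of VRAM
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Build the pipeline around the quantized transformer
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep the text encoders and VAE off the GPU when idle

image = pipe("A cat holding a sign", num_inference_steps=50).images[0]

Even then, this still needs a GPU in the ~12GB class, so it would not help on a 4GB GTX 1650.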
