How long does image generation with black-forest-labs/FLUX.1-dev take?

I run the code below on an RTX 3090 with a Ryzen 9 7900X and 128 GB RAM. Generating a single 512x512 image takes 20 minutes.
Is that normal? I read that it should only take seconds.

import torch
from diffusers import FluxPipeline
import time

start = time.time()
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "a wolf running"

images_ = pipe(
    prompt,
    # width=1920,
    # height=1088,
    width=512,
    height=512,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0)
).images

for i, image in enumerate(images_):
    image.save("flux-dev" + str(i) + ".png")

end = time.time()
print(f"Generation took {end - start:.2f} seconds")

CUDA is 12.1, Python is 3.10.
Packages (installed version | latest version):

GitPython 3.1.44 3.1.44
MarkupSafe 2.1.5 3.0.2
PyYAML 6.0.2 6.0.2
accelerate 1.9.0 1.9.0
aiofiles 23.2.1 24.1.0
altair 5.5.0 5.5.0
annotated-types 0.7.0 0.7.0
anyio 4.9.0 4.9.0
attrs 25.3.0 25.3.0
blinker 1.9.0 1.9.0
cachetools 6.1.0 6.1.0
certifi 2025.7.14 2025.7.14
charset-normalizer 3.4.2 3.4.2
click 8.2.1 8.2.1
colorama 0.4.6 0.4.6
diffusers 0.34.0 0.34.0
einops 0.8.1 0.8.1
exceptiongroup 1.3.0 1.3.0
fastapi 0.116.1 0.116.1
ffmpy 0.6.0 0.6.0
filelock 3.18.0 3.18.0
fire 0.7.0 0.7.0
flux 0.0.post58+g1371b2b 1.3.5
fsspec 2025.7.0 2025.7.0
gitdb 4.0.12 4.0.12
gradio 5.13.2 5.38.0
gradio-client 1.6.0 1.11.0
h11 0.16.0 0.16.0
httpcore 1.0.9 1.0.9
httpx 0.28.1 0.28.1
huggingface-hub 0.33.4 0.33.4
idna 3.10 3.10
importlib-metadata 8.7.0 8.7.0
invisible-watermark 0.2.0 0.2.0
jinja2 3.1.6 3.1.6
jsonschema 4.25.0 4.25.0
jsonschema-specifications 2025.4.1 2025.4.1
markdown-it-py 3.0.0 3.0.0
mdurl 0.1.2 0.1.2
mpmath 1.3.0 1.3.0
narwhals 1.48.0 1.48.0
networkx 3.4.2 3.5
numpy 2.2.6 2.3.1
opencv-python 4.12.0.88 4.12.0.88
orjson 3.11.0 3.11.0
packaging 25.0 25.0
pandas 2.3.1 2.3.1
pillow 11.3.0 11.3.0
pip 25.1.1 25.1.1
protobuf 6.31.1 6.31.1
psutil 7.0.0 7.0.0
pyarrow 21.0.0 21.0.0
pydantic 2.11.7 2.11.7
pydantic-core 2.33.2
pydeck 0.9.1 0.9.1
pydub 0.25.1 0.25.1
pygments 2.19.2 2.19.2
python-dateutil 2.9.0.post0 2.9.0.post0
python-multipart 0.0.20 0.0.20
pytz 2025.2 2025.2
pywavelets 1.8.0 1.8.0
referencing 0.36.2 0.36.2
regex 2024.11.6 2024.11.6
requests 2.32.4 2.32.4
rich 14.0.0 14.0.0
rpds-py 0.26.0 0.26.0
ruff 0.6.8 0.12.4
safehttpx 0.1.6 0.1.6
safetensors 0.5.3 0.5.3
semantic-version 2.10.0 2.10.0
sentencepiece 0.2.0 0.2.0
setuptools 57.4.0 80.9.0
shellingham 1.5.4 1.5.4
six 1.17.0 1.17.0
smmap 5.0.2 6.0.0
sniffio 1.3.1 1.3.1
starlette 0.47.2 0.47.2
streamlit 1.47.0 1.47.0
streamlit-drawable-canvas 0.9.3 0.9.3
streamlit-keyup 0.3.0 0.3.0
sympy 1.13.1 1.14.0
tenacity 9.1.2 9.1.2
termcolor 3.1.0 3.1.0
tokenizers 0.21.2 0.21.2
toml 0.10.2 0.10.2
tomlkit 0.13.3 0.13.3
torch 2.5.1+cu121 2.7.1
torchaudio 2.5.1+cu121 2.7.1
torchvision 0.20.1+cu121 0.22.1
tornado 6.5.1 6.5.1
tqdm 4.67.1 4.67.1
transformers 4.53.2 4.53.2
typer 0.16.0 0.16.0
typing-extensions 4.14.1 4.14.1
typing-inspection 0.4.1 0.4.1
tzdata 2025.2 2025.2
urllib3 2.5.0 2.5.0
uvicorn 0.35.0 0.35.0
watchdog 6.0.0 6.0.0
websockets 14.2 15.0.1
zipp 3.23.0 3.23.0

on a RTX 3090 with Ryzen 9 7900X and 128 GB RAM. So generating a single 512x512 image takes 20 minutes.
Is that normal?

Yeah. With that code, FLUX is loaded into VRAM/RAM in 16-bit without quantization, which requires roughly 36 GB or more. Since the 3090's 24 GB of VRAM is nowhere near enough, the GPU cannot be used effectively and inference takes a very long time (a rough memory estimate follows the list below). Therefore:

  1. Reduce VRAM consumption through quantization so that the entire model fits in VRAM, which speeds up processing
  2. Then optimize performance further with other methods
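
To put rough numbers on the "36 GB or more" figure above, here is a back-of-the-envelope estimate. The parameter counts and the effective bytes-per-parameter for NF4 are approximations, not exact values:

# Rough weight-memory estimate for FLUX.1-dev (parameter counts are approximate)
GB = 1024**3
params = {
    "transformer": 11.9e9,              # ~12B parameters
    "text_encoder_2 (T5-XXL)": 4.7e9,
    "text_encoder (CLIP-L)": 0.12e9,
    "vae": 0.08e9,
}
BF16 = 2.0   # bytes per parameter in bfloat16
NF4 = 0.6    # rough effective bytes per parameter for 4-bit NF4 (incl. quantization constants)

all_bf16 = sum(params.values()) * BF16 / GB
quantized = ((params["transformer"] + params["text_encoder_2 (T5-XXL)"]) * NF4
             + (params["text_encoder (CLIP-L)"] + params["vae"]) * BF16) / GB

print(f"all components in bf16:            ~{all_bf16:.0f} GB")   # ~31 GB of weights alone
print(f"transformer + T5 quantized to NF4: ~{quantized:.0f} GB")  # ~10 GB, fits in a 3090's 24 GB

Weights alone in bf16 already exceed 24 GB before activations and CUDA overhead are counted, which is why the model has to be quantized (or offloaded) to run well on a 3090.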

At minimum, quantization is necessary. For 4-bit quantization, I recommend bitsandbytes for ease of use or TorchAO for speed.
There used to be various limitations when combining these with LoRA, but those should be largely resolved by now.
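
As a minimal sketch of the bitsandbytes route, using the pipeline-level quantization config from recent Diffusers releases (the NF4 settings below are one common choice, not the only one):

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize the two largest components (the transformer and the T5 text encoder) to 4-bit NF4
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a wolf running", num_inference_steps=50, guidance_scale=3.5).images[0]
image.save("flux-dev-nf4.png")

With the large components quantized and everything resident in VRAM, generation times drop from tens of minutes to the order of a minute or two, as the follow-up below confirms.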

Optimization methods for FLUX:


Thanks for the answer. I was able to reduce the runtime from 20 minutes to 2 minutes.
Do you see any possible improvements to my code?
I adjusted it to:

import torch
from diffusers import FluxPipeline, DiffusionPipeline
import time, os
from diffusers.quantizers import PipelineQuantizationConfig
from datetime import datetime

start = time.time()

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# use channels_last memory format for the transformer
pipeline.transformer.to(memory_format=torch.channels_last)

prompt = "a wolf running" 

images_ = pipeline(
    prompt,
    width=1920,
    height=1088,
    # width=64,
    # height=64,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0)).images

for i, image in enumerate(images_):
    image.save(f"flux-dev{i}.png")

print(f"Generation took {time.time() - start:.2f} seconds")

There are no major issues, so you can proceed by adding further optimization methods on top of that.

The specific optimization methods available will vary depending on the OS and GPU, so there’s no one-size-fits-all solution. For example, on Windows, there are a few methods that don’t work outside of WSL2…

Since this project uses FLUX, I recommend the ParaAttention-based optimization mentioned earlier. That alone can speed things up significantly, even on a single GPU.
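
A minimal sketch of that approach, applied on top of a 4-bit-quantized pipeline. This assumes the para_attn package's first-block-cache adapter; apply_cache_on_pipe and the residual_diff_threshold value are taken from its README and may change between versions:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4",
                  "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# First-block cache: reuses transformer block outputs that barely change between
# denoising steps; a higher threshold means more caching (faster, slightly lower fidelity)
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

image = pipe("a wolf running", num_inference_steps=50, guidance_scale=3.5).images[0]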

Additionally, combining TorchAO with torch.compile can also improve performance. TorchAO is PyTorch's official quantization library, so it is generally fast. However, its behavior is still somewhat unstable, and picking the right quantization type requires some knowledge, so it may take some trial and error.

import torch
from diffusers import FluxPipeline, DiffusionPipeline
import time, os
from diffusers.quantizers import PipelineQuantizationConfig
from datetime import datetime

start = time.time()

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# memory format, offloading, and optional compilation
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.enable_model_cpu_offload()  # more memory-efficient: keeps idle components off the GPU
# pipeline.transformer.compile_repeated_blocks(fullgraph=True, dynamic=True)  # uncomment to compile the repeated transformer blocks

prompt = "a wolf running" 

images_ = pipeline(
    prompt,
    width=1920,
    height=1088,
    # width=64,
    # height=64,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0)).images
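
If you want to try the TorchAO route instead, a rough sketch is below. TorchAoConfig and the "int8wo" quant type come from the Diffusers TorchAO integration, and the exact type names can vary between torchao/diffusers versions; torch.compile also works best with the pipeline fully on the GPU rather than CPU-offloaded:

import torch
from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

# TorchAO int8 weight-only for the transformer, bitsandbytes 4-bit for the T5 encoder
quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": TorchAoConfig("int8wo"),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the transformer: the first call is slow (compilation), later calls are faster
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune")

image = pipeline(
    "a wolf running",
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
image.save("flux-dev-torchao.png")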

Optimization guides beyond those listed above:

GitHub - sayakpaul/diffusers-torchao: End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training). (The quantization approach you are using is the newer Diffusers API, but this repository is still a useful reference for benchmarks and other considerations.)

