How long does image generation with black-forest-labs/FLUX.1-dev take?

I run the code below on an RTX 3090 with a Ryzen 9 7900X and 128 GB RAM. Generating a single 512x512 image takes 20 minutes.
Is that normal? I read that it should only take seconds.

import torch
from diffusers import FluxPipeline
import time

start = time.time()
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "a wolf running"

images_ = pipe(
    prompt,
    # width=1920,
    # height=1088,
    width=512,
    height=512,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0)
).images

for i, image in enumerate(images_):
    image.save("flux-dev" + str(i) + ".png")

end = time.time()
print(f"Generation took {end - start:.2f} seconds")

CUDA is 12.1, Python is 3.10.
Packages (installed version | latest version):

GitPython 3.1.44 3.1.44
MarkupSafe 2.1.5 3.0.2
PyYAML 6.0.2 6.0.2
accelerate 1.9.0 1.9.0
aiofiles 23.2.1 24.1.0
altair 5.5.0 5.5.0
annotated-types 0.7.0 0.7.0
anyio 4.9.0 4.9.0
attrs 25.3.0 25.3.0
blinker 1.9.0 1.9.0
cachetools 6.1.0 6.1.0
certifi 2025.7.14 2025.7.14
charset-normalizer 3.4.2 3.4.2
click 8.2.1 8.2.1
colorama 0.4.6 0.4.6
diffusers 0.34.0 0.34.0
einops 0.8.1 0.8.1
exceptiongroup 1.3.0 1.3.0
fastapi 0.116.1 0.116.1
ffmpy 0.6.0 0.6.0
filelock 3.18.0 3.18.0
fire 0.7.0 0.7.0
flux 0.0.post58+g1371b2b 1.3.5
fsspec 2025.7.0 2025.7.0
gitdb 4.0.12 4.0.12
gradio 5.13.2 5.38.0
gradio-client 1.6.0 1.11.0
h11 0.16.0 0.16.0
httpcore 1.0.9 1.0.9
httpx 0.28.1 0.28.1
huggingface-hub 0.33.4 0.33.4
idna 3.10 3.10
importlib-metadata 8.7.0 8.7.0
invisible-watermark 0.2.0 0.2.0
jinja2 3.1.6 3.1.6
jsonschema 4.25.0 4.25.0
jsonschema-specifications 2025.4.1 2025.4.1
markdown-it-py 3.0.0 3.0.0
mdurl 0.1.2 0.1.2
mpmath 1.3.0 1.3.0
narwhals 1.48.0 1.48.0
networkx 3.4.2 3.5
numpy 2.2.6 2.3.1
opencv-python 4.12.0.88 4.12.0.88
orjson 3.11.0 3.11.0
packaging 25.0 25.0
pandas 2.3.1 2.3.1
pillow 11.3.0 11.3.0
pip 25.1.1 25.1.1
protobuf 6.31.1 6.31.1
psutil 7.0.0 7.0.0
pyarrow 21.0.0 21.0.0
pydantic 2.11.7 2.11.7
pydantic-core 2.33.2
pydeck 0.9.1 0.9.1
pydub 0.25.1 0.25.1
pygments 2.19.2 2.19.2
python-dateutil 2.9.0.post0 2.9.0.post0
python-multipart 0.0.20 0.0.20
pytz 2025.2 2025.2
pywavelets 1.8.0 1.8.0
referencing 0.36.2 0.36.2
regex 2024.11.6 2024.11.6
requests 2.32.4 2.32.4
rich 14.0.0 14.0.0
rpds-py 0.26.0 0.26.0
ruff 0.6.8 0.12.4
safehttpx 0.1.6 0.1.6
safetensors 0.5.3 0.5.3
semantic-version 2.10.0 2.10.0
sentencepiece 0.2.0 0.2.0
setuptools 57.4.0 80.9.0
shellingham 1.5.4 1.5.4
six 1.17.0 1.17.0
smmap 5.0.2 6.0.0
sniffio 1.3.1 1.3.1
starlette 0.47.2 0.47.2
streamlit 1.47.0 1.47.0
streamlit-drawable-canvas 0.9.3 0.9.3
streamlit-keyup 0.3.0 0.3.0
sympy 1.13.1 1.14.0
tenacity 9.1.2 9.1.2
termcolor 3.1.0 3.1.0
tokenizers 0.21.2 0.21.2
toml 0.10.2 0.10.2
tomlkit 0.13.3 0.13.3
torch 2.5.1+cu121 2.7.1
torchaudio 2.5.1+cu121 2.7.1
torchvision 0.20.1+cu121 0.22.1
tornado 6.5.1 6.5.1
tqdm 4.67.1 4.67.1
transformers 4.53.2 4.53.2
typer 0.16.0 0.16.0
typing-extensions 4.14.1 4.14.1
typing-inspection 0.4.1 0.4.1
tzdata 2025.2 2025.2
urllib3 2.5.0 2.5.0
uvicorn 0.35.0 0.35.0
watchdog 6.0.0 6.0.0
websockets 14.2 15.0.1
zipp 3.23.0 3.23.0

on a RTX 3090 with Ryzen 9 7900X and 128 GB RAM. So generating a single 512x512 image takes 20 minutes.
Is that normal?

Yeah. With that code, FLUX is loaded into VRAM/RAM in 16-bit without quantization, which requires roughly 36 GB or more. Since the 3090's 24 GB of VRAM is nowhere near enough, the GPU cannot be used effectively and inference takes a very long time (a rough memory estimate follows the list below). Therefore:

  1. Reduce VRAM consumption through quantization so that the entire model fits in VRAM, which speeds up processing
  2. Then optimize performance further with other methods
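
To put rough numbers on the "36 GB or more" figure above, here is a back-of-the-envelope estimate. The parameter counts and the effective bytes-per-parameter for NF4 are approximations, not exact values:

# Rough weight-memory estimate for FLUX.1-dev (parameter counts are approximate)
GB = 1024**3
params = {
    "transformer": 11.9e9,              # ~12B parameters
    "text_encoder_2 (T5-XXL)": 4.7e9,
    "text_encoder (CLIP-L)": 0.12e9,
    "vae": 0.08e9,
}
BF16 = 2.0   # bytes per parameter in bfloat16
NF4 = 0.6    # rough effective bytes per parameter for 4-bit NF4 (incl. quantization constants)

all_bf16 = sum(params.values()) * BF16 / GB
quantized = ((params["transformer"] + params["text_encoder_2 (T5-XXL)"]) * NF4
             + (params["text_encoder (CLIP-L)"] + params["vae"]) * BF16) / GB

print(f"all components in bf16:            ~{all_bf16:.0f} GB")   # ~31 GB of weights alone
print(f"transformer + T5 quantized to NF4: ~{quantized:.0f} GB")  # ~10 GB, fits in a 3090's 24 GB

Weights alone in bf16 already exceed 24 GB before activations and CUDA overhead are counted, which is why the model has to be quantized (or offloaded) to run well on a 3090.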

At minimum, quantization is necessary. For 4-bit quantization, I recommend bitsandbytes for ease of use or TorchAO for speed.
There used to be various limitations when combining these with LoRA, but those should be largely resolved by now.
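
As a minimal sketch of the bitsandbytes route, using the pipeline-level quantization config from recent Diffusers releases (the NF4 settings below are one common choice, not the only one):

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize the two largest components (the transformer and the T5 text encoder) to 4-bit NF4
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a wolf running", num_inference_steps=50, guidance_scale=3.5).images[0]
image.save("flux-dev-nf4.png")

With the large components quantized and everything resident in VRAM, generation times drop from tens of minutes to the order of a minute or two, as the follow-up below confirms.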

Optimization methods for FLUX:


Thanks for the answer. I was able to reduce the runtime from 20 minutes to 2 minutes.
Do you see any possible improvements to my code?
I adjusted it to:

import torch
from diffusers import FluxPipeline, DiffusionPipeline
import time, os
from diffusers.quantizers import PipelineQuantizationConfig
from datetime import datetime

start = time.time()

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# use channels_last memory format for the transformer
pipeline.transformer.to(memory_format=torch.channels_last)

prompt = "a wolf running" 

images_ = pipeline(
    prompt,
    width=1920,
    height=1088,
    # width=64,
    # height=64,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0)).images

for i, image in enumerate(images_):
    image.save(f"flux-dev{i}.png")

print(f"Generation took {time.time() - start:.2f} seconds")

There are no major issues, so you can proceed by adding further optimization methods on top of that.

The specific optimization methods available will vary depending on the OS and GPU, so there’s no one-size-fits-all solution. For example, on Windows, there are a few methods that don’t work outside of WSL2…

Since this project uses FLUX, I recommend the ParaAttention-based optimization mentioned earlier. That alone can speed things up significantly, even on a single GPU.
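
A minimal sketch of that approach, applied on top of a 4-bit-quantized pipeline. This assumes the para_attn package's first-block-cache adapter; apply_cache_on_pipe and the residual_diff_threshold value are taken from its README and may change between versions:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4",
                  "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# First-block cache: reuses transformer block outputs that barely change between
# denoising steps; a higher threshold means more caching (faster, slightly lower fidelity)
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

image = pipe("a wolf running", num_inference_steps=50, guidance_scale=3.5).images[0]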

Additionally, combining TorchAO with torch.compile can also improve performance. TorchAO is PyTorch's official quantization library, so it is generally fast. However, its behavior is still somewhat unstable, and picking the right quantization type requires some knowledge, so it may take some trial and error.

import torch
from diffusers import FluxPipeline, DiffusionPipeline
import time, os
from diffusers.quantizers import PipelineQuantizationConfig
from datetime import datetime

start = time.time()

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# memory format, offloading, and optional compilation
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.enable_model_cpu_offload()  # more memory-efficient: keeps idle components off the GPU
# pipeline.transformer.compile_repeated_blocks(fullgraph=True, dynamic=True)  # uncomment to compile the repeated transformer blocks

prompt = "a wolf running" 

images_ = pipeline(
    prompt,
    width=1920,
    height=1088,
    # width=64,
    # height=64,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0)).images
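
If you want to try the TorchAO route instead, a rough sketch is below. TorchAoConfig and the "int8wo" quant type come from the Diffusers TorchAO integration, and the exact type names can vary between torchao/diffusers versions; torch.compile also works best with the pipeline fully on the GPU rather than CPU-offloaded:

import torch
from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

# TorchAO int8 weight-only for the transformer, bitsandbytes 4-bit for the T5 encoder
quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": TorchAoConfig("int8wo"),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the transformer: the first call is slow (compilation), later calls are faster
pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune")

image = pipeline(
    "a wolf running",
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device="cuda").manual_seed(0),
).images[0]
image.save("flux-dev-torchao.png")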

Optimization guides beyond those listed above:

GitHub - sayakpaul/diffusers-torchao: End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training). (The quantization approach you are using is the newer Diffusers API, but this repository is still a useful reference for benchmarks and other considerations.)

