How to load T5 xxl GGUF with Diffusers

I like playing with image/video generation. I used to do it with ComfyUI, an amazing tool that's reasonably easy to use. I was digging into ways to run Flux faster than ComfyUI on my Mac M1 with 8 GB of RAM. 8 GB isn't a lot, but it's doable, so I'm trying to save as much memory as possible, and running a whole backend plus the advanced ComfyUI frontend didn't help increase the RAM available for the diffusion process. This is my code:

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch
import gc
import os
from safetensors.torch import load_file

# 0.0 disables the MPS high-watermark memory limit (same idea as PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0)
torch.mps.set_per_process_memory_fraction(0.0)

def flush():
    gc.collect()
    torch.mps.empty_cache()
    gc.collect()
    torch.mps.empty_cache()

prompt = "A racing car"

main_folder = "/Users/me/Flux/model"

pipeline = FluxPipeline.from_pretrained(
    main_folder,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
).to("mps")

print("Encoding prompt.")
with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=256
    )

del pipeline

flush()


print('Load model')
ckpt_path = "/Users/me/ComfyUI/models/unet/flux-hyp8-Q4_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    main_folder,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("mps")


# Rendering
print("Running denoising.")
height, width = 1024, 1024

image = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=8,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("mps").manual_seed(42),
).images[0]

image.save("compile_image.png")

It works and generates images at about 100 s/it (instead of 300-400 s/it in ComfyUI), but I would like to speed it up a bit by using a quantized T5 xxl, just as I did in ComfyUI (and I'd like to be able to reuse the same files as ComfyUI, as I have done with the UNET).


Since T5EncoderModel is part of Transformers rather than Diffusers, it should be fine to load and use it as a Transformers model, but there may still be some bugs in the GGUF part.

If you don’t need to use the exact same file, you could load a different file that uses a different quantization method…
One possibility would be to first dequantize it and then re-quantize it on the fly in a different format?
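
For the dequantize-and-requantize idea, maybe something like this, assuming the Transformers GGUF loader handles the T5 file at all (as far as I know it dequantizes the weights to torch_dtype while loading) and using optimum-quanto for the on-the-fly re-quantization; the folder and file name are just the ones from your ComfyUI setup:

import torch
from transformers import T5EncoderModel
from optimum.quanto import quantize, freeze, qint8

# Loading via gguf_file dequantizes the GGUF weights to bf16 on load
# (assumption: T5 is among the architectures the GGUF loader supports).
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Volumes/T7/ML/ComfyUI/models/clip",
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)

# Re-quantize on the fly in a different format and freeze, replacing the
# bf16 weights with the quantized ones.
quantize(text_encoder_2, weights=qint8)
freeze(text_encoder_2)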

Thanks for the reply.
Three main questions come to mind.

  1. What should the code look like for GGUF with Transformers?
    This clearly does not work:
text_encoder_2 = T5ForConditionalGeneration.from_single_file(
    '/Volumes/T7/ML/ComfyUI/models/clip/t5-v1_1-xxl-encoder-Q3_K_S.gguf',
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
  2. What other quantization formats work on a Mac (mps) with a file size of 4-5 GB?

  3. Since ComfyUI works with the T5 xxl GGUF, does that mean it uses a completely different “backend” to run inference on the text?


1

Maybe like this (if there are no bugs):

text_encoder_2 = T5ForConditionalGeneration.from_pretrained(
    '/Volumes/T7/ML/ComfyUI/models/clip',
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)
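
If that loads, it still has to be wired into the pipeline. FluxPipeline normally expects the encoder-only T5EncoderModel for text_encoder_2, so loading that class instead is probably safer, and note that (as far as I know) the Transformers GGUF loader dequantizes the weights to torch_dtype, so the memory savings may be smaller than with the Diffusers GGUF path. A rough sketch of the prompt-encoding stage from your first post with the GGUF-loaded encoder plugged in (paths are your placeholders):

import torch
from diffusers import FluxPipeline
from transformers import T5EncoderModel

main_folder = "/Users/me/Flux/model"
prompt = "A racing car"

# Encoder-only T5 loaded from the ComfyUI GGUF file (dequantized to bf16 on load)
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Volumes/T7/ML/ComfyUI/models/clip",
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)

# Prompt-encoding stage only: no transformer or VAE loaded yet
pipeline = FluxPipeline.from_pretrained(
    main_folder,
    text_encoder_2=text_encoder_2,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
).to("mps")

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=256
    )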

2

How about torchao, bitsandbytes, or optimum-quanto?
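
Of those, bitsandbytes generally needs CUDA, so on Apple Silicon optimum-quanto (and possibly torchao) is the more likely fit. A rough sketch using the QuantoConfig integration in Transformers, quantizing the T5 to int8 while loading from a regular (non-GGUF) checkpoint; the subfolder name assumes your local Flux folder follows the usual repo layout:

import torch
from transformers import T5EncoderModel, QuantoConfig

# int8 weights bring T5-xxl down to roughly the 4-5 GB range asked about;
# "int4" and "float8" are other options quanto offers.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Users/me/Flux/model",
    subfolder="text_encoder_2",
    quantization_config=QuantoConfig(weights="int8"),
    torch_dtype=torch.bfloat16,
)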

3

Diffusers + Transformers, ComfyUI, and A1111 WebUI are all completely different programs. Although their purposes and results are largely the same, their implementations are different. Compatibility is provided for convenience, but it is better to convert files in advance to avoid problems. :sweat_smile:
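
For example, a one-off conversion of the ComfyUI GGUF into a regular Transformers checkpoint (which later runs can load without going through the GGUF code path) could look roughly like this; the output path is just a placeholder:

import torch
from transformers import T5EncoderModel

# Load the GGUF once (dequantized to bf16 on load) and save it as a normal
# safetensors checkpoint for future runs.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Volumes/T7/ML/ComfyUI/models/clip",
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)
text_encoder_2.save_pretrained("/Volumes/T7/ML/t5-v1_1-xxl-encoder-bf16")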