How to load T5 xxl GGUF with Diffusers

I like playing with image/video generation. I used to do it with ComfyUI, an amazing tool that's reasonably easy to use. I was digging into ways to run Flux faster than ComfyUI on my Mac M1 with 8 GB of RAM. 8 GB isn't a lot, but it's doable, so I'm trying to save as much memory as possible, and running a whole backend plus the advanced ComfyUI frontend didn't help increase the RAM available for the diffusion process. This is my code:

from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
import torch
import gc
import os
from safetensors.torch import load_file

# 0.0 disables the MPS high-watermark memory limit (same idea as PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0)
torch.mps.set_per_process_memory_fraction(0.0)

def flush():
    gc.collect()
    torch.mps.empty_cache()
    gc.collect()
    torch.mps.empty_cache()

prompt = "A racing car"

main_folder = "/Users/me/Flux/model"

pipeline = FluxPipeline.from_pretrained(
    main_folder,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
).to("mps")

print("Encoding prompt.")
with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=256
    )

del pipeline

flush()


print('Load model')
ckpt_path = "/Users/me/ComfyUI/models/unet/flux-hyp8-Q4_0.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    main_folder,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("mps")


# Rendering
print("Running denoising.")
height, width = 1024, 1024

image = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=8,
    guidance_scale=5.0,
    height=height,
    width=width,
    generator=torch.Generator("mps").manual_seed(42),
).images[0]

image.save("compile_image.png")

It works and generates images at about 100 s/it (instead of 300-400 s/it in ComfyUI), but I would like to speed it up a bit by using a quantized T5 xxl, just as I did in ComfyUI (and I'd like to be able to reuse the same files as ComfyUI, as I have done with the UNET).


Since T5EncoderModel is part of Transformers rather than Diffusers, it should be fine to load and use it as a Transformers model, but there may still be some bugs in the GGUF part.

If you don’t need to use the exact same file, you could load a different file that uses a different quantization method…
One possibility would be to first dequantize it and then re-quantize it on the fly in a different format?
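
For the dequantize-and-requantize idea, maybe something like this, assuming the Transformers GGUF loader handles the T5 file at all (as far as I know it dequantizes the weights to torch_dtype while loading) and using optimum-quanto for the on-the-fly re-quantization; the folder and file name are just the ones from your ComfyUI setup:

import torch
from transformers import T5EncoderModel
from optimum.quanto import quantize, freeze, qint8

# Loading via gguf_file dequantizes the GGUF weights to bf16 on load
# (assumption: T5 is among the architectures the GGUF loader supports).
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Volumes/T7/ML/ComfyUI/models/clip",
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)

# Re-quantize on the fly in a different format and freeze, replacing the
# bf16 weights with the quantized ones.
quantize(text_encoder_2, weights=qint8)
freeze(text_encoder_2)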

Thanks for the reply.
Three main questions come to mind.

  1. What should the code look like for GGUF with Transformers?
    This clearly does not work:
text_encoder_2 = T5ForConditionalGeneration.from_single_file(
    '/Volumes/T7/ML/ComfyUI/models/clip/t5-v1_1-xxl-encoder-Q3_K_S.gguf',
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
  2. What other quantization formats work on a Mac (mps) with a file size of 4-5 GB?

  3. Since ComfyUI works with the T5 xxl GGUF, does that mean it uses a completely different “backend” to run inference on the text?


1

Maybe like this (if there are no bugs):

text_encoder_2 = T5ForConditionalGeneration.from_pretrained(
    '/Volumes/T7/ML/ComfyUI/models/clip',
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)
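
If that loads, it still has to be wired into the pipeline. FluxPipeline normally expects the encoder-only T5EncoderModel for text_encoder_2, so loading that class instead is probably safer, and note that (as far as I know) the Transformers GGUF loader dequantizes the weights to torch_dtype, so the memory savings may be smaller than with the Diffusers GGUF path. A rough sketch of the prompt-encoding stage from your first post with the GGUF-loaded encoder plugged in (paths are your placeholders):

import torch
from diffusers import FluxPipeline
from transformers import T5EncoderModel

main_folder = "/Users/me/Flux/model"
prompt = "A racing car"

# Encoder-only T5 loaded from the ComfyUI GGUF file (dequantized to bf16 on load)
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Volumes/T7/ML/ComfyUI/models/clip",
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)

# Prompt-encoding stage only: no transformer or VAE loaded yet
pipeline = FluxPipeline.from_pretrained(
    main_folder,
    text_encoder_2=text_encoder_2,
    transformer=None,
    vae=None,
    torch_dtype=torch.bfloat16,
).to("mps")

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=prompt, max_sequence_length=256
    )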

2

How about torchao, bitsandbytes, or optimum-quanto?
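
Of those, bitsandbytes generally needs CUDA, so on Apple Silicon optimum-quanto (and possibly torchao) is the more likely fit. A rough sketch using the QuantoConfig integration in Transformers, quantizing the T5 to int8 while loading from a regular (non-GGUF) checkpoint; the subfolder name assumes your local Flux folder follows the usual repo layout:

import torch
from transformers import T5EncoderModel, QuantoConfig

# int8 weights bring T5-xxl down to roughly the 4-5 GB range asked about;
# "int4" and "float8" are other options quanto offers.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Users/me/Flux/model",
    subfolder="text_encoder_2",
    quantization_config=QuantoConfig(weights="int8"),
    torch_dtype=torch.bfloat16,
)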

3

Diffusers + Transformers, ComfyUI, and A1111 WebUI are all completely different programs. Although their purposes and results are largely the same, their implementations are different. Compatibility is provided for convenience, but it is better to convert files in advance to avoid problems. :sweat_smile:
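
For example, a one-off conversion of the ComfyUI GGUF into a regular Transformers checkpoint (which later runs can load without going through the GGUF code path) could look roughly like this; the output path is just a placeholder:

import torch
from transformers import T5EncoderModel

# Load the GGUF once (dequantized to bf16 on load) and save it as a normal
# safetensors checkpoint for future runs.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "/Volumes/T7/ML/ComfyUI/models/clip",
    gguf_file="t5-v1_1-xxl-encoder-Q3_K_S.gguf",
    torch_dtype=torch.bfloat16,
)
text_encoder_2.save_pretrained("/Volumes/T7/ML/t5-v1_1-xxl-encoder-bf16")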