How to parallelize inference on a quantized model

I would like to parallelize generation across multiple GPUs and also load the model quantized.

The code below handles the first part (sharding the model across GPUs). How would I also incorporate loading the model in a quantized manner?

from transformers import pipeline, AutoConfig, T5ForConditionalGeneration, AutoTokenizer, BitsAndBytesConfig
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from huggingface_hub import snapshot_download

checkpoint = 'google/flan-ul2'

# Build an empty (meta-device) skeleton of the model from its config
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model_loaded = T5ForConditionalGeneration(config)

# Download (or reuse the cached) weights and shard them across the available GPUs
weights_path = snapshot_download(checkpoint)
model_comb = load_checkpoint_and_dispatch(
    model_loaded,
    weights_path,                         # load_checkpoint_and_dispatch needs the checkpoint path
    device_map='auto',                    # 'auto' spreads the layers over all visible GPUs
    no_split_module_classes=["T5Block"],  # never split a T5 block across devices
)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

generator = pipeline('text2text-generation', model=model_comb, tokenizer=tokenizer)

Whether offloading to CPU is possible depends on the quantization library, but multi-GPU use does seem to be possible. I'm just not sure whether simply specifying device_map through the accelerate library will work.
If it's still unsupported or buggy, I suppose you'd have to piece together information like the following to work around it…
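For instance, if the quantization library is bitsandbytes, the "just specify device_map" idea would look roughly like the sketch below. I haven't verified these exact 4-bit settings on flan-ul2, so treat it as a starting point rather than a recipe.

import torch
from transformers import pipeline, T5ForConditionalGeneration, AutoTokenizer, BitsAndBytesConfig

checkpoint = 'google/flan-ul2'

# Quantize with bitsandbytes and let accelerate shard the quantized layers across GPUs
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = T5ForConditionalGeneration.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,
    device_map='auto',  # places the layers on all visible GPUs
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
generator = pipeline('text2text-generation', model=model, tokenizer=tokenizer)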

That said, I don't really know how people with multiple GPUs handle this in practice.
I've seen at least two people on the forum complain that load balancing across multiple GPUs doesn't work properly, so keep that in mind. It might be a bug.

Thanks for the response @John6666. So I figured out from this post that just using the regular syntax with device_map='auto' does parallelize across GPUs automatically. The relevant part is at the bottom of the page.

I'm still new to this, but interestingly, out of the 4 GPUs I have, one of them is not getting any utilization at all.
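In case it helps anyone else, this is roughly how I'm checking where the layers landed and nudging the dispatcher; the max_memory values are placeholders for my own cards, not a recommendation.

import torch
from transformers import T5ForConditionalGeneration, BitsAndBytesConfig

checkpoint = 'google/flan-ul2'
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

# Cap memory per GPU so the dispatcher has to spread layers over all four cards
model = T5ForConditionalGeneration.from_pretrained(
    checkpoint,
    quantization_config=bnb_config,
    device_map='auto',
    max_memory={0: '18GiB', 1: '18GiB', 2: '18GiB', 3: '18GiB'},
)

# Shows which device each module actually ended up on
print(model.hf_device_map)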

The HF documentation consists of introductory guides and an API reference that is automatically generated from the library. (It is extracted from the docstrings in the code.)
The guides often describe theoretical ideals, or information that was correct when it was written but is now outdated, so in the end it is quicker to read the library code or to study and borrow from the work of people who have already made it work.
It would be easiest if it could simply be fixed by updating the library…

But when it comes to multi-GPU setups, few people use them on HF's Spaces, so if the integration is buggy, you'll have to do it manually with torch.
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
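The core idea in that tutorial, as a minimal toy sketch (two small layers split across two GPUs; nothing to do with flan-ul2):

import torch
import torch.nn as nn

# Manual model parallelism in plain torch: each half of the network lives on a
# different GPU, and the activations are moved between devices inside forward()
class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to('cuda:0')
        self.part2 = nn.Linear(1024, 1024).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        return self.part2(x.to('cuda:1'))

net = TwoGPUNet()
out = net(torch.randn(8, 1024))  # the output tensor ends up on cuda:1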

Thanks. Is the transformers implementation buggy though? I can’t tell one way or the other.

Transformers itself has a long history, and there are not many bugs in its implementation. I'm not saying there are none.
The problems tend to appear when it is combined with the quantization libraries and accelerate.
accelerate does fairly advanced things, so bugs are more likely to occur there.
The quantization libraries are still being updated frequently, so they are bug-prone as well.

The fastest workaround on the user side is probably to try a different quantization library.
The next best thing is to find a combination of library versions that works well and pin them. The last resort is to go fully manual with torch.
Recently torchao, the official PyTorch quantization library, has been released, so you might also try that.
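Something along these lines, I think; the exact TorchAoConfig arguments depend on your transformers and torchao versions, so treat this as a rough sketch rather than a tested recipe:

from transformers import pipeline, T5ForConditionalGeneration, AutoTokenizer, TorchAoConfig

checkpoint = 'google/flan-ul2'

# Weight-only int8 quantization via torchao (requires the torchao package;
# the string form of the config may differ between versions)
quant_config = TorchAoConfig('int8_weight_only')

model = T5ForConditionalGeneration.from_pretrained(
    checkpoint,
    quantization_config=quant_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
generator = pipeline('text2text-generation', model=model, tokenizer=tokenizer)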