I would like to parallelize generation across GPUs, but also load the model quantized. The code below achieves task 1. How would I also incorporate loading the model in a quantized manner? from transformers import pipeline, T5ForConditionalGeneration, AutoTokenizer, BitsAndBytesConfig from accele…

How to parallelize inference on a quantized model

John6666 September 24, 2024, 10:04pm 2

Whether offloading to CPU works depends on the quantization library. Multi-GPU use does seem to be possible, but I’m not sure whether just specifying device_map with the accelerate library will be enough.
If it’s still unsupported or buggy, I guess you would have to piece together the following information to deal with it…
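For reference, here is a minimal sketch of the two ways a bitsandbytes-quantized T5 is usually placed on several GPUs. It is an illustration under assumptions, not a tested fix for the question's code: `choose_device_map`, `load_quantized_t5`, and `demo` are hypothetical helper names, `google/flan-t5-xl` is only a placeholder checkpoint, and the 4-bit settings are one common choice. With `device_map="auto"` a single process spreads the layers over all visible GPUs; with `device_map={"": process_index}` each process (for example, one per GPU under `accelerate launch`) keeps a full quantized copy on its own GPU.

```python
from typing import Dict, Union

DeviceMap = Union[str, Dict[str, int]]


def choose_device_map(strategy: str, process_index: int = 0) -> DeviceMap:
    """Return a device_map value for from_pretrained (hypothetical helper).

    "shard":     one process, layers spread over all visible GPUs.
    "replicate": one process per GPU, whole model on this process's GPU.
    """
    if strategy == "shard":
        return "auto"
    if strategy == "replicate":
        return {"": process_index}
    raise ValueError(f"unknown strategy: {strategy!r}")


def load_quantized_t5(model_name: str, device_map: DeviceMap):
    """Load a T5 checkpoint in 4-bit with the given device_map."""
    # Heavy imports live here so the helper above works without a GPU.
    import torch
    from transformers import (
        AutoTokenizer,
        BitsAndBytesConfig,
        T5ForConditionalGeneration,
    )

    # Assumed settings: NF4 4-bit weights with float16 compute.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map=device_map,
    )
    return tokenizer, model


def demo() -> None:
    """Needs CUDA GPUs and bitsandbytes; not called automatically."""
    # Placeholder checkpoint name; swap in your own model.
    tokenizer, model = load_quantized_t5(
        "google/flan-t5-xl", choose_device_map("shard")
    )
    print(model.hf_device_map)  # which module landed on which device
    inputs = tokenizer(
        "translate English to German: Hello", return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note that the "shard" variant runs the layers one GPU after another, so it mainly helps a model fit in memory. If the goal is throughput, the "replicate" variant with the prompts split across processes is probably closer to what the question asks for.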

But I don’t really know how people with multiple GPUs handle this.
I’ve seen at least two people on the forum complain that load balancing across multiple GPUs doesn’t work properly, so you may run into the same problem. That might be a bug.
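If the load does end up unevenly spread, one thing that may be worth trying is capping the memory per GPU with the `max_memory` argument and then checking where the modules actually landed. The sketch below only builds and inspects plain dictionaries; `build_max_memory` and `summarize_device_map` are hypothetical helper names, and the numbers are made up for illustration.

```python
from collections import Counter
from typing import Dict


def build_max_memory(num_gpus: int, per_gpu_gib: int) -> Dict[int, str]:
    """Build a max_memory dict such as {0: "20GiB", 1: "20GiB"}."""
    if num_gpus < 1 or per_gpu_gib < 1:
        raise ValueError("num_gpus and per_gpu_gib must be positive")
    return {gpu: f"{per_gpu_gib}GiB" for gpu in range(num_gpus)}


def summarize_device_map(device_map: Dict[str, object]) -> Dict[str, int]:
    """Count how many modules were assigned to each device."""
    return dict(Counter(str(device) for device in device_map.values()))


# Example with a made-up device map of the shape model.hf_device_map has.
example_map = {"shared": 0, "encoder": 0, "decoder": 1, "lm_head": 1}
print(build_max_memory(2, 20))
print(summarize_device_map(example_map))
```

The dictionary from `build_max_memory` would be passed as `max_memory=...` to `from_pretrained` together with `device_map="auto"`, and `summarize_device_map(model.hf_device_map)` then shows whether everything ended up on one GPU. Whether this fixes the imbalance people reported is not something I can confirm.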

Related topics

Topic	Category	Replies	Views	Activity
Loading a HF Model in Multiple GPUs and Run Inferences in those GPUs	🤗Accelerate	10	9567	October 16, 2024
How to avert 'loading checkpoint shards'?	🤗Transformers	4	12391	November 1, 2024
How to load model on multiple GPUs for inference?	Beginners	0	726	September 28, 2023
Loading quantized model on CPU only	🤗Transformers	6	18270	February 3, 2025
T5 inference performance	Models	5	1557	March 8, 2022