Whether offloading to CPU is possible depends on the quantization library. Multi-GPU use does seem to be possible, but I’m not sure whether just specifying device_map with the accelerate library is enough to make it work.
If it’s still unsupported or buggy, I guess you’d have to piece together the following information to work around it…
But well, I don’t really know how people with multiple GPUs set this up.
I’ve seen at least two people on the forum complain that load balancing across multiple GPUs doesn’t work properly, so you may well run into that too. It might be a bug.
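For reference, here is a rough sketch of what I mean by specifying device_map, with a max_memory cap per device as the usual knob for nudging the load balancing. The helper names (build_max_memory, load_sharded) and the memory figures are made up for illustration, and I haven’t tested this on a multi-GPU machine; whether a quantized checkpoint actually honors it depends on the quantization library, as noted above.

```python
def build_max_memory(num_gpus, per_gpu_gib, cpu_gib=None):
    """Build a max_memory dict in the format accelerate / transformers accept.

    Keys are GPU indices (ints) plus an optional "cpu" entry; values are
    size strings such as "10GiB". The numbers themselves are assumptions
    you have to pick for your own hardware.
    """
    max_memory = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    if cpu_gib is not None:
        # Including a "cpu" budget allows layers to spill over to CPU RAM.
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


def load_sharded(model_id, max_memory):
    """Hypothetical wrapper: load a model spread across the listed devices."""
    # Imported lazily so build_max_memory works without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",      # let accelerate decide the layer placement
        max_memory=max_memory,  # cap how much each device may receive
    )


if __name__ == "__main__":
    # Two GPUs capped at 10 GiB each, plus 30 GiB of CPU RAM for offload.
    print(build_max_memory(2, 10, cpu_gib=30))
    # prints {0: '10GiB', 1: '10GiB', 'cpu': '30GiB'}
```

After loading, printing model.hf_device_map shows which device each block ended up on, which is a quick way to check whether the balancing did what you expected.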