Each time I use device_map='auto' via accelerate, I get the following exception:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
The error goes away only when I pass the appropriate no_split_module_classes to the load_checkpoint_and_dispatch function for the model I’m working on. Is there an easy way to determine which blocks should not be split across GPUs for a given model/checkpoint?
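
To make the question concrete, here is a rough sketch of one possible heuristic, not anything accelerate provides: treat the outermost module classes that repeat in the model (typically the transformer blocks) as candidates for no_split_module_classes. The function name candidate_no_split_classes and the set of container names are my own assumptions; the code relies only on the modules() and children() methods that torch.nn.Module offers.

```python
from collections import Counter

# Plain torch containers are treated as transparent wrappers (assumption).
CONTAINERS = {"ModuleList", "ModuleDict", "Sequential"}


def candidate_no_split_classes(model):
    """Guess block classes to keep on a single device (heuristic only)."""

    def has_children(module):
        return next(iter(module.children()), None) is not None

    # Count how often each non-leaf, non-container class occurs in the model.
    counts = Counter(
        type(m).__name__
        for m in model.modules()
        if has_children(m) and type(m).__name__ not in CONTAINERS
    )

    found = []

    def walk(module):
        for child in module.children():
            name = type(child).__name__
            if counts[name] >= 2:
                # Repeated block class: record it and do not look inside it.
                if name not in found:
                    found.append(name)
            else:
                # Leaf, container, or one-off module: keep descending.
                walk(child)

    walk(model)
    return found
```

The returned list could then be passed as no_split_module_classes to load_checkpoint_and_dispatch, but it is only a guess and can be too coarse or too fine for some architectures. For models from the Transformers library, the private class attribute _no_split_modules, where it is defined, may already name the right blocks.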