[SOLVED] What's the right way to do GPU parallelism for inference (not training) on AutoModelForCausalLM?

EDIT: What I wrote below is the right way to do CUDA parallelism on AutoModelForCausalLM and should work without modification; see the resolution at the bottom.


I am trying to load Llama-3.1-70B-Instruct in my two 80GB NVIDIA A100 GPUs.

I can load the model fine by using 8-bit precision and device_map = 'auto':

self.model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-70B-Instruct',
    device_map = 'auto',
    torch_dtype = torch.bfloat16,
    load_in_8bit = True,
)
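
As an aside, on recent transformers versions passing load_in_8bit directly to from_pretrained is deprecated in favour of a BitsAndBytesConfig. A sketch of the equivalent load (assuming bitsandbytes is installed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same 8-bit sharded load, expressed via the quantization_config API.
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-70B-Instruct',
    device_map = 'auto',
    torch_dtype = torch.bfloat16,
    quantization_config = BitsAndBytesConfig(load_in_8bit = True),
)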

After this, the model is split across the two GPUs: the input embeddings and roughly the first half of the layers sit on cuda:0, and the rest (including lm_head) on cuda:1.

> self.model.hf_device_map
{
	'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0,
	'model.layers.35': 1, 'model.layers.36': 1, 'model.layers.37': 1, 'model.layers.38': 1, 'model.layers.39': 1, 'model.layers.40': 1, 'model.layers.41': 1, 'model.layers.42': 1, 'model.layers.43': 1, 'model.layers.44': 1, 'model.layers.45': 1, 'model.layers.46': 1, 'model.layers.47': 1, 'model.layers.48': 1, 'model.layers.49': 1, 'model.layers.50': 1, 'model.layers.51': 1, 'model.layers.52': 1, 'model.layers.53': 1, 'model.layers.54': 1, 'model.layers.55': 1, 'model.layers.56': 1, 'model.layers.57': 1, 'model.layers.58': 1, 'model.layers.59': 1, 'model.layers.60': 1, 'model.layers.61': 1, 'model.layers.62': 1, 'model.layers.63': 1, 'model.layers.64': 1, 'model.layers.65': 1, 'model.layers.66': 1, 'model.layers.67': 1, 'model.layers.68': 1, 'model.layers.69': 1, 'model.layers.70': 1, 'model.layers.71': 1, 'model.layers.72': 1, 'model.layers.73': 1, 'model.layers.74': 1, 'model.layers.75': 1, 'model.layers.76': 1, 'model.layers.77': 1, 'model.layers.78': 1, 'model.layers.79': 1, 'model.norm': 1, 'model.rotary_emb': 1,
	'lm_head': 1,
}
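
Based on this map, the inputs need to go to whatever device holds model.embed_tokens (cuda:0 here). A minimal sketch for looking that up instead of hard-coding it, assuming the hf_device_map layout above:

# The device map values are GPU indices; the input embeddings mark
# where the input tensors must live.
embed_device = self.model.hf_device_map['model.embed_tokens']
ids = ids.to(f'cuda:{embed_device}')
masks = masks.to(f'cuda:{embed_device}')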

However, moving the input tensors to cuda:0 and calling generate still fails:

ids = ids.to('cuda:0')
masks = masks.to('cuda:0')
outputs = self.model.generate(                                                        
    input_ids = ids,                                                                      
    attention_mask = masks,                                                               
    max_new_tokens = self.max_length,                                                     
    stop_strings = ['.', '\n'],                                                           
    do_sample = False,                                                                    
    tokenizer = self.tokenizer,                                                       
    output_logits = True,                                                                 
    return_dict_in_generate = True,                                                       
    temperature = None,                                                                   
    top_p = None,                                                                         
)

RuntimeError('Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)')

I assume this error happens at the boundary between cuda:0 and cuda:1 inside the model.

Is there any simple way to make device_map = 'auto' work with AutoModelForCausalLM and model.generate? Otherwise, what's the simplest API I can use to run a single Llama model across both GPUs?

My bad! What I posted was actually the correct way to do GPU parallelism, and it works as-is.

When I was running this I didn't realise I had left a stray self.model = self.model.to('cuda') elsewhere in my codebase, which was probably clobbering the multi-GPU placement that device_map = 'auto' had set up.
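
For completeness, here is a minimal sketch of the working end-to-end flow (with my extra generate options stripped out); the key point is that nothing calls .to('cuda') on the model after it has been loaded with device_map = 'auto':

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'meta-llama/Meta-Llama-3.1-70B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# accelerate shards the model across both GPUs; do NOT follow this
# with model.to('cuda'), which undoes the placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map = 'auto',
    torch_dtype = torch.bfloat16,
    load_in_8bit = True,
)

# Inputs go to the device holding the input embeddings (cuda:0 here);
# the dispatch hooks move activations between GPUs inside generate.
inputs = tokenizer('The capital of France is', return_tensors = 'pt').to('cuda:0')
outputs = model.generate(**inputs, max_new_tokens = 32, do_sample = False)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))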