EDIT: What I wrote here is the right way to do CUDA parallelisation with AutoModelForCausalLM and should work without any modification.
I am trying to load Llama-3.1-70B-Instruct on my two 80GB NVIDIA A100 GPUs. I can load the model fine using 8-bit precision and device_map = 'auto':
self.model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-70B-Instruct',
    device_map = 'auto',
    torch_dtype = torch.bfloat16,
    load_in_8bit = True,
)
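For reference, I believe newer transformers releases deprecate passing load_in_8bit directly and prefer a quantization config; this is only a minimal sketch of that variant (assuming bitsandbytes is installed), not what I am actually running:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit quantization config; replaces the load_in_8bit flag on newer versions
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-70B-Instruct',
    device_map = 'auto',                  # let accelerate shard the layers across both GPUs
    torch_dtype = torch.bfloat16,         # dtype used for the non-quantized modules
    quantization_config = quant_config,
)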
After this, the front of the model (the embeddings and first layers) is, I think, on cuda:0:
> self.model.hf_device_map
{
'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0,
'model.layers.35': 1, 'model.layers.36': 1, 'model.layers.37': 1, 'model.layers.38': 1, 'model.layers.39': 1, 'model.layers.40': 1, 'model.layers.41': 1, 'model.layers.42': 1, 'model.layers.43': 1, 'model.layers.44': 1, 'model.layers.45': 1, 'model.layers.46': 1, 'model.layers.47': 1, 'model.layers.48': 1, 'model.layers.49': 1, 'model.layers.50': 1, 'model.layers.51': 1, 'model.layers.52': 1, 'model.layers.53': 1, 'model.layers.54': 1, 'model.layers.55': 1, 'model.layers.56': 1, 'model.layers.57': 1, 'model.layers.58': 1, 'model.layers.59': 1, 'model.layers.60': 1, 'model.layers.61': 1, 'model.layers.62': 1, 'model.layers.63': 1, 'model.layers.64': 1, 'model.layers.65': 1, 'model.layers.66': 1, 'model.layers.67': 1, 'model.layers.68': 1, 'model.layers.69': 1, 'model.layers.70': 1, 'model.layers.71': 1, 'model.layers.72': 1, 'model.layers.73': 1, 'model.layers.74': 1, 'model.layers.75': 1, 'model.layers.76': 1, 'model.layers.77': 1, 'model.layers.78': 1, 'model.layers.79': 1, 'model.norm': 1, 'model.rotary_emb': 1,
'lm_head': 1,
}
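For what it's worth, the input device can also be read from this map instead of hard-coding cuda:0; a sketch, assuming the map values are plain GPU indices as shown above:

first_device = self.model.hf_device_map['model.embed_tokens']  # 0 in the map above
ids = ids.to(f'cuda:{first_device}')
masks = masks.to(f'cuda:{first_device}')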
However, moving the input tensors to cuda:0 and running the model still fails:
ids = ids.to('cuda:0')
masks = masks.to('cuda:0')
outputs = self.model.generate(
input_ids = ids,
attention_mask = masks,
max_new_tokens = self.max_length,
stop_strings = ['.', '\n'],
do_sample = False,
tokenizer = self.tokenizer,
output_logits = True,
return_dict_in_generate = True,
temperature = None,
top_p = None,
)
RuntimeError('Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)')
I assume this error happens at the boundary between cuda:0 and cuda:1 inside the model.
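To check that assumption, a quick way to see where the weights actually ended up (and whether anything landed on cpu or meta, which can also trigger this kind of error) is to count parameter devices; a rough sketch:

from collections import Counter

# Count how many parameters live on each device after dispatch
device_counts = Counter(str(p.device) for p in self.model.parameters())
print(device_counts)   # I would expect only cuda:0 and cuda:1 here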
Is there any simple way to make device_map = 'auto' work for AutoModelForCausalLM and model.generate? Otherwise, what's the simplest API that I can use to run a single Llama model in parallel?