Running inference on flan-ul2 on multi-GPU

It seems like a lot of people have also had issues running flan-ul2 on multi-GPU setups… I am currently trying to run it in a notebook on SageMaker with a g4dn.12xlarge, which has 4 T4 GPUs.

I load the model with the following, since `device_map="auto"` on its own doesn’t seem to work and gives an OOM on the first GPU:

```python
from transformers import AutoModelForSeq2SeqLM

max_memory_mapping = {0: "8GB", 1: "8GB", 2: "8GB", 3: "8GB"}
logger.info(max_memory_mapping)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2", torch_dtype="auto", device_map="auto",
    max_memory=max_memory_mapping, load_in_8bit=True,
)
```

Still, I get the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:1!

This is my device_map:

```
{'shared': 0,
'lm_head': 0,
'encoder.embed_tokens': 0,
'encoder.block.0': 0,
'encoder.block.1': 0,
'encoder.block.2': 0,
'encoder.block.3': 0,
'encoder.block.4': 0,
'encoder.block.5': 0,
'encoder.block.6': 0,
'encoder.block.7': 0,
'encoder.block.8': 0,
'encoder.block.9': 0,
'encoder.block.10': 0,
'encoder.block.11': 0,
'encoder.block.12': 0,
'encoder.block.13': 0,
'encoder.block.14': 0,
'encoder.block.15': 0,
'encoder.block.16': 0,
'encoder.block.17': 0,
'encoder.block.18': 0,
'encoder.block.19': 0,
'encoder.block.20': 0,
'encoder.block.21': 0,
'encoder.block.22': 0,
'encoder.block.23': 0,
'encoder.block.24': 0,
'encoder.block.25': 0,
'encoder.block.26': 0,
'encoder.block.27': 0,
'encoder.block.28': 1,
'encoder.block.29': 1,
'encoder.block.30': 1,
'encoder.block.31': 1,
'encoder.final_layer_norm': 1,
'encoder.dropout': 1,
'decoder.embed_tokens': 1,
'decoder.block.0': 1,
'decoder.block.1': 1,
'decoder.block.2': 1,
'decoder.block.3': 1,
'decoder.block.4': 1,
'decoder.block.5': 1,
'decoder.block.6': 1,
'decoder.block.7': 1,
'decoder.block.8': 1,
'decoder.block.9': 1,
'decoder.block.10': 1,
'decoder.block.11': 1,
'decoder.block.12': 1,
'decoder.block.13': 1,
'decoder.block.14': 1,
'decoder.block.15': 1,
'decoder.block.16': 1,
'decoder.block.17': 1,
'decoder.block.18': 1,
'decoder.block.19': 1,
'decoder.block.20': 2,
'decoder.block.21': 2,
'decoder.block.22': 2,
'decoder.block.23': 2,
'decoder.block.24': 2,
'decoder.block.25': 2,
'decoder.block.26': 2,
'decoder.block.27': 2,
'decoder.block.28': 2,
'decoder.block.29': 2,
'decoder.block.30': 2,
'decoder.block.31': 2,
'decoder.final_layer_norm': 2,
'decoder.dropout': 2}
```

@philschmid I think you might know the answer to this, since I saw you’ve worked a lot on deploying flan-ul2 on multi-GPU?


Wish I could provide a solution, but just dropping in to say I’m having a nearly identical issue. I’m working on a system with 16x V100s. Loading the model works fine, but I encounter the “Expected all tensors” error when attempting to run inference. Code below, for reference. Note I’ve hit the same issue with other models too…not sure if it’s a problem with accelerate or bitsandbytes :man_shrugging:.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained('google/flan-ul2',
                                          cache_dir = './models')
model = T5ForConditionalGeneration.from_pretrained('google/flan-ul2',
                                                   cache_dir = './models',
                                                   device_map = 'auto',
                                                   load_in_8bit = True)
input_string = 'Answer the following question by reasoning step by step. I start with 10 bananas. A monkey eats three of them, and then gives me an avocado. How many bananas do I have left?'
inputs = tokenizer(input_string, return_tensors = 'pt').to('cuda:0')
outputs = model.generate(inputs['input_ids'], max_length = 200)
print(tokenizer.decode(outputs[0]))
```

Error traceback:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 3
      1 input_string = 'Answer the following question by reasoning step by step. I start with 10 bananas. A monkey eats three of them, and then gives me an avocado. How many bananas do I have left?'
      2 inputs = tokenizer(input_string, return_tensors = 'pt').to('cuda:0')
----> 3 outputs = model.generate(inputs['input_ids'], max_length = 200)
      4 print(tokenizer.decode(outputs[0]))

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/transformers/generation/utils.py:1391, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
   1385         raise ValueError(
   1386             f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing"
   1387             " greedy search."
   1388         )
   1390     # 11. run greedy search
-> 1391     return self.greedy_search(
   1392         input_ids,
   1393         logits_processor=logits_processor,
   1394         stopping_criteria=stopping_criteria,
   1395         pad_token_id=generation_config.pad_token_id,
   1396         eos_token_id=generation_config.eos_token_id,
   1397         output_scores=generation_config.output_scores,
   1398         return_dict_in_generate=generation_config.return_dict_in_generate,
   1399         synced_gpus=synced_gpus,
   1400         **model_kwargs,
   1401     )
   1403 elif is_contrastive_search_gen_mode:
   1404     if generation_config.num_return_sequences > 1:

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/transformers/generation/utils.py:2179, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
   2176 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2178 # forward pass to get next token
-> 2179 outputs = self(
   2180     **model_inputs,
   2181     return_dict=True,
   2182     output_attentions=output_attentions,
   2183     output_hidden_states=output_hidden_states,
   2184 )
   2186 if synced_gpus and this_peer_finished:
   2187     continue  # don't waste resources running the code we don't need

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/accelerate/hooks.py:158, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    156         output = old_forward(*args, **kwargs)
    157 else:
--> 158     output = old_forward(*args, **kwargs)
    159 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1691, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1686 if self.config.tie_word_embeddings:
   1687     # Rescale output before projecting on vocab
   1688     # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
   1689     sequence_output = sequence_output * (self.model_dim**-0.5)
-> 1691 lm_logits = self.lm_head(sequence_output)
   1693 loss = None
   1694 if labels is not None:

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/accelerate/hooks.py:158, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    156         output = old_forward(*args, **kwargs)
    157 else:
--> 158     output = old_forward(*args, **kwargs)
    159 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_mm)
```

@sgugger it seems like you and some other team members were working on this issue in this transformers PR. Any advice on how we should proceed here?

@imiraoui I found a working solution from another user here. In summary, because T5 models have residual connections, device_map = 'auto' can cause inference errors when the layers joined by those connections end up on different GPUs. Hope you can make it work for you!
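
In case it helps, the core of that kind of workaround looks roughly like the sketch below (the memory limits and device-map keys are illustrative; inspect the inferred map on your own setup before using it):

```python
# Sketch: build the device map yourself so whole T5Blocks stay on one GPU,
# and keep lm_head on the same GPU as the tied embedding weights.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/flan-ul2")
with init_empty_weights():
    empty_model = AutoModelForSeq2SeqLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", 1: "10GiB", 2: "10GiB", 3: "10GiB"},  # adjust to your GPUs
    no_split_module_classes=["T5Block"],  # never split a residual block across devices
)
# If this key isn't present, print the inferred map and use whichever device
# holds the decoder embeddings.
device_map["lm_head"] = device_map["decoder.embed_tokens"]

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    device_map=device_map,
    load_in_8bit=True,
)
```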


Hi @ZQ-Dev
Thanks a lot for linking this solution.
Do you have any idea why it uses the argument no_split_module_classes=["T5Block"] in line 14? That seems very specific to T5, and I’m trying to see whether this solution can be generalised to other models or if it only applies here. Either way, it would be good to understand what it does.

Hi @AndreaSottana. The purpose of that entire line is to build a custom device map by looking at the architecture of the model you are going to split across your GPUs. Model architectures can differ wildly, so there is no “one-size-fits-all” device map, hence the need to infer it from the target model.

This may not be 100% accurate, but my understanding of the no_split_module_classes parameter is that it gives you the opportunity to tell infer_auto_device_map() what the primary building block of the target model is. This will ensure that these building blocks (in this case, T5 Blocks, because this model is based on the T5 architecture) will not be split across GPUs.

This isn’t really generalizable to other model architectures, but it’s not too difficult to adjust the code as needed. For example, if you were making a device map for a Bloom model, you would replace ["T5Block"] with ["BloomBlock"] as the argument and be good to go. For other models, you have to dig around in the model authors’ code (provided you have access) to figure out what they called their building blocks. A CTRL+F search for “no_split” usually does the trick ;).

Hope that helps!
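
To make that concrete, a rough sketch for a Bloom checkpoint might look like this (the checkpoint name here is just for illustration):

```python
# Same recipe, different architecture: only no_split_module_classes changes.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-7b1")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["BloomBlock"],  # Bloom's primary building block
)
print(device_map)
```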


Hi @ZQ-Dev
Thanks for your answer.
I just couldn’t understand how you got to "T5Block" or "BloomBlock" from the model itself, but I figured out that it is enough to print the model representation (i.e. look at repr(model)): you can then see what the blocks are called without needing to read the modelling code for each model.
Thanks again
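
For reference, something as simple as this shows the block names (using a small checkpoint just for inspection):

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # small checkpoint, just to inspect the layout
print(model)  # the module tree shows entries like (0): T5Block(...)
print(type(model.encoder.block[0]).__name__)  # -> 'T5Block'
```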


That’s a great tip that I wasn’t aware of! Thank you for sharing :slight_smile:

@ZQ-Dev
Any idea if this should be implementable on t5-large?

`device_map['lm_head'] = device_map["decoder.embed_tokens"]`

doesn’t seem to do the trick; I’m still getting CUDA errors. My device map is:

```
{'shared': 0, 'decoder.embed_tokens': 0, 'encoder': 0,
 'decoder.block.0': 0, 'decoder.block.1': 0, 'decoder.block.2': 0, 'decoder.block.3': 0,
 'decoder.block.4': 1, 'decoder.block.5': 1, 'decoder.block.6': 1, 'decoder.block.7': 1,
 'decoder.block.8': 1, 'decoder.block.9': 1, 'decoder.block.10': 1, 'decoder.block.11': 1,
 'decoder.block.12': 1, 'decoder.block.13': 1, 'decoder.block.14': 1, 'decoder.block.15': 1,
 'decoder.block.16': 1, 'decoder.block.17': 1, 'decoder.block.18': 1, 'decoder.block.19': 1,
 'decoder.block.20': 1, 'decoder.block.21': 1, 'decoder.block.22': 1, 'decoder.block.23': 1,
 'decoder.final_layer_norm': 1, 'decoder.dropout': 1, 'lm_head': 0}
```
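
For context, I’m building the map roughly like this (a sketch; the exact details of my script may differ):

```python
# Roughly what I'm running for t5-large (still hitting the CUDA error):
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, T5ForConditionalGeneration

config = AutoConfig.from_pretrained("t5-large")
with init_empty_weights():
    empty_model = T5ForConditionalGeneration(config)

device_map = infer_auto_device_map(empty_model, no_split_module_classes=["T5Block"])
device_map["lm_head"] = device_map["decoder.embed_tokens"]  # keep lm_head with the tied embeddings

model = T5ForConditionalGeneration.from_pretrained("t5-large", device_map=device_map)
```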

What CUDA errors are you seeing? Not sure why this wouldn’t be possible with t5-large.