Running inference on flan-ul2 on multi-GPU

It seems like a lot of people have also had issues running flan-ul2 on multi-GPU setups… I am currently trying to run it in a notebook on SageMaker with a g4dn.12xlarge, which has 4 T4 GPUs.

I load the model with the following, since `device_map="auto"` on its own doesn’t seem to work and gives an OOM on the first GPU:

```python
from transformers import AutoModelForSeq2SeqLM

max_memory_mapping = {0: "8GB", 1: "8GB", 2: "8GB", 3: "8GB"}
logger.info(max_memory_mapping)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2", torch_dtype="auto", device_map="auto",
    max_memory=max_memory_mapping, load_in_8bit=True,
)
```

Still, I get the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:1!

This is my device_map:

```
{'shared': 0,
'lm_head': 0,
'encoder.embed_tokens': 0,
'encoder.block.0': 0,
'encoder.block.1': 0,
'encoder.block.2': 0,
'encoder.block.3': 0,
'encoder.block.4': 0,
'encoder.block.5': 0,
'encoder.block.6': 0,
'encoder.block.7': 0,
'encoder.block.8': 0,
'encoder.block.9': 0,
'encoder.block.10': 0,
'encoder.block.11': 0,
'encoder.block.12': 0,
'encoder.block.13': 0,
'encoder.block.14': 0,
'encoder.block.15': 0,
'encoder.block.16': 0,
'encoder.block.17': 0,
'encoder.block.18': 0,
'encoder.block.19': 0,
'encoder.block.20': 0,
'encoder.block.21': 0,
'encoder.block.22': 0,
'encoder.block.23': 0,
'encoder.block.24': 0,
'encoder.block.25': 0,
'encoder.block.26': 0,
'encoder.block.27': 0,
'encoder.block.28': 1,
'encoder.block.29': 1,
'encoder.block.30': 1,
'encoder.block.31': 1,
'encoder.final_layer_norm': 1,
'encoder.dropout': 1,
'decoder.embed_tokens': 1,
'decoder.block.0': 1,
'decoder.block.1': 1,
'decoder.block.2': 1,
'decoder.block.3': 1,
'decoder.block.4': 1,
'decoder.block.5': 1,
'decoder.block.6': 1,
'decoder.block.7': 1,
'decoder.block.8': 1,
'decoder.block.9': 1,
'decoder.block.10': 1,
'decoder.block.11': 1,
'decoder.block.12': 1,
'decoder.block.13': 1,
'decoder.block.14': 1,
'decoder.block.15': 1,
'decoder.block.16': 1,
'decoder.block.17': 1,
'decoder.block.18': 1,
'decoder.block.19': 1,
'decoder.block.20': 2,
'decoder.block.21': 2,
'decoder.block.22': 2,
'decoder.block.23': 2,
'decoder.block.24': 2,
'decoder.block.25': 2,
'decoder.block.26': 2,
'decoder.block.27': 2,
'decoder.block.28': 2,
'decoder.block.29': 2,
'decoder.block.30': 2,
'decoder.block.31': 2,
'decoder.final_layer_norm': 2,
'decoder.dropout': 2}
```

@philschmid I think you might know the answer to this, since I saw you’ve worked a lot on deploying flan-ul2 on multi-GPU?


Wish I could provide a solution, but just dropping in to say I’m having a nearly identical issue. I’m working on a system with 16x V100s. Loading the model works fine, but I encounter the “Expected all tensors” error when attempting to run inference. Code below, for reference. Note I’ve hit the same issue with other models too…not sure if it’s a problem with accelerate or bitsandbytes :man_shrugging:.

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained('google/flan-ul2',
                                          cache_dir = './models')
model = T5ForConditionalGeneration.from_pretrained('google/flan-ul2',
                                                   cache_dir = './models',
                                                   device_map = 'auto',
                                                   load_in_8bit = True)
input_string = 'Answer the following question by reasoning step by step. I start with 10 bananas. A monkey eats three of them, and then gives me an avocado. How many bananas do I have left?'
inputs = tokenizer(input_string, return_tensors = 'pt').to('cuda:0')
outputs = model.generate(inputs['input_ids'], max_length = 200)
print(tokenizer.decode(outputs[0]))
```

Error traceback:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 3
      1 input_string = 'Answer the following question by reasoning step by step. I start with 10 bananas. A monkey eats three of them, and then gives me an avocado. How many bananas do I have left?'
      2 inputs = tokenizer(input_string, return_tensors = 'pt').to('cuda:0')
----> 3 outputs = model.generate(inputs['input_ids'], max_length = 200)
      4 print(tokenizer.decode(outputs[0]))

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/transformers/generation/utils.py:1391, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
   1385         raise ValueError(
   1386             f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing"
   1387             " greedy search."
   1388         )
   1390     # 11. run greedy search
-> 1391     return self.greedy_search(
   1392         input_ids,
   1393         logits_processor=logits_processor,
   1394         stopping_criteria=stopping_criteria,
   1395         pad_token_id=generation_config.pad_token_id,
   1396         eos_token_id=generation_config.eos_token_id,
   1397         output_scores=generation_config.output_scores,
   1398         return_dict_in_generate=generation_config.return_dict_in_generate,
   1399         synced_gpus=synced_gpus,
   1400         **model_kwargs,
   1401     )
   1403 elif is_contrastive_search_gen_mode:
   1404     if generation_config.num_return_sequences > 1:

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/transformers/generation/utils.py:2179, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
   2176 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2178 # forward pass to get next token
-> 2179 outputs = self(
   2180     **model_inputs,
   2181     return_dict=True,
   2182     output_attentions=output_attentions,
   2183     output_hidden_states=output_hidden_states,
   2184 )
   2186 if synced_gpus and this_peer_finished:
   2187     continue  # don't waste resources running the code we don't need

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/accelerate/hooks.py:158, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    156         output = old_forward(*args, **kwargs)
    157 else:
--> 158     output = old_forward(*args, **kwargs)
    159 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1691, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1686 if self.config.tie_word_embeddings:
   1687     # Rescale output before projecting on vocab
   1688     # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
   1689     sequence_output = sequence_output * (self.model_dim**-0.5)
-> 1691 lm_logits = self.lm_head(sequence_output)
   1693 loss = None
   1694 if labels is not None:

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/accelerate/hooks.py:158, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    156         output = old_forward(*args, **kwargs)
    157 else:
--> 158     output = old_forward(*args, **kwargs)
    159 return module._hf_hook.post_forward(module, output)

File ~/.conda/envs/llm_lab/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_mm)
```

@sgugger it seems like you and some other team members were working on this issue in this transformers PR. Any advice on how we should proceed here?

@imiraoui I found a working solution from another user here. In summary, because T5 models have residual connections, device_map = 'auto' can cause inference errors when the layers joined by those connections end up on different GPUs. Hope you can make it work for you!
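
In case it helps, the core of that kind of workaround looks roughly like the sketch below (the memory limits and device-map keys are illustrative; inspect the inferred map on your own setup before using it):

```python
# Sketch: build the device map yourself so whole T5Blocks stay on one GPU,
# and keep lm_head on the same GPU as the tied embedding weights.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/flan-ul2")
with init_empty_weights():
    empty_model = AutoModelForSeq2SeqLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", 1: "10GiB", 2: "10GiB", 3: "10GiB"},  # adjust to your GPUs
    no_split_module_classes=["T5Block"],  # never split a residual block across devices
)
# If this key isn't present, print the inferred map and use whichever device
# holds the decoder embeddings.
device_map["lm_head"] = device_map["decoder.embed_tokens"]

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    device_map=device_map,
    load_in_8bit=True,
)
```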


Hi @ZQ-Dev
Thanks a lot for linking this solution.
Do you have any idea why it uses the argument no_split_module_classes=["T5Block"] in line 14? That seems very specific to T5, and I’m trying to see whether this solution can be generalised to other models or if it only applies here. Either way, it would be good to understand what it does.

Hi @AndreaSottana. The purpose of that entire line is to build a custom device map by looking at the architecture of the model you are going to split across your GPUs. Model architectures can differ wildly, so there is no “one-size-fits-all” device map, hence the need to infer it from the target model.

This may not be 100% accurate, but my understanding of the no_split_module_classes parameter is that it gives you the opportunity to tell infer_auto_device_map() what the primary building block of the target model is. This will ensure that these building blocks (in this case, T5 Blocks, because this model is based on the T5 architecture) will not be split across GPUs.

This isn’t really generalizable to other model architectures, but it’s not too difficult to adjust the code as needed. For example, if you were making a device map for a Bloom model, you would replace ["T5Block"] with ["BloomBlock"] as the argument and be good to go. For other models, you have to dig around in the model authors’ code (provided you have access) to figure out what they called their building blocks. A CTRL+F search for “no_split” usually does the trick ;).

Hope that helps!
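
To make that concrete, a rough sketch for a Bloom checkpoint might look like this (the checkpoint name here is just for illustration):

```python
# Same recipe, different architecture: only no_split_module_classes changes.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-7b1")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["BloomBlock"],  # Bloom's primary building block
)
print(device_map)
```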


Hi @ZQ-Dev
Thanks for your answer.
I just couldn’t understand how you got to "T5Block" or "BloomBlock" from the model itself, but I figured out that it is enough to print the model representation (i.e. look at repr(model)): you can then see what the blocks are called without needing to read the modelling code for each model.
Thanks again
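
For reference, something as simple as this shows the block names (using a small checkpoint just for inspection):

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # small checkpoint, just to inspect the layout
print(model)  # the module tree shows entries like (0): T5Block(...)
print(type(model.encoder.block[0]).__name__)  # -> 'T5Block'
```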


That’s a great tip that I wasn’t aware of! Thank you for sharing :slight_smile:

@ZQ-Dev
Any idea if this should be implementable on t5-large?

`device_map['lm_head'] = device_map["decoder.embed_tokens"]`

doesn’t seem to do the trick; I’m still getting CUDA errors. My device map is:

```
{'shared': 0, 'decoder.embed_tokens': 0, 'encoder': 0,
 'decoder.block.0': 0, 'decoder.block.1': 0, 'decoder.block.2': 0, 'decoder.block.3': 0,
 'decoder.block.4': 1, 'decoder.block.5': 1, 'decoder.block.6': 1, 'decoder.block.7': 1,
 'decoder.block.8': 1, 'decoder.block.9': 1, 'decoder.block.10': 1, 'decoder.block.11': 1,
 'decoder.block.12': 1, 'decoder.block.13': 1, 'decoder.block.14': 1, 'decoder.block.15': 1,
 'decoder.block.16': 1, 'decoder.block.17': 1, 'decoder.block.18': 1, 'decoder.block.19': 1,
 'decoder.block.20': 1, 'decoder.block.21': 1, 'decoder.block.22': 1, 'decoder.block.23': 1,
 'decoder.final_layer_norm': 1, 'decoder.dropout': 1, 'lm_head': 0}
```
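
For context, I’m building the map roughly like this (a sketch; the exact details of my script may differ):

```python
# Roughly what I'm running for t5-large (still hitting the CUDA error):
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, T5ForConditionalGeneration

config = AutoConfig.from_pretrained("t5-large")
with init_empty_weights():
    empty_model = T5ForConditionalGeneration(config)

device_map = infer_auto_device_map(empty_model, no_split_module_classes=["T5Block"])
device_map["lm_head"] = device_map["decoder.embed_tokens"]  # keep lm_head with the tied embeddings

model = T5ForConditionalGeneration.from_pretrained("t5-large", device_map=device_map)
```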

What CUDA errors are you seeing? Not sure why this wouldn’t be possible with t5-large.