Bug with model.generate if max_length or max_new_tokens are set, with accelerate deepspeed zero level 3

JulesGM · February 21, 2023, 3:17am

model.generate fails if max_length or max_new_tokens are set, with accelerate deepspeed zero level 3.

I use transformers.T5ForConditionalGeneration with the google/flan-t5-* models, usually on a DGX A100 node.

It seems that when a process finishes generating before the others (which almost always happens), the others get stuck waiting for it forever in a barrier. I was wondering if that was a known issue.

Everything works fine if each process has the same inputs, which makes sense as all processes finish at the same time. Somehow, everything also works fine if no value is passed for max_new_tokens or max_length and the default value of 20 is used.
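To make the suspected mechanism concrete, here is a toy sketch in plain Python (no torch or DeepSpeed involved). It assumes that under ZeRO stage 3 every forward pass triggers a parameter all-gather that all ranks must join, so ranks that make different numbers of forward calls end up stalled. The names `forward_calls` and `would_hang` and the step counts are hypothetical, chosen only for illustration.

```python
def forward_calls(steps_to_eos, max_new_tokens, synced):
    """Return how many forward passes each rank makes in a toy decode loop.

    steps_to_eos[i] is the (hypothetical) step at which rank i emits EOS.
    """
    # On its own, each rank stops at EOS or at the token cap, whichever comes first.
    own = [min(steps, max_new_tokens) for steps in steps_to_eos]
    if synced:
        # Synced mode: every rank keeps calling forward until the slowest rank is done.
        return [max(own)] * len(own)
    return own


def would_hang(calls):
    # A collective needs every rank at every step, so unequal counts mean a stall.
    return len(set(calls)) > 1


steps = [31, 57]  # hypothetical: rank 0 hits EOS after 31 steps, rank 1 after 57

unsynced = forward_calls(steps, max_new_tokens=100, synced=False)
synced = forward_calls(steps, max_new_tokens=100, synced=True)
short_cap = forward_calls(steps, max_new_tokens=20, synced=False)

print(unsynced, would_hang(unsynced))
print(synced, would_hang(synced))
print(short_cap, would_hang(short_cap))
```

Under these assumptions, the last case gives one plausible reading of why the default length of 20 appears to work: if no rank reaches EOS before the cap, all ranks stop at the same step anyway. This is a guess about the cause, not something confirmed in this thread.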

accelerate               0.16.0
deepspeed                0.8.1
pytorch-triton           2.0.0+c8bfe3f548
torch                    1.12.1+cu113
torchaudio               0.12.1+cu113
torchtyping              0.1.4
torchvision              0.13.1+cu113
transformers             4.26.1

JulesGM · February 21, 2023, 3:19am

This is the end of the error message. You can see that rank 1 generates (and prints) its text, then rank 0 fails at a barrier. With more processes the same thing happens: one rank finishes first, then the others all fail at group._allgather_base(output_tensor, input_tensor), line 2136 of torch/distributed/distributed_c10d.py.


[02/20/23 22:13:14] INFO     [1/2] __main__ - batch['input_ids'].shape = torch.Size([3, 238])                     test_accelerate.py:97
                    INFO     [1/2] __main__ - <accelerate.data_loader.DataLoaderShard object at 0x7f3c80115a00>   test_accelerate.py:99
                    INFO     [1/2] __main__ - {'input_ids': torch.Size([3, 238]), 'attention_mask':              test_accelerate.py:103
                             torch.Size([3, 238])}
                    INFO     [1/2] __main__ - dict_keys(['input_ids', 'attention_mask'])                         test_accelerate.py:105
                    INFO     [1/2] __main__ - max_new_tokens = 100                                               
[02/20/23 22:13:18] INFO     [1/2] __main__ - torch.Size([3, 31])                                                test_accelerate.py:113
                    INFO     [1/2] __main__ -   GENERATED TEXT:                                                              test_accelerate.py:114
                                     - amet labore voluptatem consectetur aliquam quiquia.</s>
                                     -  Sit adipisci neque tempora amet ipsum tempora aliquam.</s>
                                     -  etincidunt</s>
[E ProcessGroupGloo.cpp:2791] [Rank 0]: Rank 1 failed to pass monitoredBarrier in 1800000 ms
[E ProcessGroupGloo.cpp:136] [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 1800000 ms
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/mila/g/gagnonju/Marg-Li-CoT/with_trlx/test_accelerate.py:121 in <module>                   │
│                                                                                                  │
│   118                                                                                            │
│   119                                                                                            │
│   120 if __name__ == "__main__":                                                                 │
│ ❱ 121 │   fire.Fire(main)                                                                        │
│   122                                                                                            │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/fire/core.py:141 in Fire                 │
│                                                                                                  │
│   138 │   context.update(caller_globals)                                                         │
│   139 │   context.update(caller_locals)                                                          │
│   140                                                                                            │
│ ❱ 141   component_trace = _Fire(component, args, parsed_flag_args, context, name)                │
│   142                                                                                            │
│   143   if component_trace.HasError():                                                           │
│   144 │   _DisplayError(component_trace)                                                         │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/fire/core.py:475 in _Fire                │
│                                                                                                  │
│   472 │     is_class = inspect.isclass(component)                                                │
│   473 │                                                                                          │
│   474 │     try:                                                                                 │
│ ❱ 475 │   │   component, remaining_args = _CallAndUpdateTrace(                                   │
│   476 │   │   │   component,                                                                     │
│   477 │   │   │   remaining_args,                                                                │
│   478 │   │   │   component_trace,                                                               │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/fire/core.py:691 in _CallAndUpdateTrace  │
│                                                                                                  │
│   688 │   loop = asyncio.get_event_loop()                                                        │
│   689 │   component = loop.run_until_complete(fn(*varargs, **kwargs))                            │
│   690   else:                                                                                    │
│ ❱ 691 │   component = fn(*varargs, **kwargs)                                                     │
│   692                                                                                            │
│   693   if treatment == 'class':                                                                 │
│   694 │   action = trace.INSTANTIATED_CLASS                                                      │
│                                                                                                  │
│ /home/mila/g/gagnonju/Marg-Li-CoT/with_trlx/test_accelerate.py:109 in main                       │
│                                                                                                  │
│   106 │   LOGGER.info(f"{max_new_tokens = }")                                                    │
│   107 │   a9r.wait_for_everyone()                                                                │
│   108 │   with torch.no_grad():                                                                  │
│ ❱ 109 │   │   output = model.generate(                                                           │
│   110 │   │   │   **batch,                                                                       │
│   111 │   │   │   max_length=max_new_tokens,                                                     │
│   112 │   │   )                                                                                  │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in        │
│ decorate_context                                                                                 │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/transformers/generation/utils.py:1391 in │
│ generate                                                                                         │
│                                                                                                  │
│   1388 │   │   │   │   )                                                                         │
│   1389 │   │   │                                                                                 │
│   1390 │   │   │   # 11. run greedy search                                                       │
│ ❱ 1391 │   │   │   return self.greedy_search(                                                    │
│   1392 │   │   │   │   input_ids,                                                                │
│   1393 │   │   │   │   logits_processor=logits_processor,                                        │
│   1394 │   │   │   │   stopping_criteria=stopping_criteria,                                      │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/transformers/generation/utils.py:2179 in │
│ greedy_search                                                                                    │
│                                                                                                  │
│   2176 │   │   │   model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)  │
│   2177 │   │   │                                                                                 │
│   2178 │   │   │   # forward pass to get next token                                              │
│ ❱ 2179 │   │   │   outputs = self(                                                               │
│   2180 │   │   │   │   **model_inputs,                                                           │
│   2181 │   │   │   │   return_dict=True,                                                         │
│   2182 │   │   │   │   output_attentions=output_attentions,                                      │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/torch/nn/modules/module.py:1137 in       │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1134 │   │   │   full_backward_hooks, non_full_backward_hooks = self._get_backward_hooks()     │
│   1135 │   │   if _global_forward_pre_hooks or self._forward_pre_hooks:                          │
│   1136 │   │   │   for hook in (*_global_forward_pre_hooks.values(), *self._forward_pre_hooks.v  │
│ ❱ 1137 │   │   │   │   result = hook(self, input)                                                │
│   1138 │   │   │   │   if result is not None:                                                    │
│   1139 │   │   │   │   │   if not isinstance(result, tuple):                                     │
│   1140 │   │   │   │   │   │   result = (result,)                                                │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:9 in wrapped_fn  │
│                                                                                                  │
│    6 │   function call."""                                                                       │
│    7 │   def wrapped_fn(*args, **kwargs):                                                        │
│    8 │   │   get_accelerator().range_push(func.__qualname__)                                     │
│ ❱  9 │   │   ret_val = func(*args, **kwargs)                                                     │
│   10 │   │   get_accelerator().range_pop()                                                       │
│   11 │   │   return ret_val                                                                      │
│   12                                                                                             │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload │
│ .py:348 in _pre_forward_module_hook                                                              │
│                                                                                                  │
│   345 │   │                                                                                      │
│   346 │   │   @instrument_w_nvtx                                                                 │
│   347 │   │   def _pre_forward_module_hook(module, *args):                                       │
│ ❱ 348 │   │   │   self.pre_sub_module_forward_function(module)                                   │
│   349 │   │                                                                                      │
│   350 │   │   @instrument_w_nvtx                                                                 │
│   351 │   │   def _post_forward_module_hook(module, input, output):                              │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in        │
│ decorate_context                                                                                 │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload │
│ .py:478 in pre_sub_module_forward_function                                                       │
│                                                                                                  │
│   475 │   │   param_coordinator.trace_prologue(sub_module)                                       │
│   476 │   │   if param_coordinator.is_record_trace():                                            │
│   477 │   │   │   param_coordinator.record_module(sub_module)                                    │
│ ❱ 478 │   │   param_coordinator.fetch_sub_module(sub_module)                                     │
│   479 │   │                                                                                      │
│   480 │   │   see_memory_usage(                                                                  │
│   481 │   │   │   f"Before sub module function {sub_module.__class__.__name__} after fetch",     │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:9 in wrapped_fn  │
│                                                                                                  │
│    6 │   function call."""                                                                       │
│    7 │   def wrapped_fn(*args, **kwargs):                                                        │
│    8 │   │   get_accelerator().range_push(func.__qualname__)                                     │
│ ❱  9 │   │   ret_val = func(*args, **kwargs)                                                     │
│   10 │   │   get_accelerator().range_pop()                                                       │
│   11 │   │   return ret_val                                                                      │
│   12                                                                                             │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in        │
│ decorate_context                                                                                 │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param │
│ _coordinator.py:349 in fetch_sub_module                                                          │
│                                                                                                  │
│   346 │   │   │   │                                                                              │
│   347 │   │   │   │   for param in params_to_prefetch:                                           │
│   348 │   │   │   │   │   debug_rank0(f"-prefetch: {param.ds_summary()}")                        │
│ ❱ 349 │   │   │   │   self.__all_gather_params(params_to_prefetch)                               │
│   350 │   │   │   │                                                                              │
│   351 │   │   │   │   if self.__prefetch_nvme:                                                   │
│   352 │   │   │   │   │   self.__prefetch_nvme_param_partitions()                                │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:9 in wrapped_fn  │
│                                                                                                  │
│    6 │   function call."""                                                                       │
│    7 │   def wrapped_fn(*args, **kwargs):                                                        │
│    8 │   │   get_accelerator().range_push(func.__qualname__)                                     │
│ ❱  9 │   │   ret_val = func(*args, **kwargs)                                                     │
│   10 │   │   get_accelerator().range_pop()                                                       │
│   11 │   │   return ret_val                                                                      │
│   12                                                                                             │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param │
│ _coordinator.py:399 in __all_gather_params                                                       │
│                                                                                                  │
│   396 │   │                                                                                      │
│   397 │   │   if partitioned_params:                                                             │
│   398 │   │   │   with get_accelerator().stream(self.__allgather_stream):                        │
│ ❱ 399 │   │   │   │   handle = partitioned_params[0].all_gather_coalesced(partitioned_params)    │
│   400 │   │   │                                                                                  │
│   401 │   │   │   for param in partitioned_params:                                               │
│   402 │   │   │   │   assert param.ds_status == ZeroParamStatus.INFLIGHT, param.ds_summary()     │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:9 in wrapped_fn  │
│                                                                                                  │
│    6 │   function call."""                                                                       │
│    7 │   def wrapped_fn(*args, **kwargs):                                                        │
│    8 │   │   get_accelerator().range_push(func.__qualname__)                                     │
│ ❱  9 │   │   ret_val = func(*args, **kwargs)                                                     │
│   10 │   │   get_accelerator().range_pop()                                                       │
│   11 │   │   return ret_val                                                                      │
│   12                                                                                             │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_paramet │
│ ers.py:876 in all_gather_coalesced                                                               │
│                                                                                                  │
│    873 │   │   │   │   │   for p in params                                                       │
│    874 │   │   │   │   ],                                                                        │
│    875 │   │   │   │   │   │   │   │   │   │   │    out=partitions[self.rank])                   │
│ ❱  876 │   │   │   │   handle = _dist_allgather_fn(partitions[self.rank],                        │
│    877 │   │   │   │   │   │   │   │   │   │   │   flat_tensor,                                  │
│    878 │   │   │   │   │   │   │   │   │   │   │   self.ds_process_group)                        │
│    879                                                                                           │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_paramet │
│ ers.py:43 in _dist_allgather_fn                                                                  │
│                                                                                                  │
│     40                                                                                           │
│     41                                                                                           │
│     42 def _dist_allgather_fn(input_tensor: Tensor, output_tensor: Tensor, group=None):          │
│ ❱   43 │   return instrument_w_nvtx(dist.allgather_fn)(output_tensor,                            │
│     44 │   │   │   │   │   │   │   │   │   │   │   │   input_tensor,                             │
│     45 │   │   │   │   │   │   │   │   │   │   │   │   group=group,                              │
│     46 │   │   │   │   │   │   │   │   │   │   │   │   async_op=True)                            │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:9 in wrapped_fn  │
│                                                                                                  │
│    6 │   function call."""                                                                       │
│    7 │   def wrapped_fn(*args, **kwargs):                                                        │
│    8 │   │   get_accelerator().range_push(func.__qualname__)                                     │
│ ❱  9 │   │   ret_val = func(*args, **kwargs)                                                     │
│   10 │   │   get_accelerator().range_pop()                                                       │
│   11 │   │   return ret_val                                                                      │
│   12                                                                                             │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/comm/comm.py:340 in            │
│ allgather_fn                                                                                     │
│                                                                                                  │
│   337 │   global has_warned_all_gather                                                           │
│   338 │   assert cdb is not None and cdb.is_initialized(), 'DeepSpeed backend not set, please    │
│   339 │   if cdb.has_allgather_base:                                                             │
│ ❱ 340 │   │   return all_gather_base(output_tensor,                                              │
│   341 │   │   │   │   │   │   │      input_tensor,                                               │
│   342 │   │   │   │   │   │   │      group=group,                                                │
│   343 │   │   │   │   │   │   │      async_op=async_op,                                          │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/comm/comm.py:127 in            │
│ log_wrapper                                                                                      │
│                                                                                                  │
│   124 │   │   │   │   timers(log_name).start()                                                   │
│   125 │   │   # Return the op, then stop the op's timer                                          │
│   126 │   │   try:                                                                               │
│ ❱ 127 │   │   │   return func(*args, **kwargs)                                                   │
│   128 │   │   finally:                                                                           │
│   129 │   │   │   if comms_logger.enabled:                                                       │
│   130 │   │   │   │   # Need to make op blocking for accurate logging                            │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/comm/comm.py:318 in            │
│ all_gather_base                                                                                  │
│                                                                                                  │
│   315 │   │   │   │   │   log_name='all_gather_base',                                            │
│   316 │   │   │   │   │   debug=get_caller_func()):                                              │
│   317 │   global cdb                                                                             │
│ ❱ 318 │   return cdb.all_gather_base(output_tensor=output_tensor,                                │
│   319 │   │   │   │   │   │   │      input_tensor=tensor,                                        │
│   320 │   │   │   │   │   │   │      group=group,                                                │
│   321 │   │   │   │   │   │   │      async_op=async_op)                                          │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/deepspeed/comm/torch.py:83 in            │
│ all_gather_base                                                                                  │
│                                                                                                  │
│    80 │                                                                                          │
│    81 │   def all_gather_base(self, output_tensor, input_tensor, group=None, async_op=False):    │
│    82 │   │   if self.has_allgather_base:                                                        │
│ ❱  83 │   │   │   return torch.distributed.distributed_c10d._all_gather_base(                    │
│    84 │   │   │   │   output_tensor=output_tensor,                                               │
│    85 │   │   │   │   input_tensor=input_tensor,                                                 │
│    86 │   │   │   │   group=group,                                                               │
│                                                                                                  │
│ /home/mila/g/gagnonju/.main/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:21 │
│ 36 in _all_gather_base                                                                           │
│                                                                                                  │
│   2133 │   │   default_pg = _get_default_group()                                                 │
│   2134 │   │   work = default_pg._allgather_base(output_tensor, input_tensor)                    │
│   2135 │   else:                                                                                 │
│ ❱ 2136 │   │   work = group._allgather_base(output_tensor, input_tensor)                         │
│   2137 │                                                                                         │
│   2138 │   if async_op:                                                                          │
│   2139 │   │   return work                                                                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 1800000 ms
[22:13:22] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 622774) of binary: /home/mila/g/gagnonju/.main/bin/python

muellerzr · February 21, 2023, 11:54am

cc @smangrul

smangrul · February 21, 2023, 1:03pm

Add synced_gpus=True to the model.generate() parameters. For more details, check the DeepSpeed section of the Launch Configuration tab in the interactive example code explorer tool: Learning how to incorporate Accelerate features quickly! (huggingface.co)
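As a minimal sketch of that suggestion, the helper below builds the keyword arguments for model.generate() and turns on synced_gpus only when ZeRO stage 3 is in use. `synced_gpus` and `max_new_tokens` are real generate() parameters; the helper name `generation_kwargs` and its `zero_stage` argument are hypothetical and exist only for this example.

```python
def generation_kwargs(max_new_tokens, zero_stage):
    """Build keyword arguments for model.generate() (hypothetical helper)."""
    kwargs = {"max_new_tokens": max_new_tokens}
    # ZeRO stage 3 shards parameters across ranks, so all ranks must keep
    # stepping through generate() together until the slowest one finishes.
    if zero_stage == 3:
        kwargs["synced_gpus"] = True
    return kwargs


kwargs = generation_kwargs(max_new_tokens=100, zero_stage=3)
print(kwargs)

# Intended use inside the script from the traceback (not executed here):
#     with torch.no_grad():
#         output = model.generate(**batch, **kwargs)
```

With synced_gpus=True, a rank that has already finished its own batch keeps running forward passes until every rank is done, which costs some extra compute but keeps the collectives aligned.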

samar-inception · November 7, 2024, 10:59pm

Hello, just wanted to follow up here. I’m seeing an error with ZeRO-3 when I use max_length but not when I use max_new_tokens. I’m using transformers==4.46.2, accelerate==1.1.1, and deepspeed==0.15.3. The error is that at the final generation step, the input_ids are of length max_len, but the attention mask and cache position are of length max_len + 1. I don’t see this error with max_new_tokens, and toggling synced_gpus doesn’t make a difference. Do you know what may be happening?
