I followed this blog post to train an Informer model for Multivariate Probabilistic Time Series Forecasting:
The code works, but although it makes use of the “Accelerate” library, it trains on only one GPU by default.
I would like to execute the training on a node with 8 GPUs. Could someone share how to accomplish this?
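For context, my understanding is that the multi-GPU run itself should only need the usual Accelerate training pattern plus a launch command along the lines of accelerate launch --multi_gpu --num_processes 8 junos.py. Below is a minimal sketch of that pattern (placeholder names, not my exact script; in my case train_dataloader comes from gluonts rather than torch.utils.data):

from accelerate import Accelerator


def train(model, optimizer, train_dataloader, num_epochs):
    """Minimal Accelerate loop; all arguments are placeholders."""
    accelerator = Accelerator()

    # Accelerate wraps the model in DistributedDataParallel, moves it to the
    # right GPU and, for a regular torch DataLoader, shards batches across processes.
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    model.train()
    for _ in range(num_epochs):
        for batch in train_dataloader:
            optimizer.zero_grad()
            # Assumes the batch contains future_values, so the model returns a loss
            loss = model(**batch).loss
            accelerator.backward(loss)  # instead of loss.backward()
            optimizer.step()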
If I execute accelerate config to enable DeepSpeed, this is how my configuration looks:
ValueError: When using DeepSpeed `accelerate.prepare()` requires you to pass at least one of training or evaluation dataloaders or alternatively set an integer value in `train_micro_batch_size_per_gpu` in the deepspeed config fileor assign integer value to `AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu']`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2754113) of binary: /opt/miniconda/envs/TS/bin/python
On the other hand, if I execute accelerate config and choose not to use “DeepSpeed”, “FullyShardedDataParallel”, or “Megatron-LM”, this is what the configuration looks like:
When I then launch the training, I get the following error:
Traceback (most recent call last):
File "/home/jimenezr/coding/time-series/junos.py", line 418, in <module>
main()
File "/home/jimenezr/coding/time-series/junos.py", line 392, in main
outputs = model(
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 1884, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 1734, in forward
decoder_outputs = self.decoder(
^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 1459, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 855, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 662, in forward
context[dim_for_slice, top_u_sparsity_measurement, :] = attn_output
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Half for the source.
Maybe @kashif knows about this?
Thank you very much for your help.
Thanks for the report @zequeiraj. I have not trained the time-series models on a multi-GPU setup, so I would need to find some resources to be able to debug the issue… let me look into it and get back!
I executed accelerate config and did not select fp16 or bf16, so I guess I’m using FP32 by default.
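In code terms, I believe that is equivalent to creating the accelerator without mixed precision, roughly:

from accelerate import Accelerator

# "no" is what `accelerate config` records when neither fp16 nor bf16 is
# selected, so the forward/backward passes should run entirely in FP32.
accelerator = Accelerator(mixed_precision="no")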
When I launch the script, the training now at least starts, as you can see below, but I immediately get the following error:
Epoch 0: : 1batch [00:02, 2.70s/batch, loss_per_batch=74.8, loss_per_epoch=74.8]
Traceback (most recent call last):
File "/home/jimenezr/coding/time-series/junos.py", line 510, in <module>
main()
File "/home/jimenezr/coding/time-series/junos.py", line 463, in main
loss = model(
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 103 104 105 106 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
I followed the recommendations from this issue:
and used:
from accelerate import Accelerator, DistributedDataParallelKwargs

# Let DDP tolerate parameters that receive no gradient in a given forward pass
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
The training now runs with the following warnings:
Epoch 0: : 0batch [00:00, ?batch/s][W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
but after a while I get the following error:
Epoch 0: : 70batch [00:47, 1.48batch/s, loss_per_batch=14.6, loss_per_epoch=25.1]
Traceback (most recent call last):
File "/home/jimenezr/coding/time-series/junos.py", line 510, in <module>
main()
File "/home/jimenezr/coding/time-series/junos.py", line 463, in main
loss = model(
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 5: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630052 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630053 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630054 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630055 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630056 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630058 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630059 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 2630057) of binary: /opt/miniconda/envs/TS/bin/python
Traceback (most recent call last):
File "/opt/miniconda/envs/TS/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
junos.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-15_17:40:14
host : localhost
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 2630057)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The point at which the error occurs is somewhat random. Sometimes the training runs for a full epoch and the error only appears at some point during the second epoch (I have set num_batches_per_epoch=1500); other times, as you can see above, it happens near the beginning of training, e.g. at epoch 0, batch 70.
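If it helps with debugging, I can also enable the extra output mentioned in the error, e.g. by setting the environment variable at the very top of junos.py (or in the shell before accelerate launch):

import os

# Ask DDP to report exactly which parameters did not receive gradients;
# must be set before torch.distributed / the Accelerator is initialized.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"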
I hope you can help with this issue.
Thank you very much.
I would also like to come back to the question about DeepSpeed from my initial post. When I enable DeepSpeed through accelerate config, I get the following error:
ValueError: When using DeepSpeed `accelerate.prepare()` requires you to pass at least one of training or evaluation dataloaders or alternatively set an integer value in `train_micro_batch_size_per_gpu` in the deepspeed config fileor assign integer value to `AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu']`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2754113) of binary: /opt/miniconda/envs/TS/bin/python
It looks to me like accelerate.prepare() expects train_dataloader to be a PyTorch DataLoader, whereas in this case, following the blog post, train_dataloader is of type <gluonts.itertools.IterableSlice object at 0x2bb4a8ed0>.
Is there a way to get train_dataloader as a PyTorch DataLoader so that DeepSpeed can be used?
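For what it's worth, the two workarounds I can think of are the one the error message itself suggests (setting train_micro_batch_size_per_gpu explicitly) and wrapping the gluonts iterator in a real DataLoader. A rough sketch of both, where my_batch_size and WrappedIterable are placeholders on my side, not from the blog post; I am not sure either is the right approach:

from accelerate.state import AcceleratorState
from torch.utils.data import DataLoader, IterableDataset

# (a) Tell the DeepSpeed plugin the per-GPU micro batch size explicitly,
#     after creating the Accelerator but before calling prepare().
AcceleratorState().deepspeed_plugin.deepspeed_config[
    "train_micro_batch_size_per_gpu"
] = my_batch_size  # placeholder: the batch size used when building the gluonts loader


# (b) Wrap the gluonts IterableSlice so it can be handed to a PyTorch DataLoader.
class WrappedIterable(IterableDataset):
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        return iter(self.iterable)


# batch_size=None because the gluonts loader already yields complete batches;
# DeepSpeed may then still need (a), since it cannot infer a micro batch size
# from batch_size=None.
train_dataloader = DataLoader(WrappedIterable(train_dataloader), batch_size=None)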