I followed this blog post to train an Informer model for Multivariate Probabilistic Time Series Forecasting:
The code works, but although it makes use of the “Accelerate” library, it trains on only one GPU by default.
I would like to execute the training on a node with 8 GPUs. Could someone share how to accomplish this?
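For context, my understanding is that the multi-GPU run itself should only need the usual Accelerate training pattern plus a launch command along the lines of accelerate launch --multi_gpu --num_processes 8 junos.py. Below is a minimal sketch of that pattern (placeholder names, not my exact script; in my case train_dataloader comes from gluonts rather than torch.utils.data):

from accelerate import Accelerator


def train(model, optimizer, train_dataloader, num_epochs):
    """Minimal Accelerate loop; all arguments are placeholders."""
    accelerator = Accelerator()

    # Accelerate wraps the model in DistributedDataParallel, moves it to the
    # right GPU and, for a regular torch DataLoader, shards batches across processes.
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    model.train()
    for _ in range(num_epochs):
        for batch in train_dataloader:
            optimizer.zero_grad()
            # Assumes the batch contains future_values, so the model returns a loss
            loss = model(**batch).loss
            accelerator.backward(loss)  # instead of loss.backward()
            optimizer.step()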
If I execute accelerate config to enable DeepSpeed, this is how my configuration looks:
ValueError: When using DeepSpeed `accelerate.prepare()` requires you to pass at least one of training or evaluation dataloaders or alternatively set an integer value in `train_micro_batch_size_per_gpu` in the deepspeed config fileor assign integer value to `AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu']`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2754113) of binary: /opt/miniconda/envs/TS/bin/python
On the other hand, if I execute accelerate config and choose not to use “DeepSpeed”, “FullyShardedDataParallel”, or “Megatron-LM”, this is what the configuration looks like:
When I then launch the training, I get the following error:
Traceback (most recent call last):
File "/home/jimenezr/coding/time-series/junos.py", line 418, in <module>
main()
File "/home/jimenezr/coding/time-series/junos.py", line 392, in main
outputs = model(
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 1884, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 1734, in forward
decoder_outputs = self.decoder(
^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 1459, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 855, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/transformers/models/informer/modeling_informer.py", line 662, in forward
context[dim_for_slice, top_u_sparsity_measurement, :] = attn_output
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and Half for the source.
Maybe @kashif knows about this?
Thank you very much for your help.
Thanks for the report @zequeiraj. I have not trained the time-series models on a multi-GPU setup, so I would need to find some resources to be able to debug the issue… let me look into it and get back!
I executed accelerate config and did not select fp16 or bf16, so I guess I’m using FP32 by default.
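In code terms, I believe that is equivalent to creating the accelerator without mixed precision, roughly:

from accelerate import Accelerator

# "no" is what `accelerate config` records when neither fp16 nor bf16 is
# selected, so the forward/backward passes should run entirely in FP32.
accelerator = Accelerator(mixed_precision="no")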
When I launch the script, the training now at least starts, as you can see below, but I immediately get the following error:
Epoch 0: : 1batch [00:02, 2.70s/batch, loss_per_batch=74.8, loss_per_epoch=74.8]
Traceback (most recent call last):
File "/home/jimenezr/coding/time-series/junos.py", line 510, in <module>
main()
File "/home/jimenezr/coding/time-series/junos.py", line 463, in main
loss = model(
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 103 104 105 106 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
I followed the recommendations from this issue:
and used:
from accelerate import Accelerator, DistributedDataParallelKwargs

# Let DDP tolerate parameters that receive no gradient in a given forward pass
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
The training now runs with the following warnings:
Epoch 0: : 0batch [00:00, ?batch/s][W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
but after a while I get the following error:
Epoch 0: : 70batch [00:47, 1.48batch/s, loss_per_batch=14.6, loss_per_epoch=25.1]
Traceback (most recent call last):
File "/home/jimenezr/coding/time-series/junos.py", line 510, in <module>
main()
File "/home/jimenezr/coding/time-series/junos.py", line 463, in main
loss = model(
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 5: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630052 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630053 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630054 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630055 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630056 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630058 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2630059 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 5 (pid: 2630057) of binary: /opt/miniconda/envs/TS/bin/python
Traceback (most recent call last):
File "/opt/miniconda/envs/TS/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/TS/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
junos.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-15_17:40:14
host : localhost
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 2630057)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The point at which the error occurs is somewhat random. Sometimes the training runs for a full epoch and the error only appears at some point during the second epoch (I have set num_batches_per_epoch=1500); other times, as you can see above, it happens near the beginning of training, e.g. at epoch 0, batch 70.
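If it helps with debugging, I can also enable the extra output mentioned in the error, e.g. by setting the environment variable at the very top of junos.py (or in the shell before accelerate launch):

import os

# Ask DDP to report exactly which parameters did not receive gradients;
# must be set before torch.distributed / the Accelerator is initialized.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"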
I hope you can help with this issue.
Thank you very much.
I would also like to come back to the question about DeepSpeed from my initial post. When I enable DeepSpeed through accelerate config, I get the following error:
ValueError: When using DeepSpeed `accelerate.prepare()` requires you to pass at least one of training or evaluation dataloaders or alternatively set an integer value in `train_micro_batch_size_per_gpu` in the deepspeed config fileor assign integer value to `AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu']`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2754113) of binary: /opt/miniconda/envs/TS/bin/python
It looks to me like accelerate.prepare() expects train_dataloader to be a PyTorch DataLoader, whereas in this case, following the blog post, train_dataloader is of type <gluonts.itertools.IterableSlice object at 0x2bb4a8ed0>.
Is there a way to get train_dataloader as a PyTorch DataLoader so that DeepSpeed can be used?
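For what it's worth, the two workarounds I can think of are the one the error message itself suggests (setting train_micro_batch_size_per_gpu explicitly) and wrapping the gluonts iterator in a real DataLoader. A rough sketch of both, where my_batch_size and WrappedIterable are placeholders on my side, not from the blog post; I am not sure either is the right approach:

from accelerate.state import AcceleratorState
from torch.utils.data import DataLoader, IterableDataset

# (a) Tell the DeepSpeed plugin the per-GPU micro batch size explicitly,
#     after creating the Accelerator but before calling prepare().
AcceleratorState().deepspeed_plugin.deepspeed_config[
    "train_micro_batch_size_per_gpu"
] = my_batch_size  # placeholder: the batch size used when building the gluonts loader


# (b) Wrap the gluonts IterableSlice so it can be handed to a PyTorch DataLoader.
class WrappedIterable(IterableDataset):
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        return iter(self.iterable)


# batch_size=None because the gluonts loader already yields complete batches;
# DeepSpeed may then still need (a), since it cannot infer a micro batch size
# from batch_size=None.
train_dataloader = DataLoader(WrappedIterable(train_dataloader), batch_size=None)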