I'm running into an issue where the subprocess running on my second node is timing out after 900 seconds, and I'm trying to figure out how to debug the processes launched by pdsh on the remote node. I have found some information on how to attach pdb to remote processes that have already launched, but I'm wondering if anyone working on accelerate has a way they prefer, rather than having to work through half a dozen examples dug up from StackOverflow until I find one that works.
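For context, the closest thing I've dug up so far is dropping a `remote_pdb` breakpoint into the script on the second node and attaching to it over TCP. This is only a sketch of that approach (it assumes the third-party `remote_pdb` package is installed in the remote environment and that the chosen port is free), and I'm hoping there's something less ad hoc:

```python
# Sketch: attach pdb to a process launched by pdsh on the remote node, using
# the third-party remote_pdb package (assumed to be installed on that node).
from remote_pdb import RemotePdb

def training_step(batch):  # hypothetical place to break in the training script
    # Opens a listening socket on the remote node and blocks until something
    # attaches, e.g. `telnet <second-node-ip> 4444` run from the main node.
    # Only do this on one rank (or give each rank its own port), since a
    # given port can only be bound once.
    RemotePdb("0.0.0.0", 4444).set_trace()
    ...
```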
Here is the issue I opened (12:56PM - 25 Feb 23 UTC):
### System Info
```Shell
- `Accelerate` version: 0.16.0
- Platform: Linux-5.14.0-1051-oem-x86_64-with-glibc2.31
- Python version: 3.10.9
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.13.1 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- dynamo_backend: NO
- num_processes: 2
- machine_rank: 0
- num_machines: 2
- main_process_ip: 192.168.5.5
- main_process_port: 2333
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'deepspeed_multinode_launcher': 'standard', 'gradient_accumulation_steps': 0, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
### Reproduction
Steps to reproduce the behavior:
1. Buy two Dell workstations with a single A6000 each
2. Enable passwordless ssh between the two
3. Install miniconda on both machines and create mirrored virtual environments on both
4. Clone [peft](https://github.com/huggingface/peft)
5. Run `accelerate config --config_file <config_out.yaml>` on the main node
6. scp this configuration onto the second node
7. Run `accelerate launch --config_file <config_out.yaml> examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py` (see the connectivity check sketched after this list)
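One extra thing I've been doing between steps 6 and 7, mostly to rule out basic networking problems, is checking from the second node that `main_process_ip:main_process_port` from the config above is actually reachable while the main node is sitting in `accelerate launch`. This is just a standard-library sketch with my values hard-coded, nothing accelerate-specific:

```python
# Quick TCP reachability check, run from the second node while the main node
# is waiting in `accelerate launch`. Values are taken from my config above.
import socket

MAIN_PROCESS_IP = "192.168.5.5"  # main_process_ip
MAIN_PROCESS_PORT = 2333         # main_process_port

def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    ok = port_reachable(MAIN_PROCESS_IP, MAIN_PROCESS_PORT)
    print(f"{MAIN_PROCESS_IP}:{MAIN_PROCESS_PORT} reachable: {ok}")
```

It only tells me the rendezvous port is open, not that the launch itself is healthy, but it's a cheap thing to rule out first.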
### Expected behavior
```Shell
The process on the second node is not killed by launch.py after 15 minutes
```
Seems to be related to this issue (opened 01:22PM - 31 Dec 21 UTC, closed 12:31PM - 11 Jan 22 UTC):
Hi everyone,
we run into a timeout when we evaluate for more than 30 minutes on a single GPU. Is there a way to tell the other GPU to wait until the main GPU completes the evaluation?
```
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802157 milliseconds before timing out.
Traceback (most recent call last):
  File "scripts/training_test.py", line 172, in <module>
    main(args)
  File "scripts/training_test.py", line 167, in main
    train(args.config)
  File "scripts/training_test.py", line 148, in train
    trainer.train_pipeline()
  File "/home/azureuser/mytrainer.py", line 182, in train_pipeline
    for step, batch in enumerate(pbar):
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/tqdm/std.py", line 1168, in __iter__
    for obj in iterable:
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/accelerate/data_loader.py", line 301, in __iter__
    synchronize_rng_states(self.rng_types, self.generator)
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/accelerate/utils.py", line 110, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/accelerate/utils.py", line 105, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
(a second rank printed an identical traceback interleaved with this one)
^MEvaluating ... : 47%|████████████████████████████ | 2550/5411 [30:02<29:18, 1.63it/s][E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802593 milliseconds before timing out.
^MEvaluating ... : 47%|████████████████████████████ | 2551/5411 [30:02<29:01, 1.64it/s][E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802971 milliseconds before timing out.
```
@sgugger Can you please have a look?
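For what it's worth, the knob I've seen suggested for that NCCL watchdog timeout (which is separate from the 900-second kill I'm hitting on the second node) is to pass a longer timeout to the process group through `InitProcessGroupKwargs`. A minimal sketch, assuming the training script constructs the `Accelerator` itself and that two hours is long enough for the evaluation:

```python
# Sketch: raise the distributed process-group timeout so a long, uneven phase
# (e.g. one rank evaluating for >30 minutes) doesn't trip the NCCL watchdog.
# The two-hour value is arbitrary and just for illustration.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)
```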
It's been a couple of weeks since I raised the issue and I haven't heard anything back, so I wondered if anyone here might be able to help.